OpenAI Whisper: The Ultimate Tool for Audio Transcription and Sentiment Analysis
Dec 11, 2024
Explore the capabilities of OpenAI Whisper, the ultimate tool for audio transcription. In this video, we'll use Python, Whisper, and OpenAI's powerful GPT models to transcribe a Microsoft earnings call, then summarize the call and determine its sentiment.
I'll take you step-by-step through the process of using Whisper for audio transcription and sentiment analysis and show you how it can save you time and improve accuracy. With Whisper, you can quickly and easily transcribe any audio file and analyze its sentiment using OpenAI's language models.
This video is a must-watch if you want to improve your audio transcription and sentiment analysis skills. Whether you're a data analyst, researcher, or just interested in the power of AI, you'll learn a lot from this demonstration of OpenAI Whisper. So, sit back, relax, and let's explore the amazing capabilities of OpenAI Whisper together!
Subscribe for more: https://bit.ly/3lLybeP
Follow along: https://analyzingalpha.com/openai-whisper-python-tutorial
Video Chapters
0:00 Intro
0:20 Getting Started with OpenAI Whisper
0:24 Installing OpenAI Whisper
0:44 OpenAI Whisper Model Types
1:01 Downloading the Kaggle WAV File
1:39 Transcribing Audio using model.transcribe
3:53 Detecting Audio Language using Mel Spectrogram
6:00 Earnings Call Mini Project
8:00 Transcribe Microsoft Earnings Call using OpenAI Whisper
8:51 Summarize Microsoft Earnings Whisper Transcription using OpenAI GPT-3
Video Transcript
Have you ever wanted to be able to transcribe audio for free, quickly and easily? Today you're in luck. In this tutorial, you're going to learn how to use OpenAI's Whisper to transcribe a Microsoft earnings call, and we're also going to summarize that call and analyze its sentiment. If this is something you're interested in, please hit like and subscribe, and let's get coding.
Let's start by getting OpenAI's Whisper installed. It's very easy to do: we'll prepend a bang in front of pip so that pip runs in our virtual environment, and we'll upgrade any existing packages while installing openai-whisper. I'm not going to run this because I've already got it installed. Loading the model is really easy too; the most challenging thing is deciding which model you want. Here you can see all of the various model types. I'll just use base, but if you have something more complicated or want more accuracy, you can use a larger model. We'll import whisper, and then model = whisper.load_model("base"). You'll notice I don't have an API key, because everything is running locally. However, OpenAI released the Whisper API endpoint yesterday, if that's something you would rather use, but I'd rather keep the audio here and do it for free.
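A sketch of the two cells described here (the PyPI package is openai-whisper; the bang runs pip through the shell):

```python
# The bang runs pip through the shell, inside the notebook's
# virtual environment; -U upgrades anything already installed.
!pip install -U openai-whisper

import whisper

# "base" is a good default; swap in "small", "medium", or "large"
# for more accuracy at the cost of speed.
model = whisper.load_model("base")
```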
Now, with our Whisper model selected, we can transcribe audio. So that we're all on the same page, we'll use a sample audio file for speech recognition from Kaggle; I'll leave a link in the description below. I've already got it downloaded, and transcribing it is very easy: you just use the transcribe method. result = model.transcribe("harvard.wav"), which is the name of the file, and I'll pass True to verbose so we can see the output, then print result["text"]. And we'll see here I made a typo in "transcribe". Fixed, and you can see that our audio is transcribed. It's that easy.
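The transcription cell, roughly as typed, reusing the model from the cell above and assuming harvard.wav sits next to the notebook:

```python
# verbose=True streams each segment to stdout as it's decoded.
result = model.transcribe("harvard.wav", verbose=True)
print(result["text"])
```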
Now let's dig into what each one of these segments looks like. Type result["segments"] and you can see the id, start, seek, end, tokens, temperature, and all of the other information you might be curious about when analyzing segments. And if you want to iterate through these segments, it's very easy to do with enumerate: for i, seg in enumerate(result["segments"]), then print i + 1, so we're not starting at zero, and seg["text"]. I'm having a problem typing "enumerate" today; there we go. Now you can see that we have six segments, and here's the text. If you're not sure what enumerate does, it simply adds a counter to any iterable.
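The loop as dictated, numbering segments from one:

```python
# enumerate adds a counter; start printing at 1 instead of 0.
for i, seg in enumerate(result["segments"]):
    print(i + 1, seg["text"])
```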
Let's put this text into a DataFrame, because eventually we're going to want to apply a function across the rows to analyze the sentiment. So let's do that now. We'll first import pandas as pd, then speech = pd.DataFrame.from_dict(result["segments"]) to grab those segments. Let's just see what that looks like. Perfect: now we have our segments, with the id, seek, start, and everything we saw above as columns. That's how we'll prepare our data when doing sentiment analysis.
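A sketch of the DataFrame cell:

```python
import pandas as pd

# One row per segment: id, seek, start, end, text, tokens, and so on.
speech = pd.DataFrame.from_dict(result["segments"])
speech.head()
```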
But first, there's one final thing I want to cover before we jump into our mini project, and that's language detection. Detecting language is super simple: all we need to do is load the audio, trim it, and then use something called a Mel spectrogram. Let's see how it's done. audio = whisper.load_audio("harvard.wav"), again the file from Kaggle that we loaded previously, then audio = whisper.pad_or_trim(audio) to prepare it. Now we need to use this thing called a Mel spectrogram: what we're doing is taking the audio and converting it into numbers, and a spectrogram is a way to do that. Then we move that spectrogram to the same device as our model; in my case that's CUDA. Let's walk through it: mel = whisper.log_mel_spectrogram(audio), passing in the audio that was just padded or trimmed, then .to(model.device). And now we can get the probabilities: throw away the first return value, so _, probs = model.detect_language(mel), passing in the Mel spectrogram. Run that, and assuming no errors, maybe the most challenging piece of code is actually sorting the result. Use sorted on probs, which is a dictionary: sort the items, with the key being a lambda, which creates an anonymous function, getting the values in reverse, because we want the highest probabilities first, and then slice the first 10. Now we can see that Whisper overwhelmingly believes harvard.wav is English, which is indeed true.
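Putting the language-detection steps together, reloading the model so the cell stands alone:

```python
import whisper

model = whisper.load_model("base")  # repeated so this cell stands alone

# Load the Kaggle sample and pad/trim it to Whisper's 30-second window.
audio = whisper.load_audio("harvard.wav")
audio = whisper.pad_or_trim(audio)

# Turn the waveform into a log-Mel spectrogram on the model's device
# (CUDA here, CPU otherwise).
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns (language_tokens, {language: probability});
# we only need the probabilities.
_, probs = model.detect_language(mel)

# Highest probabilities first; keep the top ten.
print(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:10])
```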
Now, with the basics out of the way, it's time to start our earnings call mini project. The first thing we need to do is get the Microsoft audio. I'll put the link in the description below, or you can just follow along with the Jupyter notebook that will be linked on the website. One of the challenges is that the earnings call doesn't start until roughly the 30-minute mark, so we're going to have to trim that and extract the audio from the video. Let's walk through this; I'll call this section the earnings call mini project.

All right, so the first thing we want to do is pip install pytube. I've already done that, so I'm going to comment it out. Then we'll need to import it: from pytube import YouTube. This is what will allow us to download the audio. We'll create a variable to store the YouTube URL, and then we can get the content using the YouTube object, passing in that URL. Now, recall that this is both video and audio, so what we actually want is the audio stream, because we're not going to be transcribing video, right? That doesn't make any sense. I'll call it audio_streams, because there are multiple streams: take the video content's streams and filter for audio only. Just so you understand what this looks like, loop over audio_streams and print each stream. There are actually multiple audio streams, three of them here; I'll just select the middle one, no reason specifically: audio_stream = audio_streams[1], then print it. There we go: now we have one audio stream in our audio_stream variable.
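A sketch of the pytube cells; the URL is a placeholder for the earnings call link in the description, and the final download line (needed so ffmpeg has a file to trim) is my addition:

```python
# !pip install pytube  # commented out, already installed

from pytube import YouTube

# Placeholder; use the earnings call link from the description.
youtube_video_url = "https://www.youtube.com/watch?v=..."

youtube_video_content = YouTube(youtube_video_url)

# Keep only the audio streams; we're not transcribing video.
audio_streams = youtube_video_content.streams.filter(only_audio=True)
for stream in audio_streams:
    print(stream)

# Three streams come back; take the middle one, no reason specifically.
audio_stream = audio_streams[1]
print(audio_stream)

# My addition: save the stream to disk so ffmpeg can trim it.
audio_stream.download(filename="earnings_call_microsoft_q4_2022.mp4")
```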
Now what we want to do is trim it. As you recall from earlier, there was about 30 minutes of introductory music. I'm going to use ffmpeg to do this. I want to check whether I've got ffmpeg installed, and I do: I'm running Ubuntu and I have ffmpeg. If you don't have it installed, there's a lot of great content on the internet to help you. Here's the command for ffmpeg. I'm not going to run this because it takes a long time, but if you want the exact command, here it is, and I'll also have these commands in the Jupyter notebook linked in the description.
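The exact command isn't read out on camera, so this is a representative invocation, assuming the call proper starts at the 30-minute mark and the illustrative filenames used here:

```python
# Representative only; not run in the video because it takes a while.
# -ss 00:30:00 skips the ~30 minutes of hold music, -vn drops the
# video track so only audio remains.
!ffmpeg -i earnings_call_microsoft_q4_2022.mp4 -ss 00:30:00 -vn \
    earnings_call_microsoft_q4_2022_filtered.mp4
```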
And now we're back to transcribing; we've seen this before. Import whisper and use the base model: model = whisper.load_model("base"). Technically we've already done that, so we don't need to do it again. Then result = model.transcribe on the trimmed earnings call file, earnings_call_microsoft_q4_2022_filtered.mp4. We want result["text"], and we'll just take the first 400 characters. Let's see if that works, and we see that it did. Now we can take this text and summarize it using OpenAI's GPT models.
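Roughly the transcription cell, with the filename as heard in the video; adjust it to wherever your ffmpeg output landed:

```python
import whisper

model = whisper.load_model("base")  # already loaded earlier in the notebook

# Transcribe the trimmed earnings call audio.
result = model.transcribe("earnings_call_microsoft_q4_2022_filtered.mp4")

# Peek at the first 400 characters.
print(result["text"][:400])
```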
I'll create a new section here called Summarize Whisper Transcription. If you haven't checked it out already, I've created a full-length YouTube tutorial, about an hour long, on all the GPT models and DALL-E using OpenAI; it can also be found at that link. But basically, if this is your first time using the GPT text models, you can install openai with pip install openai. I'm not going to run that because I've already got it installed. In this case we actually do need an API key. I'll import os, import openai, and import pandas as pd; I'm only doing that last one for people who skip directly to this section, as I already loaded it above. Now, to get the API key from the environment, we just use os.getenv("OPENAI_API_KEY") and then pass that into openai: openai.api_key = api_key. That's all you need to do for authentication.
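A sketch of the setup, assuming the key lives in the OPENAI_API_KEY environment variable and the pre-1.0 openai package this video uses:

```python
# !pip install openai  # already installed in the video

import os
import openai
import pandas as pd  # repeated for anyone skipping straight to this section

# Pre-1.0 openai package: authentication is a module-level attribute.
api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = api_key
```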
Summarizing text is super easy: all we need to do is pass the text into the create method of openai's Completion and give it the prompt to summarize, "Tl;dr" for short. So text = result["text"] sliced to 985 characters, then summary = openai.Completion.create. The model we'll use is text-davinci-003, which is their latest and greatest, but again, if you have questions about the various models, check out my prior tutorial. For the prompt we pass the text plus "too long; didn't read": a newline, another newline, and "Tl;dr". You can read more about that in the documentation if you want, but OpenAI's Davinci understands what it means. max_tokens = 200, because I don't want to pay a ton, and we don't need to get creative, so temperature, let me scroll down here so you can see it, equals zero. Now I'll print the text: print summary, choices, zero, text. Looks like I typed a dash where I meant an equals sign; fixed. Let's see if this summarizes it. This looks great: it looks like a wonderful summary. Obviously there are some mistakes, Satya Nadella's name is spelled incorrectly, but for summary purposes it's fantastic.
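The summarization cell, assuming the 985 is a character slice of the transcription:

```python
# Keep the prompt small; the 985-character slice mirrors what's typed
# in the video.
text = result["text"][:985]

summary = openai.Completion.create(
    model="text-davinci-003",   # legacy completions model used here
    prompt=text + "\n\nTl;dr",  # the "too long; didn't read" trick
    max_tokens=200,             # cap the spend
    temperature=0,              # deterministic; no creativity needed
)
print(summary["choices"][0]["text"])
```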
Now let's use OpenAI's GPT-3 models to analyze the Whisper audio segments. I'll create a new heading: Whisper Audio Sentiment Analysis Using GPT-3. What we're going to do is create a DataFrame and store those segments in it. I'll sample 20 of them, because I don't want to spend all my credits. So earnings_call_df = pd.DataFrame.from_dict(result["segments"]), which we saw earlier. I only want the text column, and then I'll sample 20 rows because I don't want all of the data: earnings_call_df = earnings_call_df.sample(n=20). Let's preview the results with earnings_call_df.head(). Now we see a DataFrame with the text and row id.
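A sketch of the sampling cell, assuming result still holds the earnings call transcription:

```python
import pandas as pd

# One row per Whisper segment, keeping only the text column, then a
# 20-row sample to limit API spend.
earnings_call_df = pd.DataFrame.from_dict(result["segments"])[["text"]]
earnings_call_df = earnings_call_df.sample(n=20)
earnings_call_df.head()
```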
Now let's create a function to analyze the sentiment across each row. I'm going to import the regular expression module, re, so we can remove special characters from the response. def get_sentiment(text): the prompt text is "Classify the sentiment of the following earnings call text as positive, negative, or neutral", followed by the text and "Sentiment:", formatted with the text. Now, all we need to do to get the sentiment of this text is pass it to OpenAI: sentiment = openai.Completion.create(model="text-davinci-003", prompt=prompt_text, max_tokens=15, temperature=0). Then we just want to remove those special characters with re.sub, keeping only the words, and return the sentiment.
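The function, roughly as dictated; the exact re.sub pattern isn't spelled out on camera, so stripping everything but letters is my assumption:

```python
import re
import openai

def get_sentiment(text):
    # Zero-shot classification prompt, formatted with the segment text.
    prompt_text = (
        "Classify the sentiment of the following earnings call text "
        "as positive, negative, or neutral.\n\n"
        "Text: {}\nSentiment:".format(text)
    )
    sentiment = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_text,
        max_tokens=15,
        temperature=0,
    )
    # Pattern is my assumption: keep letters only, dropping whitespace
    # and punctuation around the label.
    return re.sub(r"[^A-Za-z]", "", sentiment["choices"][0]["text"])
```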
Okay, now let's test it by creating two strings, one with positive sentiment and one with negative sentiment. text = "This year has been financially very profitable for us", then print(get_sentiment(text)). Next, text = "We lost 50 percent of our assets this year and we may be bankrupt"; that's not just negative, that's ultra negative. print(get_sentiment(text)). We should see positive and negative, or a typo. Let's go back up, fix that, and run it again: positive and negative. Fantastic.
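The two test cells:

```python
text = "This year has been financially very profitable for us."
print(get_sentiment(text))  # expected: Positive

text = "We lost 50 percent of our assets this year and we may be bankrupt."
print(get_sentiment(text))  # expected: Negative
```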
14:21
that we assume that our sentiment
14:23
function is working we can just apply it
14:25
across each row earnings call DF
14:29
sentiment equals earnings call DF next
14:34
reply get sentiment this simply applies
14:37
get sentiment for all of the text and
14:40
passes the result into sentiment and
14:42
let's preview the results earnings call
14:45
yeah head and it looks like that worked
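The apply step:

```python
# Classify every sampled segment and store the label in a new column.
earnings_call_df["sentiment"] = earnings_call_df["text"].apply(get_sentiment)
earnings_call_df.head()
```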
This is great, but it's hard to really understand what the sentiment looks like. We could count the numbers, but what I prefer is to plot them, so let's do that. We can type pip install seaborn; I already have it installed, so I'm not going to actually run it. I'll import seaborn as sns, which is conventional, then sns.set_style("darkgrid"), just setting it up to make it look a little nicer, sns.set(rc=...) for a bigger figure size, and sns.set_context("poster"). Now I'll print out the value counts and then plot it: print(earnings_call_df["sentiment"].value_counts()), then sns.countplot(x="sentiment", data=earnings_call_df). And now you can see that we have 15 neutral, four positive, and one negative. This might get a lot more interesting if we sampled more than 20, but you get the idea.
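A sketch of the plotting cells; the exact figure size isn't stated in the video, so (15, 7) is a guess:

```python
# !pip install seaborn  # already installed in the video

import seaborn as sns

sns.set_style("darkgrid")
sns.set(rc={"figure.figsize": (15, 7)})  # size is a guess
sns.set_context("poster")

# Count the labels, then plot the distribution.
print(earnings_call_df["sentiment"].value_counts())
sns.countplot(x="sentiment", data=earnings_call_df)
```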
But there's a problem with this. Have you figured it out yet? Sentences are arbitrarily cut, because the segments are fixed length. What we really want to do is split up our text on the periods rather than use the default transcription segments, and this is really easy to do. Type segments = result["text"].split on the period, and then we'll use a comprehension to add the period back to each segment. Let's see what the length of that is; obviously I misspelled "segments" there. print(len(segments)): we should have 312 of them. Let's see what that looks like with segments[:5] to take the first five, and now you can see that each sentence is now a segment.
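The sentence-splitting comprehension; the if clause is my addition to drop the empty piece left after the final period:

```python
# Split the full transcription into sentences on periods, re-appending
# the period to each piece.
segments = result["text"].split(".")
segments = [segment + "." for segment in segments if segment]

print(len(segments))  # 312 in the video
segments[:5]
```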
Now we're just going to run through the same process. We'll create a DataFrame for the new segments; I'll call it earnings_call_df_custom = pd.DataFrame(segments, columns=["text"]). Now we should see a row for each sentence, and this time why don't we sample, I don't know, 40. So earnings_call_df_custom equals... actually, you know what, I don't want to overwrite it here: earnings_call_df_custom = earnings_call_df_custom.sample(n=40). Then earnings_call_df_custom["sentiment"] = earnings_call_df_custom["text"].apply(get_sentiment) to apply our sentiment function, and earnings_call_df_custom.head() to see if it all got sentiment. We see that it did figure out the sentiment for all of that text. One more time: print(earnings_call_df_custom["sentiment"].value_counts()) to see what we're working with now, and we'll plot it with sns.countplot(x="sentiment", data=earnings_call_df_custom); I could have just copied and pasted from above, but that's okay. Let's see what we get: lots of neutral and mostly positive, so it appears that was a good earnings call for Microsoft.
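The sentence-level pass, reusing segments, get_sentiment, pandas, and seaborn from the cells above:

```python
# Same pipeline at sentence level: build the frame, sample 40,
# classify, count, plot.
earnings_call_df_custom = pd.DataFrame(segments, columns=["text"])
earnings_call_df_custom = earnings_call_df_custom.sample(n=40)
earnings_call_df_custom["sentiment"] = earnings_call_df_custom["text"].apply(
    get_sentiment
)
print(earnings_call_df_custom["sentiment"].value_counts())
sns.countplot(x="sentiment", data=earnings_call_df_custom)
```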
And that's it for this tutorial. Hopefully you enjoyed it. You're now able to take any audio, transcribe it, analyze the sentiment, and summarize it; you're pretty much a Whisper master. If you liked this video, please hit a thumbs up, and I'd love to see you in the next one. Thanks, bye.
#Programming
#Podcasts
#Intelligent Personal Assistants
#Other
