OpenAI Whisper: The Ultimate Tool for Audio Transcription and Sentiment Analysis
Dec 11, 2024
Explore the capabilities of OpenAI Whisper, the ultimate tool for audio transcription. In this video, we'll use Python, Whisper, and OpenAI's powerful GPT models to transcribe a Microsoft earnings call, then summarize the call and determine its sentiment.
I'll take you step-by-step through the process of using Whisper for audio transcription and sentiment analysis and show you how it can save you time and improve accuracy. With Whisper, you can quickly and easily transcribe any audio file and analyze its sentiment using OpenAI's language models.
This video is a must-watch if you want to improve your audio transcription and sentiment analysis skills. Whether you're a data analyst, researcher, or just interested in the power of AI, you'll learn a lot from this demonstration of OpenAI Whisper. So, sit back, relax, and let's explore the amazing capabilities of OpenAI Whisper together!
Subscribe for more: https://bit.ly/3lLybeP
Follow along: https://analyzingalpha.com/openai-whisper-python-tutorial
Video Chapters
0:00 Intro
0:20 Getting Started with OpenAI Whisper
0:24 Installing OpenAI Whisper
0:44 OpenAI Whisper Model Types
1:01 Downloading the Kaggle WAV File
1:39 Transcribing Audio using model.transcribe
3:53 Detecting Audio Language using Mel Spectrogram
6:00 Earnings Call Mini Project
8:00 Transcribe Microsoft Earnings Call using OpenAI Whisper
8:51 Summarize Microsoft Earnings Whisper Transcription using OpenAI GPT-3
Video Transcript
Have you ever wanted to be able to transcribe audio for free, quickly and easily? Today you're in luck. In this tutorial, you're going to learn how to use OpenAI's Whisper to transcribe a Microsoft earnings call, and we're also going to summarize that call and analyze its sentiment. If this is something you're interested in, please hit like and subscribe, and let's get coding.
Let's start by getting OpenAI's Whisper installed. It's very easy to do: we'll prepend a bang in front of pip so that pip runs in our virtual environment, and we'll upgrade any existing packages while installing openai-whisper. I'm not going to run this because I've already got it installed. Loading the model is really easy too; the most challenging thing is deciding which model you want. Here you can see all of the various model types. I'll just use base, but if you have something more complicated or want more accuracy, you can use a larger model. We'll import whisper, and then model = whisper.load_model("base"). You'll notice I don't have an API key, because everything is running locally. However, OpenAI released the Whisper API endpoint yesterday, if that's something you would rather use, but I'd rather keep the audio here and do it for free.
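A sketch of the two cells described here (the PyPI package is openai-whisper; the bang runs pip through the shell):

```python
# The bang runs pip through the shell, inside the notebook's
# virtual environment; -U upgrades anything already installed.
!pip install -U openai-whisper

import whisper

# "base" is a good default; swap in "small", "medium", or "large"
# for more accuracy at the cost of speed.
model = whisper.load_model("base")
```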
Now, with our Whisper model selected, we can transcribe audio. So that we're all on the same page, we'll use a sample audio file for speech recognition from Kaggle; I'll leave a link in the description below. I've already got it downloaded, and transcribing it is very easy: you just use the transcribe method. result = model.transcribe("harvard.wav"), which is the name of the file, and I'll pass True to verbose so we can see the output, then print result["text"]. And we'll see here I made a typo in "transcribe". Fixed, and you can see that our audio is transcribed. It's that easy.
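The transcription cell, roughly as typed, reusing the model from the cell above and assuming harvard.wav sits next to the notebook:

```python
# verbose=True streams each segment to stdout as it's decoded.
result = model.transcribe("harvard.wav", verbose=True)
print(result["text"])
```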
Now let's dig into what each one of these segments looks like. Type result["segments"] and you can see the id, start, seek, end, tokens, temperature, and all of the other information you might be curious about when analyzing segments. And if you want to iterate through these segments, it's very easy to do with enumerate: for i, seg in enumerate(result["segments"]), then print i + 1, so we're not starting at zero, and seg["text"]. I'm having a problem typing "enumerate" today; there we go. Now you can see that we have six segments, and here's the text. If you're not sure what enumerate does, it simply adds a counter to any iterable.
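The loop as dictated, numbering segments from one:

```python
# enumerate adds a counter; start printing at 1 instead of 0.
for i, seg in enumerate(result["segments"]):
    print(i + 1, seg["text"])
```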
Let's put this text into a DataFrame, because eventually we're going to want to apply a function across the rows to analyze the sentiment. So let's do that now. We'll first import pandas as pd, then speech = pd.DataFrame.from_dict(result["segments"]) to grab those segments. Let's just see what that looks like. Perfect: now we have our segments, with the id, seek, start, and everything we saw above as columns. That's how we'll prepare our data when doing sentiment analysis.
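A sketch of the DataFrame cell:

```python
import pandas as pd

# One row per segment: id, seek, start, end, text, tokens, and so on.
speech = pd.DataFrame.from_dict(result["segments"])
speech.head()
```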
But first, there's one final thing I want to cover before we jump into our mini project, and that's language detection. Detecting language is super simple: all we need to do is load the audio, trim it, and then use something called a Mel spectrogram. Let's see how it's done. audio = whisper.load_audio("harvard.wav"), again the file from Kaggle that we loaded previously, then audio = whisper.pad_or_trim(audio) to prepare it. Now we need to use this thing called a Mel spectrogram: what we're doing is taking the audio and converting it into numbers, and a spectrogram is a way to do that. Then we move that spectrogram to the same device as our model; in my case that's CUDA. Let's walk through it: mel = whisper.log_mel_spectrogram(audio), passing in the audio that was just padded or trimmed, then .to(model.device). And now we can get the probabilities: throw away the first return value, so _, probs = model.detect_language(mel), passing in the Mel spectrogram. Run that, and assuming no errors, maybe the most challenging piece of code is actually sorting the result. Use sorted on probs, which is a dictionary: sort the items, with the key being a lambda, which creates an anonymous function, getting the values in reverse, because we want the highest probabilities first, and then slice the first 10. Now we can see that Whisper overwhelmingly believes harvard.wav is English, which is indeed true.
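Putting the language-detection steps together, reloading the model so the cell stands alone:

```python
import whisper

model = whisper.load_model("base")  # repeated so this cell stands alone

# Load the Kaggle sample and pad/trim it to Whisper's 30-second window.
audio = whisper.load_audio("harvard.wav")
audio = whisper.pad_or_trim(audio)

# Turn the waveform into a log-Mel spectrogram on the model's device
# (CUDA here, CPU otherwise).
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns (language_tokens, {language: probability});
# we only need the probabilities.
_, probs = model.detect_language(mel)

# Highest probabilities first; keep the top ten.
print(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:10])
```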
Now, with the basics out of the way, it's time to start our earnings call mini project. The first thing we need to do is get the Microsoft audio. I'll put the link in the description below, or you can just follow along with the Jupyter notebook that will be linked on the website. One of the challenges is that the earnings call doesn't start until roughly the 30-minute mark, so we're going to have to trim that and extract the audio from the video. Let's walk through this; I'll call this section the earnings call mini project.

All right, so the first thing we want to do is pip install pytube. I've already done that, so I'm going to comment it out. Then we'll need to import it: from pytube import YouTube. This is what will allow us to download the audio. We'll create a variable to store the YouTube URL, and then we can get the content using the YouTube object, passing in that URL. Now, recall that this is both video and audio, so what we actually want is the audio stream, because we're not going to be transcribing video, right? That doesn't make any sense. I'll call it audio_streams, because there are multiple streams: take the video content's streams and filter for audio only. Just so you understand what this looks like, loop over audio_streams and print each stream. There are actually multiple audio streams, three of them here; I'll just select the middle one, no reason specifically: audio_stream = audio_streams[1], then print it. There we go: now we have one audio stream in our audio_stream variable.
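A sketch of the pytube cells; the URL is a placeholder for the earnings call link in the description, and the final download line (needed so ffmpeg has a file to trim) is my addition:

```python
# !pip install pytube  # commented out, already installed

from pytube import YouTube

# Placeholder; use the earnings call link from the description.
youtube_video_url = "https://www.youtube.com/watch?v=..."

youtube_video_content = YouTube(youtube_video_url)

# Keep only the audio streams; we're not transcribing video.
audio_streams = youtube_video_content.streams.filter(only_audio=True)
for stream in audio_streams:
    print(stream)

# Three streams come back; take the middle one, no reason specifically.
audio_stream = audio_streams[1]
print(audio_stream)

# My addition: save the stream to disk so ffmpeg can trim it.
audio_stream.download(filename="earnings_call_microsoft_q4_2022.mp4")
```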
Now what we want to do is trim it. As you recall from earlier, there was about 30 minutes of introductory music. I'm going to use ffmpeg to do this. I want to check whether I've got ffmpeg installed, and I do: I'm running Ubuntu and I have ffmpeg. If you don't have it installed, there's a lot of great content on the internet to help you. Here's the command for ffmpeg. I'm not going to run this because it takes a long time, but if you want the exact command, here it is, and I'll also have these commands in the Jupyter notebook linked in the description.
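The exact command isn't read out on camera, so this is a representative invocation, assuming the call proper starts at the 30-minute mark and the illustrative filenames used here:

```python
# Representative only; not run in the video because it takes a while.
# -ss 00:30:00 skips the ~30 minutes of hold music, -vn drops the
# video track so only audio remains.
!ffmpeg -i earnings_call_microsoft_q4_2022.mp4 -ss 00:30:00 -vn \
    earnings_call_microsoft_q4_2022_filtered.mp4
```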
And now we're back to transcribing; we've seen this before. Import whisper and use the base model: model = whisper.load_model("base"). Technically we've already done that, so we don't need to do it again. Then result = model.transcribe on the trimmed earnings call file, earnings_call_microsoft_q4_2022_filtered.mp4. We want result["text"], and we'll just take the first 400 characters. Let's see if that works, and we see that it did. Now we can take this text and summarize it using OpenAI's GPT models.
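Roughly the transcription cell, with the filename as heard in the video; adjust it to wherever your ffmpeg output landed:

```python
import whisper

model = whisper.load_model("base")  # already loaded earlier in the notebook

# Transcribe the trimmed earnings call audio.
result = model.transcribe("earnings_call_microsoft_q4_2022_filtered.mp4")

# Peek at the first 400 characters.
print(result["text"][:400])
```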
I'll create a new section here called Summarize Whisper Transcription. If you haven't checked it out already, I've created a full-length YouTube tutorial, about an hour long, on all the GPT models and DALL-E using OpenAI; it can also be found at that link. But basically, if this is your first time using the GPT text models, you can install openai with pip install openai. I'm not going to run that because I've already got it installed. In this case we actually do need an API key. I'll import os, import openai, and import pandas as pd; I'm only doing that last one for people who skip directly to this section, as I already loaded it above. Now, to get the API key from the environment, we just use os.getenv("OPENAI_API_KEY") and then pass that into openai: openai.api_key = api_key. That's all you need to do for authentication.
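A sketch of the setup, assuming the key lives in the OPENAI_API_KEY environment variable and the pre-1.0 openai package this video uses:

```python
# !pip install openai  # already installed in the video

import os
import openai
import pandas as pd  # repeated for anyone skipping straight to this section

# Pre-1.0 openai package: authentication is a module-level attribute.
api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = api_key
```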
Summarizing text is super easy: all we need to do is pass the text into the create method of openai's Completion and give it the prompt to summarize, "Tl;dr" for short. So text = result["text"] sliced to 985 characters, then summary = openai.Completion.create. The model we'll use is text-davinci-003, which is their latest and greatest, but again, if you have questions about the various models, check out my prior tutorial. For the prompt we pass the text plus "too long; didn't read": a newline, another newline, and "Tl;dr". You can read more about that in the documentation if you want, but OpenAI's Davinci understands what it means. max_tokens = 200, because I don't want to pay a ton, and we don't need to get creative, so temperature, let me scroll down here so you can see it, equals zero. Now I'll print the text: print summary, choices, zero, text. Looks like I typed a dash where I meant an equals sign; fixed. Let's see if this summarizes it. This looks great: it looks like a wonderful summary. Obviously there are some mistakes, Satya Nadella's name is spelled incorrectly, but for summary purposes it's fantastic.
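The summarization cell, assuming the 985 is a character slice of the transcription:

```python
# Keep the prompt small; the 985-character slice mirrors what's typed
# in the video.
text = result["text"][:985]

summary = openai.Completion.create(
    model="text-davinci-003",   # legacy completions model used here
    prompt=text + "\n\nTl;dr",  # the "too long; didn't read" trick
    max_tokens=200,             # cap the spend
    temperature=0,              # deterministic; no creativity needed
)
print(summary["choices"][0]["text"])
```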
Now let's use OpenAI's GPT-3 models to analyze the Whisper audio segments. I'll create a new heading: Whisper Audio Sentiment Analysis Using GPT-3. What we're going to do is create a DataFrame and store those segments in it. I'll sample 20 of them, because I don't want to spend all my credits. So earnings_call_df = pd.DataFrame.from_dict(result["segments"]), which we saw earlier. I only want the text column, and then I'll sample 20 rows because I don't want all of the data: earnings_call_df = earnings_call_df.sample(n=20). Let's preview the results with earnings_call_df.head(). Now we see a DataFrame with the text and row id.
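A sketch of the sampling cell, assuming result still holds the earnings call transcription:

```python
import pandas as pd

# One row per Whisper segment, keeping only the text column, then a
# 20-row sample to limit API spend.
earnings_call_df = pd.DataFrame.from_dict(result["segments"])[["text"]]
earnings_call_df = earnings_call_df.sample(n=20)
earnings_call_df.head()
```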
Now let's create a function to analyze the sentiment across each row. I'm going to import the regular expression module, re, so we can remove special characters from the response. def get_sentiment(text): the prompt text is "Classify the sentiment of the following earnings call text as positive, negative, or neutral", followed by the text and "Sentiment:", formatted with the text. Now, all we need to do to get the sentiment of this text is pass it to OpenAI: sentiment = openai.Completion.create(model="text-davinci-003", prompt=prompt_text, max_tokens=15, temperature=0). Then we just want to remove those special characters with re.sub, keeping only the words, and return the sentiment.
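The function, roughly as dictated; the exact re.sub pattern isn't spelled out on camera, so stripping everything but letters is my assumption:

```python
import re
import openai

def get_sentiment(text):
    # Zero-shot classification prompt, formatted with the segment text.
    prompt_text = (
        "Classify the sentiment of the following earnings call text "
        "as positive, negative, or neutral.\n\n"
        "Text: {}\nSentiment:".format(text)
    )
    sentiment = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_text,
        max_tokens=15,
        temperature=0,
    )
    # Pattern is my assumption: keep letters only, dropping whitespace
    # and punctuation around the label.
    return re.sub(r"[^A-Za-z]", "", sentiment["choices"][0]["text"])
```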
Okay, now let's test it by creating two strings, one with positive sentiment and one with negative sentiment. text = "This year has been financially very profitable for us", then print(get_sentiment(text)). Next, text = "We lost 50 percent of our assets this year and we may be bankrupt"; that's not just negative, that's ultra negative. print(get_sentiment(text)). We should see positive and negative, or a typo. Let's go back up, fix that, and run it again: positive and negative. Fantastic.
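The two test cells:

```python
text = "This year has been financially very profitable for us."
print(get_sentiment(text))  # expected: Positive

text = "We lost 50 percent of our assets this year and we may be bankrupt."
print(get_sentiment(text))  # expected: Negative
```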
14:21
that we assume that our sentiment
14:23
function is working we can just apply it
14:25
across each row earnings call DF
14:29
sentiment equals earnings call DF next
14:34
reply get sentiment this simply applies
14:37
get sentiment for all of the text and
14:40
passes the result into sentiment and
14:42
let's preview the results earnings call
14:45
yeah head and it looks like that worked
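The apply step:

```python
# Classify every sampled segment and store the label in a new column.
earnings_call_df["sentiment"] = earnings_call_df["text"].apply(get_sentiment)
earnings_call_df.head()
```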
This is great, but it's hard to really understand what the sentiment looks like. We could count the numbers, but what I prefer is to plot them, so let's do that. We can type pip install seaborn; I already have it installed, so I'm not going to actually run it. I'll import seaborn as sns, which is conventional, then sns.set_style("darkgrid"), just setting it up to make it look a little nicer, sns.set(rc=...) for a bigger figure size, and sns.set_context("poster"). Now I'll print out the value counts and then plot it: print(earnings_call_df["sentiment"].value_counts()), then sns.countplot(x="sentiment", data=earnings_call_df). And now you can see that we have 15 neutral, four positive, and one negative. This might get a lot more interesting if we sampled more than 20, but you get the idea.
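A sketch of the plotting cells; the exact figure size isn't stated in the video, so (15, 7) is a guess:

```python
# !pip install seaborn  # already installed in the video

import seaborn as sns

sns.set_style("darkgrid")
sns.set(rc={"figure.figsize": (15, 7)})  # size is a guess
sns.set_context("poster")

# Count the labels, then plot the distribution.
print(earnings_call_df["sentiment"].value_counts())
sns.countplot(x="sentiment", data=earnings_call_df)
```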
But there's a problem with this. Have you figured it out yet? Sentences are arbitrarily cut, because the segments are fixed length. What we really want to do is split up our text on the periods rather than use the default transcription segments, and this is really easy to do. Type segments = result["text"].split on the period, and then we'll use a comprehension to add the period back to each segment. Let's see what the length of that is; obviously I misspelled "segments" there. print(len(segments)): we should have 312 of them. Let's see what that looks like with segments[:5] to take the first five, and now you can see that each sentence is now a segment.
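The sentence-splitting comprehension; the if clause is my addition to drop the empty piece left after the final period:

```python
# Split the full transcription into sentences on periods, re-appending
# the period to each piece.
segments = result["text"].split(".")
segments = [segment + "." for segment in segments if segment]

print(len(segments))  # 312 in the video
segments[:5]
```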
Now we're just going to run through the same process. We'll create a DataFrame for the new segments; I'll call it earnings_call_df_custom = pd.DataFrame(segments, columns=["text"]). Now we should see a row for each sentence, and this time why don't we sample, I don't know, 40. So earnings_call_df_custom equals... actually, you know what, I don't want to overwrite it here: earnings_call_df_custom = earnings_call_df_custom.sample(n=40). Then earnings_call_df_custom["sentiment"] = earnings_call_df_custom["text"].apply(get_sentiment) to apply our sentiment function, and earnings_call_df_custom.head() to see if it all got sentiment. We see that it did figure out the sentiment for all of that text. One more time: print(earnings_call_df_custom["sentiment"].value_counts()) to see what we're working with now, and we'll plot it with sns.countplot(x="sentiment", data=earnings_call_df_custom); I could have just copied and pasted from above, but that's okay. Let's see what we get: lots of neutral and mostly positive, so it appears that was a good earnings call for Microsoft.
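The sentence-level pass, reusing segments, get_sentiment, pandas, and seaborn from the cells above:

```python
# Same pipeline at sentence level: build the frame, sample 40,
# classify, count, plot.
earnings_call_df_custom = pd.DataFrame(segments, columns=["text"])
earnings_call_df_custom = earnings_call_df_custom.sample(n=40)
earnings_call_df_custom["sentiment"] = earnings_call_df_custom["text"].apply(
    get_sentiment
)
print(earnings_call_df_custom["sentiment"].value_counts())
sns.countplot(x="sentiment", data=earnings_call_df_custom)
```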
And that's it for this tutorial. Hopefully you enjoyed it. You're now able to take any audio, transcribe it, analyze the sentiment, and summarize it; you're pretty much a Whisper master. If you liked this video, please hit a thumbs up, and I'd love to see you in the next one. Thanks, bye.
#Programming
#Podcasts
#Intelligent Personal Assistants
#Other
