Speech-to-Text with OpenAI’s Whisper
Easy speech to text
OpenAI recently released a new speech recognition model called Whisper. Unlike DALL·E 2 and GPT-3, Whisper is a free and open-source model.
Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. According to OpenAI, the model is robust to accents, background noise, and technical language. In addition, it supports transcription in 99 different languages and translation from those languages into English.
This article explains how to convert speech into text using the Whisper model and Python. It won't cover how the model works or its architecture. You can read more about Whisper here.
Whisper comes in five model sizes (see the table below, reproduced from OpenAI's GitHub page). According to OpenAI, four of the sizes also have English-only versions, denoted by the .en suffix. The English-only models perform better for the tiny.en and base.en sizes; the differences become less significant for the small.en and medium.en models.

Size     Parameters   English-only model   Multilingual model
tiny     39 M         tiny.en              tiny
base     74 M         base.en              base
small    244 M        small.en             small
medium   769 M        medium.en            medium
large    1550 M       N/A                  large
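To make the naming concrete: the English-only checkpoint names are formed by appending .en to the size, and these names are what you pass to whisper.load_model(). The helper below is my own illustration, not part of the whisper package:

```python
# Hypothetical helper (not part of the whisper package): build the checkpoint
# name that whisper.load_model() expects from a size and a language preference.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}  # "large" is multilingual only

def model_name(size, english_only=False):
    if english_only and size in ENGLISH_ONLY_SIZES:
        return size + ".en"  # e.g. "base.en"
    return size

print(model_name("base", english_only=True))   # base.en
print(model_name("large", english_only=True))  # large (no .en variant exists)
```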
For this article, I convert a YouTube video into audio and pass the audio to the Whisper model to convert it into text.
I used Google Colab with GPU to execute the below code.
Installing the Pytube library
!pip install --upgrade pytube
Reading a YouTube video and downloading it as an MP4 file to transcribe
In the first example, I am transcribing the famous Taken movie dialogue from the YouTube video below.
# Importing the Pytube library
import pytube

# Reading the above Taken movie YouTube link
video = 'https://www.youtube.com/watch?v=-LIIf7E-qFI'
data = pytube.YouTube(video)

# Converting and downloading as an 'MP4' file
audio = data.streams.get_audio_only()
audio.download()
The audio from the above YouTube link has been downloaded as an 'MP4' file and stored under Colab's content directory.
Now, the next step is to convert the audio into text. We can do this in three lines of code using Whisper.
Installing and importing the Whisper library
# Installing the Whisper library
!pip install git+https://github.com/openai/whisper.git -q

import whisper
I am loading the large multilingual model here and passing the downloaded audio file, I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4, to it; the result is stored in a text object.
# Loading the large model
model = whisper.load_model("large")
text = model.transcribe("I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4")

# Printing the transcribed text
text['text']
Below is the text from the audio. It matches the audio exactly.
I don’t know who you are. I don’t know what you want. If you are looking for ransom, I can tell you I don’t have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don’t, I will look for you. I will find you. And I will kill you. Good luck.
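Besides the full transcript under text['text'], Whisper's transcribe() also returns timestamped segments under text['segments']. The sketch below iterates over a hand-written dict that mimics that output shape; the timings and text are illustrative samples, not actual model output:

```python
# A hand-made dict imitating the structure returned by model.transcribe();
# the timings and text here are illustrative only.
result = {
    "text": " I don't know who you are. I don't know what you want.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " I don't know who you are."},
        {"start": 2.5, "end": 5.1, "text": " I don't know what you want."},
    ],
}

# Print each segment with its start/end time, e.g. as a basis for subtitles
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s]{seg['text']}")
```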
How about converting audio in a different language?
As we know, Whisper supports 99 languages. In this example, I am trying Tamil, an Indian language, and converting the movie clip below into text.
# Importing the Pytube library
import pytube

# Reading the above Tamil movie clip from the YouTube link
video = 'https://www.youtube.com/watch?v=H1HPYH2uMfQ'
data = pytube.YouTube(video)

# Converting and downloading as an 'MP4' file
audio = data.streams.get_audio_only()
audio.download()
Loading the large model
# Loading the large model
model = whisper.load_model("large")
text = model.transcribe("Petta mass dialogue with WhatsApp status 30 Seconds.mp4")

# Printing the transcribed text
text['text']
The model converted the above Tamil audio clip into text. It transcribed the audio well; however, I can see some small variations in the language.
சிறப்பான தரமான சம்பவங்களை இனிமேல் தான் பார்க்கப் போகிறேன். ஏய்.. ஏய்.. ஏய்.. சத்தியமா சொல்கிறேன். அடிச்சி அண்டு வேண்டும் என்று ஓழ்வு விட்டுடுவேன். மானம் போலம் திருப்பி வராது பார்த்துவிடு. ஏய்.. யாருக்காவது பொண்டாட்டி குழந்தைக் குட்டியன் சென்றும் குட்டும் என்று செய்துவிட்டு இருந்தால் அப்டியே ஓடி போய்டு.
I mainly tried the medium and large models. They are robust and transcribe the audio accurately. I also transcribed a longer audio clip of about 10 minutes using an Azure Synapse notebook with GPU, which worked very well.
Whisper is fully open source and free, so we can use it directly in our speech recognition applications. It can also translate other languages into English; I will cover that in my next article, with long audio clips and translation from different languages into English.
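For reference, translation is requested through the task argument of transcribe() (task="translate" asks Whisper to output English). Since running the model needs a GPU and the downloaded file, the sketch below only builds the keyword arguments; the helper function is my own illustration, not part of the whisper package:

```python
# Hypothetical helper (my own, not part of whisper): collect the keyword
# arguments for model.transcribe(). task="translate" requests English output.
def build_transcribe_options(task="transcribe", language=None):
    opts = {"task": task}
    if language is not None:
        # e.g. "ta" for Tamil, if you want to skip automatic language detection
        opts["language"] = language
    return opts

print(build_transcribe_options(task="translate", language="ta"))

# Usage with a loaded model (assumes whisper is installed and the file exists):
# model = whisper.load_model("large")
# result = model.transcribe("Petta mass dialogue with WhatsApp status 30 Seconds.mp4",
#                           **build_transcribe_options(task="translate"))
# print(result["text"])
```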
To learn more about the Whisper model, please visit Whisper's GitHub page.
Thanks for reading. Keep learning, and stay tuned for more!