Speech-to-Text with OpenAI’s Whisper


Easy speech to text

OpenAI recently released a new speech recognition model called Whisper. Unlike DALL·E 2 and GPT-3, Whisper is a free and open-source model.

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. According to OpenAI, the model is robust to accents, background noise and technical language. It supports transcription in 99 different languages and translation from those languages into English.

This article explains how to convert speech into text using the Whisper model and Python. It won’t cover how the model works or its architecture. You can read more about Whisper here.

Whisper comes in five model sizes (see the table below, taken from OpenAI’s GitHub page). Four of them also have English-only versions, denoted by the .en suffix. According to OpenAI, the English-only models perform better for tiny.en and base.en, while the difference becomes less significant for the small.en and medium.en models.

Ref: OpenAI’s GitHub page
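Since the table is an image on the original page, here is a rough sketch of choosing a model size programmatically. The parameter and VRAM figures are approximations quoted from OpenAI’s README, and pick_model is a hypothetical helper, not part of the whisper package:

```python
# Approximate figures from OpenAI's Whisper README; treat them as
# indicative rather than authoritative.
MODELS = {
    # name: (parameters in millions, approx. required VRAM in GB)
    "tiny":   (39,   1),
    "base":   (74,   1),
    "small":  (244,  2),
    "medium": (769,  5),
    "large":  (1550, 10),
}

def pick_model(vram_gb):
    """Hypothetical helper: largest model that fits in the given VRAM."""
    fitting = [name for name, (_, need) in MODELS.items() if need <= vram_gb]
    # dict insertion order goes from smallest to largest model
    return fitting[-1] if fitting else None

print(pick_model(5))   # -> medium
```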

For this article, I am converting a YouTube video into audio and passing the audio to the Whisper model to convert it into text.

Image by author

I used Google Colab with GPU to execute the below code.

Importing Pytube Library

!pip install --upgrade pytube

Reading a YouTube video and downloading it as an MP4 file to transcribe
In the first example, I am reading the famous Taken movie dialogue from the YouTube video below.

#Importing Pytube library

import pytube

# Reading the Taken movie YouTube link above

video = 'https://www.youtube.com/watch?v=-LIIf7E-qFI'
data = pytube.YouTube(video)

# Converting and downloading as 'MP4' file

audio = data.streams.get_audio_only()
audio.download()

Output

The above YouTube video has been downloaded as an ‘MP4’ file and stored under /content.
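One practical wrinkle: pytube names the file after the video title, which is why such a long filename appears in the transcribe call later. A small sketch of working around that, where slugify is a hypothetical helper and the filename argument of pytube’s download() is used to control the saved name:

```python
import re

def slugify(title):
    """Hypothetical helper: reduce a video title to a tidy file name."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") + ".mp4"

def download_audio(url):
    """Download only the audio stream of a YouTube video under a tidy name."""
    import pytube  # third-party; pip install pytube
    data = pytube.YouTube(url)
    audio = data.streams.get_audio_only()
    # download() accepts a filename argument, so we control the saved name
    return audio.download(filename=slugify(data.title))

print(slugify("I will find You I will Kill You Taken Movie"))
# -> i-will-find-you-i-will-kill-you-taken-movie.mp4
```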
Now, the next step is to convert audio into text. We can do this in three lines of code using whisper.


Importing Whisper library

# Installing the Whisper library

!pip install git+https://github.com/openai/whisper.git -q
import whisper

Loading model

I am using the medium multilingual model here, passing in the audio file downloaded above (I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4), and storing the result in a text object.

model = whisper.load_model("medium")

text = model.transcribe("I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4")

# Printing the transcription

text['text']

Output

Below is the text from the audio. It matches the audio exactly.

I don’t know who you are. I don’t know what you want. If you are looking for ransom, I can tell you I don’t have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don’t, I will look for you. I will find you. And I will kill you. Good luck.
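Beyond the plain text, the dictionary returned by transcribe() also carries per-segment timestamps (under a segments key in my runs). A minimal sketch of saving the transcript and printing timestamps, using a hard-coded result dict in place of a real model call:

```python
# Stand-in for the dict returned by model.transcribe(); the 'text' and
# 'segments' fields mirror what whisper returned in my runs.
result = {
    "text": "I don't know who you are. I don't know what you want.",
    "segments": [
        {"start": 0.0, "end": 3.5, "text": "I don't know who you are."},
        {"start": 3.5, "end": 6.8, "text": "I don't know what you want."},
    ],
}

# Save the full transcript to a file
with open("transcript.txt", "w") as f:
    f.write(result["text"])

# Print each segment with its start/end time in seconds
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f} -> {seg['end']:6.1f}] {seg['text']}")
```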

How about converting audio in a different language?

As we know, Whisper supports 99 languages; I am trying Tamil, an Indian language, converting the movie clip below into text.

In this example, I used the large model.

#Importing Pytube library

import pytube

# Reading the Tamil movie clip from the YouTube link above

video = 'https://www.youtube.com/watch?v=H1HPYH2uMfQ'
data = pytube.YouTube(video)

# Converting and downloading as 'MP4' file

audio = data.streams.get_audio_only()
audio.download()

Output

Loading Large Model

# Loading the large model

model = whisper.load_model("large")
text = model.transcribe("Petta mass dialogue with WhatsApp status 30 Seconds.mp4")

# Printing the transcription
text['text']

Output

The model converted the above Tamil audio clip into text. It transcribed the audio well; however, I can see some small variations in the language.

சிறப்பான தரமான சம்பவங்களை இனிமேல் தான் பார்க்கப் போகிறேன். ஏய்.. ஏய்.. ஏய்.. சத்தியமா சொல்கிறேன். அடிச்சி அண்டு வேண்டும் என்று ஓழ்வு விட்டுடுவேன். மானம் போலம் திருப்பி வராது பார்த்துவிடு. ஏய்.. யாருக்காவது பொண்டாட்டி குழந்தைக் குட்டியன் சென்றும் குட்டும் என்று செய்துவிட்டு இருந்தால் அப்டியே ஓடி போய்டு.

I mainly tried the medium and large models. They are robust and transcribe the audio accurately. I also transcribed a longer clip of about 10 minutes using an Azure Synapse notebook with GPU, which worked very well.
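Whisper chops long audio into 30-second windows internally, so a 10-minute file needs no manual splitting. When you have many files, a small batch wrapper is handy; transcribe_folder below is a hypothetical helper, not part of the whisper package:

```python
import pathlib

def transcribe_folder(folder, model):
    """Hypothetical helper: transcribe every .mp4 in a folder with one model."""
    transcripts = {}
    for path in sorted(pathlib.Path(folder).glob("*.mp4")):
        # model.transcribe returns a dict; 'text' holds the full transcript
        transcripts[path.name] = model.transcribe(str(path))["text"]
    return transcripts

# Usage (not run here; needs whisper and audio files):
# model = whisper.load_model("medium")
# for name, text in transcribe_folder("/content", model).items():
#     print(name, text[:60])
```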

Whisper is fully open source and free, so we can use it directly in our speech recognition applications. We can translate other languages into English as well; I will cover that in my next article, with long audio and different languages translated into English.
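As a sketch of that translation workflow: transcribe() accepts language and task arguments, and task="translate" asks the model to output English directly. The transcribe_clip wrapper below is a hypothetical helper written around that API:

```python
def transcribe_clip(path, language=None, task="transcribe"):
    """Hypothetical wrapper: one call for both transcription and
    translation-to-English (task="translate")."""
    assert task in ("transcribe", "translate"), "whisper supports these two tasks"
    import whisper  # third-party; pip install openai-whisper
    model = whisper.load_model("large")
    return model.transcribe(path, language=language, task=task)["text"]

# Usage (not run here; needs the model weights and an audio file):
# tamil_text   = transcribe_clip("clip.mp4", language="ta")
# english_text = transcribe_clip("clip.mp4", language="ta", task="translate")
```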

To learn more about the Whisper model, please visit Whisper’s GitHub page.

Thanks for reading. Keep learning, and stay tuned for more!

Reference

  1. https://github.com/openai/whisper
  2. https://openai.com/blog/whisper/