In this section, we will see how to transcribe an audio file to text using the OpenAI Whisper model and then label the transcription using an OpenAI large language model (LLM).
Whisper is an open source ASR model developed by OpenAI. It is trained on nearly 700,000 hours of multilingual speech data and is capable of transcribing audio to text in almost 100 different languages. According to OpenAI, Whisper “approaches human level robustness and accuracy on English speech recognition.”
In a recent benchmark study, Whisper was compared to other open source ASR models, such as wav2vec 2.0 and Kaldi. The study found that Whisper performed better than wav2vec 2.0 in terms of accuracy and speed across five different use cases, including conversational AI, phone calls, meetings, videos, and earnings calls.
Whisper is also known for its affordability, accuracy, and features. It is designed for audio-to-text use cases and does not perform text-to-audio tasks such as speech synthesis.
The Whisper model can be imported as a Python library. The other option is to use the Whisper model available in the model catalog at Azure Machine Learning studio.
Let’s now walk through transcribing audio with the OpenAI Whisper ASR model using the Python library. The specified audio file must exist and be accessible for transcription to succeed. The transcription result is a dictionary; when it is assigned to a variable named text, the transcribed text is available as text['text'], which is what the print statement displays.
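Because transcription fails when the audio file is missing or unreadable, it can help to check the path first. The following is a minimal sketch using only the Python standard library; the helper name check_audio_file is hypothetical and is not part of the Whisper library.

```python
import os
from pathlib import Path


def check_audio_file(path):
    """Return the resolved path if the file exists and is readable.

    Hypothetical helper for illustration; not part of the whisper package.
    """
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    if not os.access(audio_path, os.R_OK):
        raise PermissionError(f"Audio file is not readable: {audio_path}")
    return audio_path.resolve()
```

Running this check before transcription turns a missing or unreadable file into an explicit error message that names the path.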
First, we need to install the Whisper library, as described in the technical requirements section. Then, we import it.
Step 1 – importing the Whisper model
Let us import the required Python libraries:
import whisper
import pytube
The whisper library is imported, which is the library providing access to the OpenAI Whisper ASR model. The pytube library is imported to download YouTube videos.
Step 2 – loading the base Whisper model
Let us load the base Whisper model:
model = whisper.load_model("base")
The Whisper model is loaded using the whisper.load_model function with the “base” argument. This loads the base version of the Whisper ASR model.
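The "base" argument selects one of several available model sizes. As a hedged sketch, the helper below validates the requested name against whisper.available_models() before loading it. The function name load_whisper_model is an assumption made for illustration, and the whisper import is deferred until the function is called.

```python
def load_whisper_model(name="base"):
    """Load a Whisper model by name, with a clearer error for unknown names.

    Hypothetical wrapper for illustration; it relies only on
    whisper.available_models() and whisper.load_model().
    """
    import whisper  # deferred import: requires the openai-whisper package

    available = whisper.available_models()
    if name not in available:
        raise ValueError(
            f"Unknown Whisper model {name!r}; choose one of: {', '.join(available)}"
        )
    # Downloads the weights on first use, then loads them from the local cache.
    return whisper.load_model(name)
```

Larger models are generally more accurate but slower and need more memory, so "base" is a reasonable starting point for experiments.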
Next, let us download the audio stream from a YouTube video. Although the source is a video, we need only its audio, so we download just the audio stream. Alternatively, you can use any audio file directly:
# Import the pytube library
import pytube
# Create a YouTube object from the video link
video = "https://youtu.be/g8Q452PEXwY"
data = pytube.YouTube(video)
The YouTube video URL is specified. Using the pytube.YouTube class, the video data is fetched:
# Download the audio-only stream as an MP4 file
audio = data.streams.get_audio_only()
audio.download()
This code uses the pytube library to download the audio stream from a YouTube video. Let’s examine the preceding code snippet:
- audio = data.streams.get_audio_only(): This line fetches the audio stream of the video. It uses the get_audio_only() method to obtain a stream containing only the audio content.
- audio.download(): Once the audio stream is obtained, this line downloads the audio content. The download is performed in the default format, which is typically an MP4 file containing only the audio data.
In summary, the code extracts the audio stream from a video and downloads it as an MP4 file, preserving only the audio content.
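The download steps above can be wrapped in one function that also handles a video with no audio-only stream. This is a sketch under assumptions: the name download_audio and its output_dir parameter are illustrative, and it uses only the pytube calls already shown plus the output_path argument of the stream's download() method.

```python
def download_audio(url, output_dir="."):
    """Download the audio-only stream of a YouTube video and return its path.

    Hypothetical helper for illustration, built on the pytube calls above.
    """
    import pytube  # deferred import: requires the pytube package

    video = pytube.YouTube(url)
    audio = video.streams.get_audio_only()
    if audio is None:
        # get_audio_only() returns None when no matching stream is found.
        raise RuntimeError(f"No audio-only stream found for {url}")
    # download() returns the path of the saved file.
    return audio.download(output_path=output_dir)
```

The returned path can be passed directly to the Whisper model for transcription.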