This workshop provides a step-by-step guide to creating a PyCharm project that transcribes audio files using OpenAI’s Whisper ASR and wav2vec 2.0.
Setting up your computer for Python and Audio Processing
STEP 1
Install Python first using this link.
Once Python installs, open the Python 3.13 (or later) folder and launch the script “Install Certificates.command.” You can launch it by double-clicking the file.

A terminal window should pop up and run a script to update Python’s ability to use secure websites. When it’s done, it should look like this:

Close the window and exit the Terminal application.
STEP 2
Install PyCharm Community (the free version) on your Mac or PC. Follow the instructions as necessary.
STEP 3
Once PyCharm CE is installed, open it and create a new PyCharm CE Project named ASR_Workshop (or similar) by changing “PythonProject” to “ASR_Workshop” in the location.
All other settings can remain as the default.

STEP 4
Create a new Python file by right-clicking on the folder “ASR_Workshop” and selecting “New,” and then “Python File.” Name the file “WhisperDoodle.py” or something similar.

STEP 5
Check to make sure Python is working. Here are two examples you can try. Click the green arrow button at the top to run your Python script.
#test
print("Hello World!")
#Try some math
a = 2
b = 3
print(a + b)

STEP 6
Install ffmpeg using the terminal window inside PyCharm. ffmpeg is a low-level system library for reading a variety of audio file types, and it is necessary for many audio processing applications. Click on the terminal window (see screenshot below) and type the command below for your operating system. To run a command in the terminal, hit Enter.
# Mac Only
brew install ffmpeg
**If you receive an error message, you may not have Homebrew installed. You can download and install Homebrew here. On Windows, you can install ffmpeg through Chocolatey using:
# Windows Only
choco install ffmpeg
**You may need to install Chocolatey first.
Note that it may take a few minutes to install ffmpeg. Confirm that it is installed successfully before going on to the next step.
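To confirm ffmpeg is on your system’s PATH before moving on, you can run this small check from your Python file (this is just a sketch using Python’s standard library; it does not need Whisper installed):

```python
import shutil

# shutil.which returns the full path to an executable, or None if it isn't found
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path:
    print(f"ffmpeg found at: {ffmpeg_path}")
else:
    print("ffmpeg not found -- revisit the installation steps above")
```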

STEP 7
Install OpenAI’s Whisper using the PyCharm Package Manager. The Python Packages button in the bottom left will open a new panel where you can search for python packages. Type in “openai-whisper,” click on the result, and then click install. If you’re given a list of version options, choose the highest version number.

PyCharm will show you progress bars along the bottom as it installs Whisper.
Congratulations, you have whisper installed!
Using OpenAI’s Whisper for Transcription
Let’s download and try a speech sample from the Speech Accent Archive. You can move any audio files you’d like into your PyCharm project folder, but this code will download an Amazigh speaker’s recording. Copy/paste the lines below into your WhisperDoodle.py workfile and run. You should see amazigh1.mp3 in the file list on the left.
from urllib.request import urlretrieve
webfile = 'http://accent.gmu.edu/soundtracks/amazigh1.mp3'
savefileas = 'amazigh1.mp3'
urlretrieve(webfile, savefileas)
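If you want to confirm the download worked before moving on, a quick check (assuming the filename used above) is:

```python
import os

filename = "amazigh1.mp3"
if os.path.exists(filename):
    # a successful download should be well over 0 bytes
    print(f"{filename} saved ({os.path.getsize(filename)} bytes)")
else:
    print(f"{filename} is missing -- rerun the download code above")
```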

Now, you can process amazigh1.mp3 on Whisper using the base model. Copy/paste this code into your WhisperDoodle.py workfile and run.
import whisper
model = whisper.load_model("base")
result = model.transcribe("amazigh1.mp3")
print(result)
This step might take a few minutes, especially the first time when Whisper will download the model. You might receive some red warning messages as well, which we can address later.
The output contains a lot of information. If you scroll up, you can see the ‘text’, which contains the full transcription, and ‘segments’, one entry for each of the longer stretches of speech. Each segment contains the seek (the audio frames used for processing), start and end timestamps, the transcription, the tokens, and some model performance values.
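To see how these fields fit together, here is a sketch that pulls the full text and per-segment timestamps out of a hand-made dictionary shaped like Whisper’s output (the values are invented for illustration; swap in your own result from model.transcribe):

```python
# A miniature stand-in for the dictionary Whisper returns
result = {
    "text": " Please call Stella.",
    "segments": [
        {"id": 0, "seek": 0, "start": 0.0, "end": 2.5, "text": " Please call Stella."},
    ],
}

print(result["text"])  # the full transcription in one string
for segment in result["segments"]:
    # each segment carries its own start/end timestamps in seconds
    print(f"[{segment['start']:.1f}-{segment['end']:.1f}]{segment['text']}")
```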

Different transcription settings
You can experiment with different models and transcription settings. To adjust the settings given to the Whisper transcribe function, try specifying a language (by default, Whisper detects the language automatically). Whisper model performance for 99 languages is presented here, but you will need to use the corresponding language code from this list.
import whisper
model = whisper.load_model("base")
mp3_file = "amazigh1.mp3"
transcription = ""
result = model.transcribe(mp3_file, language="en")
for segment in result['segments']:
    transcription += segment['text'] + "\n"
print(transcription)
Using different models
You can increase transcription accuracy with a larger model. Review the model details. To change the model you want to use, simply change model = whisper.load_model("base") to model = whisper.load_model("turbo"). Keep in mind these models can take a lot of hard disk space, and they are tricky to track down and delete.
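By default Whisper caches downloaded models in ~/.cache/whisper (an assumption based on Whisper’s defaults; the location can be changed with the download_root argument to load_model). This snippet lists what is stored there, so you can find and delete models you no longer need:

```python
import os

# Whisper's default model cache (assumes no XDG_CACHE_HOME override)
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "whisper")
if os.path.isdir(cache_dir):
    for name in sorted(os.listdir(cache_dir)):
        size_mb = os.path.getsize(os.path.join(cache_dir, name)) / 1e6
        print(f"{name}: {size_mb:.0f} MB")
else:
    print("No Whisper model cache found at", cache_dir)
```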
Saving Transcriptions
Saving the transcription to a text file might make the most sense. You can see the notes about loops below to imagine how this might be helpful for processing many files at once. In order to get all the text from Whisper, we need to loop through each segment, gather them up in one string, and write them to a file. The code below does exactly this.
import whisper
model = whisper.load_model("base")
mp3_file = "amazigh1.mp3"
transcription = ""
result = model.transcribe(mp3_file)
for segment in result['segments']:
    transcription += segment['text'] + "\n"
with open("output.txt", "w") as f:
    f.write(transcription)
You can also save all of the output information for each segment. Notice that Whisper’s segments are quite long, which gives you an idea of how the model uses context to make predictions.
import whisper
model = whisper.load_model("base")
mp3_file = "amazigh1.mp3"
csv_file = "file,text,segment_id,seek,start,end,temperature,avg_logprob,compression_ratio,no_speech_prob\n"
result = model.transcribe(mp3_file)
for segment in result['segments']:
    csv_file += f"{mp3_file},\"{segment['text']}\",{segment['id']},{segment['seek']},{segment['start']},{segment['end']},{segment['temperature']},{segment['avg_logprob']},{segment['compression_ratio']},{segment['no_speech_prob']}\n"
with open("output.csv", "w") as f:
    f.write(csv_file)
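One caution: building CSV rows by hand breaks if a segment’s text itself contains a quotation mark or comma. Python’s csv module handles quoting for you. Here is a sketch with the same columns; the sample segment dictionary is made up for illustration, and in practice you would loop over result['segments'] instead:

```python
import csv

# A made-up segment, shaped like one entry of result['segments']
mp3_file = "amazigh1.mp3"
segments = [
    {"id": 0, "seek": 0, "start": 0.0, "end": 2.5, "text": ' He said "hello".',
     "temperature": 0.0, "avg_logprob": -0.3, "compression_ratio": 1.2,
     "no_speech_prob": 0.01},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)  # csv.writer quotes tricky text automatically
    writer.writerow(["file", "text", "segment_id", "seek", "start", "end",
                     "temperature", "avg_logprob", "compression_ratio",
                     "no_speech_prob"])
    for s in segments:
        writer.writerow([mp3_file, s["text"], s["id"], s["seek"], s["start"],
                         s["end"], s["temperature"], s["avg_logprob"],
                         s["compression_ratio"], s["no_speech_prob"]])
```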
Notes about loops
To loop through several audio files efficiently, load the model once and then run the loop; reloading the model for every file wastes time. A sample code set might look like:
import whisper
model = whisper.load_model("base")
filelist = ["amazigh1.mp3", "amazigh2.mp3"]
for file in filelist:
    result = model.transcribe(file)
    print(result)
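Instead of typing the filenames by hand, you can build the list from every .mp3 in the project folder. This sketch only gathers the names; you would then feed the resulting filelist into the transcription loop above:

```python
from pathlib import Path

# Collect every .mp3 in the current folder, sorted for a stable order
filelist = sorted(str(p) for p in Path(".").glob("*.mp3"))
print(filelist)
```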