
Transcribing audio with whisper.cpp

2024-09-22 #experiment #how-to #tts

I wanted a quick setup where I can speak to my laptop, and it will record the audio and then give me a text transcript of what I have said.

My intention is to produce long text by speaking it out loud and then refining it lightly, fixing any typos.

Preparation

The first thing I did was clone the whisper.cpp repository.

git clone https://github.com/ggerganov/whisper.cpp/

Then I built the main example, which just takes a WAV file as input and produces text as output.

make main

Next, I had to download the models for recognizing English speech.

The script did not work at first. I had to patch it to use curl instead of wget, because it uses a few wget-specific flags which BusyBox's wget does not support.

--- a/models/download-ggml-model.sh
+++ b/models/download-ggml-model.sh
@@ -101,11 +101,7 @@ if [ -f "ggml-$model.bin" ]; then
     exit 0
 fi

-if [ -x "$(command -v wget2)" ]; then
-    wget2 --no-config --progress bar -O ggml-"$model".bin $src/$pfx-"$model".bin
-elif [ -x "$(command -v wget)" ]; then
-    wget --no-config --quiet --show-progress -O ggml-"$model".bin $src/$pfx-"$model".bin
-elif [ -x "$(command -v curl)" ]; then
+if [ -x "$(command -v curl)" ]; then
     curl -L --output ggml-"$model".bin $src/$pfx-"$model".bin
 else
     printf "Either wget or curl is required to download models.\n"

I then downloaded the model files using the patched script.

sh ./models/download-ggml-model.sh base.en
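
An alternative to patching the script is to fetch the model file directly with curl. The sketch below assumes that the script's $src and $pfx variables are set as shown, so that the URL points at the Hugging Face repository. Those two values and the model_url and download_model helper names are assumptions, not part of the patch above.

```shell
# Assumption: the download script builds its URL as $src/$pfx-$model.bin
# with src and pfx set as below. Check the script if the download fails.
src="https://huggingface.co/ggerganov/whisper.cpp"
pfx="resolve/main/ggml"

# Print the download URL for a model name such as base.en.
model_url() {
    printf '%s/%s-%s.bin\n' "$src" "$pfx" "$1"
}

# Download a model into ./models using curl only (hypothetical helper).
download_model() {
    curl -L --output "models/ggml-$1.bin" "$(model_url "$1")"
}
```

Running download_model base.en from the repository root should then leave the file where the transcription command expects it.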

Usage

Next, I needed to record audio snippets of me speaking the actual text. For this, I used the arecord utility. My first recording was rejected because the audio needs to be recorded at 16 kHz. I quickly found the flag to change the rate at which the audio is recorded.

I used the following command to record short audio snippets, usually one paragraph at a time. I would terminate recording by pressing Ctrl+C.

arecord --format=cd -r 16000 file.wav
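
Note that --format=cd also implies stereo. A variant that records 16-bit mono directly might look like the sketch below. The record_snippet name and the ARECORD override are hypothetical, and whether mono makes any difference to the transcript is untested here.

```shell
# Record 16-bit mono audio at 16 kHz into the given file; stop with Ctrl+C.
# ARECORD is a hypothetical override so another recorder can be swapped in.
record_snippet() {
    "${ARECORD:-arecord}" --format=S16_LE --channels=1 --rate=16000 "$1"
}
```

Usage would be record_snippet file.wav, followed by Ctrl+C once the paragraph is finished.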

Finally, I used the following command to transcribe the audio files into text.

./main -m ./models/ggml-base.en.bin -f file.wav

This would usually take a second or two. I then copied the text into this file and simply removed the timestamps because they are not relevant.
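
Removing the timestamps by hand can be avoided. The main example accepts a -nt (--no-timestamps) flag, and output that has already been saved can be cleaned with sed. The sketch below assumes each transcript line starts with a prefix of the form [00:00:00.000 --> 00:00:04.000]; the strip_timestamps name is made up for illustration.

```shell
# Remove a leading "[hh:mm:ss.mmm --> hh:mm:ss.mmm]" prefix from each line
# read on standard input, along with the whitespace that follows it.
strip_timestamps() {
    sed -E 's/^\[[0-9:.]+ --> [0-9:.]+\][[:space:]]*//'
}
```

It can be used as a filter: ./main -m ./models/ggml-base.en.bin -f file.wav | strip_timestamps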

Results

All prose text in this article was dictated by me to my laptop and converted by whisper using the steps described above. Only the URLs and code samples were copy-pasted from elsewhere.

I only had to fix a few minor typos as described below.

I also had to apply some editorial fixes which were more related to my own grammar than the transcription process itself.

I was quite impressed by the fact that wget and curl were actually spelled correctly.

I found the transcriptions to be quite reliable and the process convenient to use. I will experiment further with using them to take notes. At this stage there is the minor annoyance of having two separate steps for recording and converting audio. This is simply because I am using an example binary for my test.

I expect that I should be able to write a small program which uses whisper and pipes input from my microphone straight into it. That would give me a continuous stream of text, which I could then paste into articles and refine by fixing small typos and mistakes.
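
Short of a real streaming program, the two steps can at least be glued together in one shell function. This is only a sketch: it records for a fixed number of seconds rather than streaming, and the dictate name and the ARECORD, WHISPER and MODEL variables are hypothetical. It also assumes the -nt flag of the main example to suppress timestamps.

```shell
# Record for a fixed number of seconds, then print the transcript.
# ARECORD, WHISPER and MODEL are hypothetical overrides; their defaults
# are based on the commands used earlier in this article.
dictate() {
    seconds="${1:-30}"
    wav="$(mktemp)" || return 1
    "${ARECORD:-arecord}" --format=S16_LE --channels=1 --rate=16000 \
        --duration="$seconds" "$wav" &&
    "${WHISPER:-./main}" -m "${MODEL:-./models/ggml-base.en.bin}" -f "$wav" -nt
    status=$?
    # Clean up the temporary recording whether or not transcription worked.
    rm -f "$wav"
    return "$status"
}
```

Calling dictate 20 would record for twenty seconds and then print the text, which could be redirected into a file or piped into a clipboard tool.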

Future ideas

I would also like to set up an automated messaging account such that I can send an instant message with an audio recording to it, and it will relay the transcript to my email. This would allow me to take spoken notes on my phone, which I can later process as text on my laptop.

Have comments or want to discuss this topic?
Send an email to ~whynothugo/public-inbox@lists.sr.ht (mailing list etiquette)
