After reading James Somers' article "Whispers of A.I.’s Modular Future" about Georgi Gerganov's open source C++ implementation of OpenAI’s Whisper AI model, I decided to try running Whisper.cpp on my own machine.

What does it do?

Whisper.cpp transcribes audio input into text, and it works for lots of languages out of the box. Whereas in the 1990s you had to train your speech software on your specific voice and speak precisely with clear pronunciation, Whisper.cpp is much more forgiving.

Recording your voice

I used the built-in voice memo app of my iPhone, but really any audio app should be good enough. The default app has the advantage of being free and without ads, and whatever I say is not uploaded to some unknown third-party server.

Cleaning up the recording

Whisper.cpp is known to suffer from occasional hallucinations when given random low-level noise (that is, longer periods where nothing is said). Since processing longer files also takes a lot of time, I wanted to remove silence and long pauses from the recording. This can be done with FFmpeg, the free open source Swiss Army knife of media transformation.

FFmpeg is also useful because it converts Apple's m4a files into the WAV files that Whisper.cpp takes as input. According to the whisper.cpp README these must be 16-bit and sampled at 16 kHz, hence the -ar, -ac and -c:a options. The following command also removes silent periods of more than 2s:

ffmpeg -i "my-recording.m4a" -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-30dB:stop_threshold=-30dB:start_silence=2:stop_silence=2" -ar 16000 -ac 1 -c:a pcm_s16le "my-recording.wav"

Source: Stack Overflow. Depending on the noise level, you may have to experiment a little with the threshold values.
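Experimenting with the threshold is easier when the filter string is built from a variable. The following is a minimal bash sketch, assuming ffmpeg is on the PATH; the function names and the output naming scheme (for example my-recording.-25dB.wav) are hypothetical and only serve the illustration. The resulting files are meant for comparing by ear, not as final Whisper.cpp input.

```shell
# Build the silenceremove filter string from the command above for a given
# threshold (e.g. -30dB) and a silence length in seconds.
build_silence_filter() {
  local threshold="$1" seconds="$2"
  printf 'silenceremove=start_periods=1:stop_periods=-1:start_threshold=%s:stop_threshold=%s:start_silence=%s:stop_silence=%s' \
    "$threshold" "$threshold" "$seconds" "$seconds"
}

# Write one trimmed WAV per threshold so the results can be compared.
# Usage: try_thresholds INPUT THRESHOLD...
try_thresholds() {
  local input="$1"; shift
  local stem="${input%.*}" threshold   # assumes INPUT has a file extension
  for threshold in "$@"; do
    ffmpeg -y -i "$input" \
      -af "$(build_silence_filter "$threshold" 2)" \
      "${stem}.${threshold}.wav" || return 1
  done
}
```

After pasting the functions into a shell, `try_thresholds my-recording.m4a -25dB -30dB -35dB` would produce three files to listen to. A threshold closer to 0 dB treats more of the signal as silence, so it cuts more aggressively.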

Installation

"Just" downloading and installing Whisper.cpp is easy - at least, on a Linux system.

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
./models/download-ggml-model.sh large
make

That is all. Now a "simple" command like

./main -pc -l de -m models/ggml-large.bin -f my-recording.wav -otxt -of transcription

will transcribe a German .wav file (-l de selects the language) into a text file; -pc colors the console output and -otxt writes the plain-text result.
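The conversion and transcription steps can be chained for one recording. This is a minimal bash sketch, not a definitive workflow: the function names are hypothetical, it assumes it is run from inside the whisper.cpp checkout with ffmpeg on the PATH, and it assumes the input is not already a .wav file (the converted file would otherwise overwrite it). The 16 kHz mono 16-bit conversion follows the whisper.cpp README, and as far as I know -of expects the output path without its extension.

```shell
# Strip the last extension, e.g. walk/my-recording.m4a -> walk/my-recording.
stem_of() {
  printf '%s' "${1%.*}"
}

# Convert a recording to 16 kHz mono 16-bit WAV, then transcribe it.
# Usage: transcribe_recording INPUT [LANG] [MODEL]
transcribe_recording() {
  local input="$1" lang="${2:-de}" model="${3:-models/ggml-large.bin}"
  local stem
  stem="$(stem_of "$input")"
  ffmpeg -y -i "$input" -ar 16000 -ac 1 -c:a pcm_s16le "${stem}.wav" || return 1
  # -otxt writes a text file next to the WAV, named after the stem.
  ./main -l "$lang" -m "$model" -f "${stem}.wav" -otxt -of "$stem"
}
```

With these defined, `transcribe_recording my-recording.m4a de` should leave my-recording.wav and my-recording.txt next to the original file.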

[Image: jadewald]

The program color-codes the parts it finds difficult or problematic, from red (bad quality) to green (good quality). In line 7 it did not recognize "Stein" and wrote "klein".

Resulting text (abbreviated)

 Weitere Ideen für den Jadewald. Eine Art Efeu, der sich um einen Baum herumrangt und die
 Mana-Energien aus diesem Baum konzentriert in seinen Blättern, so dass man die vielleicht
 wie einen Bonbon lutschen kann oder so, um über die Zeit etwas Gesundheit wiederzugewinnen.
 Das nächste wäre natürlich eine Schlange, die im Unterholz lebt, deren Biss zeitweilig
 versteinert wird. Das Ganze als richtige Bedrohung wäre ein Baum, der etwas Böseres hat, ein
 Todesaura oder ein Gesicht. Wer das Gesicht sieht, der muss sich schützen, dass er zu
 klein wird und dass der Baum ihm dann seine Lebenskraft mit den Wurzeln aussaugt. Danach
 verkrümelt der Stein, so dass die Abenteurer, wenn sie sich dem Baum nähern sollten, erst
 einmal einige komplett versteinerte Wesen sehen, die nur noch Spuren von Lebenskieren
 enthalten und auch eine ganze Menge Geröll um den Baum herum. ...

This was recorded on a windy day out in the fields while taking a walk.

The cleaned up result can be found here

Thoughts

  • Quality is fair, but not perfect.

  • I suck at unprepared creative speaking - but it looks like a skill I can train. Exploring ideas by talking into the wind is very different from talking to a person.

  • Now I need an AI to remove my empty words and phrases (I seem to use "natürlich" (= "naturally") too often).

  • The Bluetooth mic is about as good as the built-in microphone, but using an external in-ear device keeps my hands free.

  • It’s a good tool to record spontaneous thoughts on a walk.

  • I have to experiment with recording speech without the cold wind blowing away my words :)
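Short of an AI, the filler-word suspicion from the list above can at least be checked by counting. A minimal bash sketch, assuming grep with the -o, -i and -w options (GNU and BSD grep both have them); the function name is hypothetical.

```shell
# Count whole-word, case-insensitive occurrences of a word in a text file.
# Usage: count_filler WORD FILE
count_filler() {
  local word="$1" file="$2"
  # -o prints each match on its own line, so wc -l yields the match count;
  # tr strips the padding some wc implementations add.
  grep -o -i -w -- "$word" "$file" | wc -l | tr -d ' '
}
```

For example, `count_filler natürlich transcription.txt` prints how often the word appears in the transcript, without counting inflected forms such as "natürliche".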