One speculation I heard about Whisper is that it was developed to generate text training data to train GPT-4. GPT-3 used Common Crawl, and there is no larger text dataset in existence. OpenAI is running out of text, and Whisper is the answer.
This explains some strange features of Whisper. There is a mode to translate to English. Input is 30-second chunks, and there is no way to stream. Both are strange for a speech recognition system, but make perfect sense for a text training data generator.
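(As an illustration of the fixed window, not something from the comment above: Whisper’s own Python API pads or trims audio to exactly 30 seconds before decoding. A minimal sketch, assuming the openai-whisper package and a hypothetical clip.mp3:)

    import whisper

    model = whisper.load_model("small")

    # The encoder always sees a fixed 30-second window: shorter audio
    # is zero-padded, longer audio is truncated.
    audio = whisper.load_audio("clip.mp3")  # hypothetical input file
    audio = whisper.pad_or_trim(audio)      # exactly 30 seconds of samples
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)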
Do you mean they want to collect more textual data from existing audio? But is there really that much more audio/video data than text data?
Not the parent commenter, but yes, that’s what they appear to do.
The amount of non-English audio data is certainly bigger than the amount of English textual data.
Strangest thing I’ve encountered so far is occasional “hallucinations” where it emits a whole phrase not corresponding to any input audio at all just because of the language context.
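(For what it’s worth, one knob that plausibly relates to this: transcribe() conditions each window on the previously decoded text by default, which is exactly that language context. A sketch, assuming the openai-whisper package and a hypothetical talk.mp3:)

    import whisper

    model = whisper.load_model("small")

    # condition_on_previous_text=True (the default) feeds earlier output
    # back in as a prompt for the next 30-second window; disabling it
    # drops that language context, at the cost of less consistent style.
    result = model.transcribe("talk.mp3", condition_on_previous_text=False)
    print(result["text"])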
The code is available on GitHub.
You can try it on the playground. It’s insanely good. You can swap languages mid-sentence and it will translate to English.
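(For reference, the same translate mode is exposed in the Python API via the task parameter; a minimal sketch, with mixed_language.mp3 standing in for any input file:)

    import whisper

    model = whisper.load_model("medium")

    # task="translate" makes the model emit English text regardless of
    # the spoken language(s) in the input.
    result = model.transcribe("mixed_language.mp3", task="translate")
    print(result["text"])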
I have no experience in machine learning, but for a long time I have wanted to add subtitles to foreign videos from YouTube, so I gave Whisper a try. To me it feels horribly slow on CPU: it has been running with the default model (I think small) for about 10 minutes to transcribe less than 1 minute of audio. I have to say I’m a bit disappointed that you probably also need GPUs for inference nowadays.
Does somebody have experience with the speed on GPU?
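(For the subtitle use case above, a minimal sketch of turning Whisper’s output into an .srt file: transcribe() returns timestamped segments, which map directly onto SRT cues. episode.mp3 and the srt_time helper are illustrative, not part of Whisper:)

    import whisper

    def srt_time(t: float) -> str:
        # SRT timestamps look like 00:01:02,345
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("small")
    result = model.transcribe("episode.mp3")  # hypothetical input file

    with open("episode.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")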
It transcribed a 1:00:31 YouTube video[1] (with what looks to be >95% accuracy) in 29:18 on my 3080 using the medium model. The small model took 15:55 (mostly punctuation differences from medium, plus a few errors like “laced” being wrongly transcribed as “least”), and base took 7:44 (the transcription looks about the same as small’s).
[1] https://www.youtube.com/watch?v=ffRmE69pQhM
My 24-core CPU seemed close to 1x real-time on the small model, probably a bit worse. So it depends on your CPU a bit.
A GPU in a server that doesn’t cost all my money would be nice.
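(Regarding CPU vs GPU: load_model() takes a device argument, so a sketch of picking the GPU when one is available, with talk.mp3 as a hypothetical input file:)

    import torch
    import whisper

    # Use the GPU when available; CPU inference works but is far slower,
    # as the timings in this thread suggest.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("medium", device=device)

    result = model.transcribe("talk.mp3")
    print(result["text"])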
One thing I found very interesting when I tested the “small” model with Croatian: I was feeding it a Croatian news show and it produced words like “Harvatski”. To me this suggests that the model does not really understand Croatian, otherwise it would have known that the word must be “Hrvatski”. It seems to have only a rough sense of what Croatian words look like and produces words from that. At least that’s my guess.
Another example, it wrote “sasmo bolnici”. That indeed was the pronunciation in the audio, but of course it should be “sad smo (u) bolnici”.
On the other hand, I am wondering how it’s supposed to translate to English if it lacks understanding of the language.
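(One thing that may be worth trying here, an assumption on my part rather than something the commenter tested: transcribe() accepts an explicit language hint, which skips auto-detection, and it can be combined with task="translate". news.mp3 is a hypothetical file:)

    import whisper

    model = whisper.load_model("large")

    # "hr" is the ISO 639-1 code for Croatian; passing it explicitly
    # skips language auto-detection.
    croatian = model.transcribe("news.mp3", language="hr")
    english = model.transcribe("news.mp3", language="hr", task="translate")
    print(croatian["text"])
    print(english["text"])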
I have now also tested the “large” model on an episode of a Croatian series. After about 18 hours it’s at the 17-minute mark, so roughly 1 hour of processing per minute of audio so far.
The quality looks so much better than with the other models! I have only checked two minutes; I hope it stays like this. If it’s consistently at this quality, I can use it for my goal of learning vocabulary before watching Croatian series. Some mistakes, but good enough for me.