Evaluate Whisper transcription algorithm #1335
Preliminary result of the largest Whisper model on the TEDx pt-BR dataset: 20.6% WER. Numbers for other models are here: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2. The largest Whisper model is more than 1 order of magnitude slower than wav2vec2 with 1B params on an RTX 3090, so it is not usable in practice. Maybe one of the smaller Whisper models could offer reasonable accuracy and speed.
I tried to transcribe ~10h of audios using the largest Whisper model on an RTX 3090; the estimated time to finish was 4 days, so I aborted the test. It is not feasible to use in practice. The current wav2vec2 algorithm with 1B params took about 22min to transcribe ~29h of audios using 3 RTX 3090 (in 2 nodes), so the largest Whisper model is more than 2 orders of magnitude slower than what we have today. I'll try their smallest model (36x faster) to see how the accuracy is on the test datasets.
Hi, is there a way we can test Whisper with IPED? Is there a snapshot with it that we could use?
I think I didn't push the POC implementation; the 250x time cost compared to wav2vec2 made me very skeptical about using Whisper in production. I didn't test their smaller models yet, but maybe the accuracy will drop a lot. If you really would like to try, it is easy to change the script below with the Whisper example code from their GitHub main page:
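A minimal sketch of that kind of change, based on the example usage from the openai/whisper README; the model size and audio path are placeholders, and this is not the actual IPED script referenced above.

```python
import whisper

# Load one of the published model sizes (tiny/base/small/medium/large);
# "large" is the size discussed in the tests above.
model = whisper.load_model("large")

# Transcribe a single audio file; the result dict contains the full text
# plus per-segment details (timestamps, avg_logprob, etc.).
result = model.transcribe("audio.wav", language="pt")
print(result["text"])
```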
Their smaller model should still be 7x slower than wav2vec2, according to my tests and their published relative model costs.
Thanks @lfcnassif, I don't know how to program very well, but I'll see if a colleague can help me. This was a request from my superiors.
Hi @rafael844, I just found the multilanguage (crazy to me!) Whisper models on huggingface: https://huggingface.co/openai/whisper-large-v2 So maybe you just need to set the model name. Jonatas Grosman also fine-tuned that multilanguage model to Portuguese (https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11), although that is not required, so you can also try it. But I warn you, my past tests resulted in a 250x slowdown compared to wav2vec2. That large Whisper model's accuracy seems to be better, and it also has punctuation and caps, but I don't think the 250x cost is worth paying at scale. You may try smaller Whisper models, but accuracy should drop: https://huggingface.co/openai
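For a quick standalone test outside IPED, a minimal sketch using the Hugging Face transformers pipeline with one of the models mentioned above; the model choice, device index and chunk length are assumptions, not an IPED configuration.

```python
from transformers import pipeline

# Either the multilingual model or Jonatas Grosman's Portuguese fine-tune
# mentioned above could be plugged in here.
asr = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-pt-cv11",  # or "openai/whisper-large-v2"
    device=0,  # first GPU; use -1 for CPU
)

# Whisper decodes 30s windows, so long audios are chunked.
print(asr("audio.wav", chunk_length_s=30)["text"])
```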
Just tested; it doesn't work out of the box, it needs code changes.
Thank you @lfcnassif. I'll take a look. But with my lack of programming skills and with those results, we will keep wav2vec2 and the existing models. It would be nice to have punctuation and caps, but as you said, 250x is not worth it. Wav2vec2 does a good job; even with our cheap and weak GPUs we can spread the job across multiple machines, which is great, and the results are good so far.
You are welcome!
Price is 1/3 compared to Microsoft/Google.
Try whisper.cpp.
Thanks for pointing it out. Unfortunately they don't support GPU, and transcribing a 4s audio on a 48-thread CPU took 32s using the medium size model in a first run/test (the large model should be 2x slower). Strangely, the second run took 73s and a third run took 132s...
Strange. On my Ubuntu Linux, in a Docker container, the compiled whisper.cpp ./main runs the large model (~2.9 GB) on 4 CPU cores at about 4x the recorded duration. I create and use an image for running it with Docker.
Another optimized implementation to be tested; they say it is 4x faster than the original OpenAI model on the GPU:
The project claims to transcribe a 13min audio in ~1min using a Tesla V100S (an older GPU than ours); that's just ~3x slower than the 1B-parameter wav2vec2 model we use on the RTX 3090. Given the 4.5x speedup they reached, which is inconsistent with my past tests that showed a 250x slowdown when switching from 1B wav2vec2 to the Whisper large model, I'll try to run the performance tests again...
Another promising one: by processing audios in batches + TPUs it can give up to a 70x speedup.
Hi @nassif. We've been trying to use wav2vec2 to transcribe our audios, but the results we were getting were a bit disappointing, as the transcription was often barely readable, especially compared to Azure (which we don't have a contract with). For that reason, we looked for other options and found OpenAI's Whisper project. Although slower than wav2vec2, the results were A LOT better, comparable to Azure's transcription. For our tests I tried the Whisper and Faster-Whisper implementations (and will probably try Whisper-JAX later, although we don't have a TPU). The tests were done on an HP Z4 with a Xeon W-2175, 64 GB of RAM and a Quadro P4000.
Wav2Vec2: 3.7 s
Whisper: 23.02 s
Faster-Whisper: 8.59 s
Azure:
It would be nice to have Whisper as an option to use with IPED, as it's free, runs locally (no need to send data to the cloud), has punctuation (which makes reading considerably better), and the results are comparable to Azure's service.
What model have you used? Have you used
Have you measured WER on your data set? How many audios do you have, and what is the total duration? If you can help to compare Whisper models properly to wav2vec2 models, I can send you the public data sets used in this study:
On what data set?
So you have a GPU without TPUs, right?
Well, I think it is not enough to represent the variability we can find in seized data sets... Anyway, have you computed the WER on this 42-second audio, so we can also have an objective measure instead of just feelings (which are also important)?
I understand this is an advantage not counted by traditional WER...
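For an objective measure, WER can be computed in a few lines. A minimal sketch using the jiwer library, which is an assumption here (any WER implementation would do); the reference/hypothesis strings are made-up examples.

```python
from jiwer import cer, wer

# Reference = human-validated transcription, hypothesis = model output.
reference = "o resultado do exame foi enviado ontem"
hypothesis = "o resultado de exame foi enviado ontem"

print(f"WER: {wer(reference, hypothesis):.2%}")  # word error rate
print(f"CER: {cer(reference, hypothesis):.2%}")  # character error rate
```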
We tried both the large and small models (from jonatasgrosman and edresson). They had similar results, but the large one took a lot longer to transcribe.
We didn't measure the WER; that value was informed by the model owner. Our tests used 6000 audios from one exam, roughly adding up to 1000 minutes (or 16.9 hours). Based on every test we made with wav2vec2 (on several exams), we concluded that it was simply better not to send the transcriptions, as they were often unreadable. And as a notice, I probably didn't implement Whisper and Faster-Whisper in the best way possible, meaning there is probably room for improvement in speed. About the dataset, I could try testing it. I don't know how these datasets are made and whether they include the "kind" of audio we normally need to transcribe. Let's say it's a multitude of forms of Portuguese.
According to the author: common_voice_11_0 dataset
Yes, no TPU here.
As I said before, unfortunately just feelings. But the general feeling here is that it's way better :-)
This feature would be very interesting for those who do not have a contract for audio transcription with third parties (which I believe is the majority of Brazilian states).
Hi @DHoelz and @leosol, thanks for your contributions to the project.
Well, looking at the numbers of the tests I referenced, I think 25% fewer errors is a reasonable difference. Of course this can change depending on the data set...
It looks better for this audio, but without a gold standard I can't come to any scientific conclusion about which model is better. I'm also asking about Whisper, Faster-Whisper and Whisper-JAX: which is better? Please also notice there is an open enhancement for wav2vec2 (#1312) to avoid wrong (out-of-vocabulary) words.
Well, our users are quite satisfied; of course, if we can provide better results in an acceptable response time, that's good, that's the goal of this ticket. How have you tested wav2vec2, using IPED or externally in a standalone application?
Common Voice cuts are usually easy data sets; CORAA is a much more difficult Portuguese one. It would be interesting to evaluate the author's model on CORAA.
We also don't have a commercial contract here; that's why I integrated Vosk and wav2vec2 later. In summary, this ticket is about evaluating Whisper models using an objective metric on the same data sets we used to evaluate the other models. We can use a more difficult real-world data set, running all models again, if you are able to share the audios and their correct transcriptions validated by humans. If we come to a well-founded conclusion that it is better on different data sets without a huge performance penalty (maybe 2x-3x would be acceptable), I'll add the implementation when I have available time... Of course, contributions to implement it into IPED are very welcome; please send a PR and I'll be happy to test and review.
Thanks, I was aware of the first reference but not the second. But I didn't finish; I will try to normalize numbers and run wav2vec2 with a language model.
Hi @lfcnassif, I did some tests with faster-whisper, using that test script of yours and replacing the contents of the 'Wav2Vec2Process.py' file. I got better transcriptions than wav2vec2, but the performance is worse, about 2x slower. The strange thing is that when configuring the OMP_NUM_THREADS parameter with half of the total logical cores, I got better performance, both locally and on the IPED transcription server. I also managed to get the 'finalscore' on faster-whisper; please check if it is correct.
Below is the result of a small test I did running IPED Server (CPU mode only). Machine: 2 sockets, 24 logical cores (2 Python processes for transcription), OMP_NUM_THREADS = number of threads. 10 audios, 530 seconds, 12 threads (total CPU usage 100%). Perhaps the best configuration of OMP_NUM_THREADS is the physical core count (via psutil).
Now a question: would it be possible to make the wav2vec2 remote server more generic, to also accept faster-whisper through a configuration parameter? I was also able to get faster-whisper to work offline. I modified the script to compute the finalscore (a sketch of the idea is shown below).
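The modified script itself wasn't pasted above. A minimal sketch of the idea, assuming faster-whisper is installed, model files are available locally for offline use, and the confidence score is derived from each segment's avg_logprob:

```python
import math
import os

import psutil

# Set OMP threads to the physical core count (half the logical cores on
# hyper-threaded machines), which gave better performance in the test above.
os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=False))

from faster_whisper import WhisperModel

# local_files_only avoids any network access (offline use).
model = WhisperModel("medium", device="cpu", compute_type="int8",
                     local_files_only=True)

segments, info = model.transcribe("audio.wav", beam_size=5)

texts, probs = [], []
for segment in segments:
    texts.append(segment.text.strip())
    # avg_logprob is the mean token log-probability of the segment;
    # exp() maps it to a rough 0-1 confidence value.
    probs.append(math.exp(segment.avg_logprob))

transcription = " ".join(texts)
final_score = sum(probs) / len(probs) if probs else 0.0
print(transcription)
print(f"finalscore: {final_score:.3f}")
```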
Thank you @gfd2020!
Did you measure WER or use another evaluation metric?
Thank you very much, that is very important!
Sure. That is the goal; the final integration will use a configuration approach.
Unfortunately I did not measure WER; it was just a manual check of the texts obtained.
Seems whisper.cpp has improved a lot since the last time I tested it. Now they have NVIDIA GPU support: https://github.com/ggerganov/whisper.cpp#nvidia-gpu-support It may be worth another try, what do you think @fsicoli?
Tested the speed a few minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. Seems a bit faster than faster-whisper on that GPU.
Is there some snapshot for testing? Or a script we could put into IPED, like the one above?
No, I just did a preliminary test of whisper.cpp directly on a single audio from the command line, without IPED.
I changed the parameter from beam_size=5 to beam_size=1; the performance improved by 35% and the quality was more or less the same.
If it is integrated into IPED, would it be via Java JNA and the DLL?
You mean this? https://github.com/ggerganov/whisper.cpp/blob/master/bindings/java/README.md Possibly. Since directly linked native code may cause application crashes (as I experienced with faster-whisper), there are other options too, like the whisper server: Or a custom server process without the HTTP overhead.
I also fiddled around with several Whisper solutions and ended up with a simple client-server solution. On the one hand, there is an IPED Python task which pushes all audio and video files for further processing to a network share. On the other hand, there is a separate background process which watches those shares, transcribes and translates the media files, and writes back a JSON file with the results. These JSON files are finally parsed by the IPED task and merged into the metadata of the files. This gives you three advantages:
Here are the repositories for the task and the background process:
Maybe you find the solution useful. Greetings, Ronny
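A rough sketch of the client side of the share-based flow described above; the share path, file naming and JSON fields are hypothetical, see the linked repositories for the real implementation.

```python
import json
import shutil
import time
from pathlib import Path

SHARE = Path(r"\\server\transcription-share")  # hypothetical network share

def transcribe_via_share(audio_path, timeout=3600, poll=5):
    """Copy the audio to the share and wait until the background worker
    writes back <name>.json with the transcription result."""
    target = SHARE / Path(audio_path).name
    result_file = target.parent / (target.name + ".json")

    shutil.copy(audio_path, target)

    waited = 0
    while not result_file.exists():
        time.sleep(poll)
        waited += poll
        if waited >= timeout:
            raise TimeoutError(f"No result for {audio_path}")

    with open(result_file, encoding="utf-8") as f:
        return json.load(f)  # e.g. {"text": "...", "language": "pt"}
```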
Thanks @hilderonny for sharing your solution! Which Whisper implementation are you using? Standard whisper, faster-whisper, whisper.cpp, whisper-jax?
I am using faster-whisper because this implementation is also able to separate speakers by splitting up the transcription into parts, and it is a lot faster at processing the media files.
I'm evaluating 3 other Whisper implementations: Whisper.cpp, Insanely-Fast-Whisper and WhisperX. The last 2 are much, much faster for long audios, since they break them into 30s pieces and execute batch inference on many audio segments at the same time, at the cost of higher GPU VRAM usage. Quoting #2165 (comment): speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp, since it can't be set):
Running over the 151 small real-world audios dataset with a total duration of 2758s:
PS: Whisper.cpp seems to parallelize better than the others using multiple processes, so its last number could be improved.
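For reference, a minimal sketch of the batched inference style WhisperX uses, where long audio is split into 30s chunks and decoded in batches; the model size, batch size and file name are assumptions.

```python
import whisperx

device = "cuda"
# batch_size controls how many 30s chunks are decoded at once; larger values
# are faster but need more GPU VRAM, as noted above.
model = whisperx.load_model("medium", device, compute_type="float16")

audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```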
Updating stats with Whisper.cpp (medium, largeV2 & largeV3 models), Faster-Whisper + LargeV3, running time of Whisper models and number of empty transcriptions from all 22,246 audios: Comments:
To finish this evaluation, 2 tasks are needed:
Transcribing the same 442s audio, medium model, int8 precision (except whisper.cpp since it can't be set), but on a 24 threads CPU:
Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription It also supports that our current choice of WhisperX is a good one (I'm just not very happy with WhisperX's dependencies size...)
Can't it be put in as an option? So whoever decides to use WhisperX instead of Faster-Whisper has to download the dependencies pack, such as pytorch and others.
It could, but I don't plan to; package size is the least important aspect for us, in my opinion (otherwise we should use Whisper.cpp, which is very small). WhisperX has similar accuracy, is generally faster and is much faster with long audios; that's more important from my point of view. And keeping 2 different implementations increases the maintenance effort.
Hi @marcus6n. How is the real-world audio transcription data set curation going? It's just 1h of audios; do you think you can finish it today or on Thursday?
Started the evaluation on the 1h real-world non-public audio data set yesterday. Thanks @marcus6n for double-checking the transcriptions and @wladimirleite for sending half of them! Preliminary results below (averages still not updated):
For those interested, Whisper recently published a new large-v3-turbo model. PS: This converted model can be used with faster-whisper: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
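A minimal sketch of loading that CTranslate2 conversion with faster-whisper, assuming the model is fetched from the linked Hugging Face repo and a CUDA GPU with enough VRAM for float16 is available.

```python
from faster_whisper import WhisperModel

# faster-whisper accepts a Hugging Face repo id and downloads the converted model.
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(" ".join(s.text.strip() for s in segments))
```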
Recently made public:
https://openai.com/blog/whisper/
https://github.com/openai/whisper
Interesting, they have some multilingual models that can be used for multiple languages without fine-tuning for each language. They claim their models generalize better than models that need fine-tuning, like wav2vec. Some numbers on the Fleurs dataset (e.g. 4.8% WER on the Portuguese subset):
https://github.com/openai/whisper#available-models-and-languages