#1823 whisper transcription #2165
Conversation
I think this is finished. @marcus6n, I would very much appreciate it if you could test this on Monday, thank you.
@lfcnassif Yes, I can test it!
@lfcnassif I've run the tests and everything is working properly.
I was waiting for this PR. Thank you. I will test this PR with GPU CUDA. @lfcnassif, a suggestion. Another thing, does this PR also close issue #1335?
Hi @gfd2020! Additional tests will be very welcome!
I took the final score computation from your previous code suggestion, thank you! Good to know, we can replace the function, but I think the time difference will not be noticeable.
No, I'll keep it open, since I didn't finish all my planned tests. I'm integrating this because some users asked for it. Besides Whisper.cpp, which improved a lot in the last months and added full CUDA support, I also found WhisperX (which uses Faster-Whisper under the hood) and Insanely-Fast-Whisper. Those last 2 libs break long audios into 30s parts and execute batch inference on the audio segments simultaneously, resulting in up to 10x speed-up because of batching, at the cost of increased GPU memory usage. I did a quick test with them and they are really, really fast for long audios indeed! But their approach can decrease the final accuracy, since the default Whisper algorithm uses previously transcribed tokens to help transcribe the next ones; AFAIK, those libraries break the audio into parts and the transcription is done independently on the 30s audio segments. As I haven't measured WER for those libraries yet, I'm concerned about integrating them. If they could accept many different audios as input and transcribe them using batch inference instead of breaking the audios, that would be a safer approach. But that would require more work on our side: grouping audios of similar duration before transcription, deciding whether or not to wait to group audios, signaling the last audio, etc.
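For reference, a minimal sketch of the batched approach those libraries use, assuming the standard WhisperX Python API (model size, batch_size and file name are just illustrative values):

```python
import whisperx

device = "cuda"
# Load the Faster-Whisper backend through WhisperX.
model = whisperx.load_model("medium", device, compute_type="float16")

# WhisperX decodes the file (via FFmpeg), splits it into ~30s segments
# and transcribes the segments together in batches, which is where the
# speed-up on long audios comes from.
audio = whisperx.load_audio("long_audio.wav")
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```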
Using float16 precision instead of int8 gave almost a 50% speed-up on RTX 3090.
On CPU too?
Possibly not, I'll check and report back. |
@gfd2020 thanks for asking about the effect of float16 on CPU. Actually it doesn't work on CPU at all; I just pushed a commit fixing it. About float32 vs int8 speed on CPU, testing with ~160 audios on a 48-thread CPU, medium Whisper model:
Speed numbers of other implementations over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except for whisper.cpp, where the precision couldn't be set):
Running over the dataset of 160 real-world small audios above (total duration of 2758s):
PS: Whisper.cpp seems to parallelize better than others using multiple processes, so its last number could be improved.
Hi @lfcnassif, I don't have a very powerful GPU, but it has tensor cores, and float16 gave an error. So I changed it to float32 and it gave another error; finally, I changed it to int8 and it worked fine on the GPU. So, I have two suggestions:
I'm still doing other tests.
Thanks for testing, @gfd2020! Both are good suggestions; I was already planning to externalize the compute_type (precision) parameter, and also the batch_size if we switch to WhisperX. I'm running accuracy tests and should post the results soon. About float16 not being supported, which CUDA Toolkit version do you have installed?
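Not the actual PR code, just a rough sketch of how the suggested behavior could look: read the precision from a config option and fall back to int8 when the device can't handle it (function and option names are illustrative):

```python
from faster_whisper import WhisperModel

def load_model(model_name, device, compute_type):
    try:
        return WhisperModel(model_name, device=device, compute_type=compute_type)
    except ValueError:
        # ctranslate2 raises ValueError when the requested compute type
        # (e.g. float16) is not supported by the device; retry with int8.
        print(f"{compute_type} not supported on {device}, falling back to int8")
        return WhisperModel(model_name, device=device, compute_type="int8")

# compute_type would come from an externalized config parameter.
model = load_model("medium", device="cuda", compute_type="float16")
```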
NVIDIA CUDA 11.7.99 driver on a Quadro P620. This was the only version that I managed to make work on these weaker GPUs (Quadro P620 and T400).
Couldn't you put ffmpeg.exe in the iped tools folder? Is the problem putting it on the path?
Ok, just to let you know about them.
It's possible, but on #1267 @wladimirleite did a good job removing ffmpeg as a dependency, since we already use mplayer for video-related stuff...
Thanks!
My fault, I tested again in the VM and WhisperX returns an error without FFmpeg. I just added an explicit check and a better error message for the user if it is not found.
Is there no way to modify the Python code to search for ffmpeg in a relative path within iped? |
We can set the PATH env var of the main IPED process from the startup process and point it to an embedded ffmpeg. But I'm not sure if we should embed ffmpeg, and actually I'm thinking about offering both faster-whisper and whisperx, as suggested by @rafael844, because faster-whisper doesn't have the ffmpeg dependency, while whisperx has many dependencies that may cause conflicts with other modules (now or in the future).
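Just to illustrate the PATH idea on the Python side, a minimal sketch assuming a hypothetical embedded ffmpeg folder inside the IPED installation (the folder layout and the IPED_ROOT variable are assumptions, not the actual structure):

```python
import os

# Hypothetical location of an embedded ffmpeg inside the IPED folder.
iped_root = os.getenv("IPED_ROOT", ".")
ffmpeg_dir = os.path.join(iped_root, "tools", "ffmpeg")

# Prepend it to PATH before the transcription process is started,
# so whisperx can find the ffmpeg executable.
os.environ["PATH"] = ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")
```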
Can I write a small step-by-step guide to install the requirements on the GPU? I had to make some modifications to the code to be able to use it in an environment without an internet connection and point it to a local model. So the modelName parameter accepts a model name, a relative path (inside the iped folder) and an absolute path. Examples:
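The original example values were not preserved here; below is a hedged sketch of how the three accepted forms could be resolved (the paths and the helper name are illustrative, not the actual modification):

```python
import os

def resolve_model(model_name, iped_root):
    # "C:\\models\\faster-whisper-medium" -> absolute path, used as-is
    if os.path.isabs(model_name):
        return model_name
    # "models/faster-whisper-medium" -> path relative to the iped folder
    relative = os.path.join(iped_root, model_name)
    if os.path.exists(relative):
        return relative
    # "medium" -> plain model name, resolved/downloaded by the library itself
    return model_name
```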
If it is independent of the user environment or hardware, for sure! The wiki is publicly editable. Maybe the above code won't work if IPED is executed from outside its folder. For that, we use
Without the code above, does it need to be connected to the Internet all the time, or just on the first run to download the models?
Windows only, any graphics card.
Thanks. I'll try.
Just the first run. But my idea is to create a customized IPED package with the models. This way, you would just install this package without internet access. |
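One possible way to pre-download the models for such an offline package, assuming they are fetched from the Hugging Face hub (the repo id and target folder below are assumptions):

```python
from huggingface_hub import snapshot_download

# Download once on a machine with internet access, then ship the folder
# inside the customized IPED package and point modelName to it.
snapshot_download(repo_id="Systran/faster-whisper-medium",
                  local_dir="models/faster-whisper-medium")
```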
That would be totally enough, thank you @gfd2020 for trying to improve the manual!
@gfd2020, out of curiosity, have you played with the batchSize parameter?
Not yet, thanks for reminding me.
Hi @lfcnassif, I did some tests with the batchSize values. Regarding the speed-up, I didn't notice a big difference, but I still have to test it with a larger case with several audios; I'll do those tests later. Discrete card: NVIDIA Quadro P620 - 2GB VRAM
I just tried the code and it does not work: "No module named 'java'" on "from java.lang import System". @lfcnassif, is there something I didn't do right?
It should make a difference just with audios longer than 30s, the longer the better.
Sorry, my mistake, that works only inside Python tasks; the current Python code runs as a separate, independent Python process, so it won't see Java classes or objects.
About the wiki part below:
cd IPED_ROOT/python
I did it a little differently, so I didn't need to set the path or interfere with another installed Python: go to the standalone IPED Python folder and install the packages there, for example as sketched below. @lfcnassif, what do you think?
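A minimal sketch of what those install commands could look like, assuming the embedded IPED Python ships with pip and using the packages discussed in this PR (the exact package set may differ):

```
cd IPED_ROOT/python
python.exe -m pip install faster-whisper
python.exe -m pip install whisperx
```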
It's better! I also thought about changing it in the past, exactly to avoid mixing with an env-installed Python; those warnings never caused issues for me either.
@wladimirleite, what do you think about embedding ffmpeg? In the long run, we should stay with WhisperX, since we should be able to parallelize the transcription of small audios on the GPU with an improved version of it.
I think it is perfectly fine!
We were using it to break WAV audios on 60s boundaries; that was not possible with mplayer, but you came up with a 100% Java solution for that usage.
You are right, I completely forgot about that :-)
Just pushed changes to support both whisperx and faster_whisper, as @rafael844 suggested. Most users won't benefit from whisperx, since it needs a GPU with good VRAM to speed up transcribing long audios. For CPU users, faster_whisper is enough; it doesn't need FFmpeg and it is much smaller. Thanks @gfd2020 and @marcus6n for testing this! If you find any issues with my last commits, please let me know.
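A rough sketch, not the actual PR code, of how supporting both backends behind a single switch could look (the option name and default values are illustrative):

```python
def create_model(implementation, model_name, device, compute_type):
    # "implementation" would come from a config option; names are illustrative.
    if implementation == "whisperx":
        import whisperx
        return whisperx.load_model(model_name, device, compute_type=compute_type)
    else:
        from faster_whisper import WhisperModel
        return WhisperModel(model_name, device=device, compute_type=compute_type)

model = create_model("faster_whisper", "medium", device="cpu", compute_type="int8")
```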
When finished, this will close #1823.
Already tested on CPU. I still need to test on GPU, test the remote service and verify Wav2Vec2 backwards compatibility.