Limit realtime transcription time #117
I can't introduce a fixed max duration because that could cut into words and result in wrong/imprecise transcriptions. Please set the post_speech_silence_duration parameter to a lower value like 0.1, 0.2 or 0.3 to make it detect sentences or sentence fragments faster.
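A minimal sketch of that suggestion, assuming the AudioToTextRecorder constructor accepts the post_speech_silence_duration keyword as discussed here and the text() loop pattern from the project README; the value 0.2 is illustrative, not a verified default:

```python
# Hedged sketch: lower post_speech_silence_duration so end-of-sentence
# is detected sooner.
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        post_speech_silence_duration=0.2,  # try 0.1, 0.2 or 0.3
    )
    while True:
        # Blocks until a full sentence is detected, then hands it to print.
        recorder.text(print)
```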
Realtime transcription is working pretty fast, but I have issues with the time for fullSentence recognition. python -c "import torch; print(torch.cuda.is_available())" returns True for me, and the CUDA version is 12.4. Maybe I'm missing something and need to check something else? Thanks
It would be helpful to see your AudioToTextRecorder constructor parameters. You can change the model parameter to a smaller model like "medium" (maybe try a distil model from Systran) and/or reduce the beam size to make it faster. You could also try raising the realtime_processing_pause parameter, or try use_main_model_for_realtime to work with a single model only.
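A sketch of constructor settings along those lines; the parameter names come from this thread, but the exact values are illustrative assumptions:

```python
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    model="medium",                  # smaller main model, or a Systran distil model
    realtime_model_type="tiny.en",   # lightweight model for the realtime preview
    realtime_processing_pause=0.5,   # transcribe the realtime buffer less often
    beam_size=3,                     # smaller beam -> faster final transcription
    # use_main_model_for_realtime=True,  # alternative: run everything on one model
)
```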
```
'spinner': False,
```
Using large-v2 as realtime_model_type could be the problem; this is a big model for realtime transcription. With realtime_processing_pause = 0, the realtime model large-v2 transcribes nonstop and could consume too many GPU resources.
No, it would only use one single model then. Right now it loads large-v2 twice and does the processing in parallel, which is a difference. I recommend trying the above settings ('realtime_model_type': 'tiny.en', 'realtime_processing_pause': 0.5) without 'use_main_model_for_realtime': True first.
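Applied to a recorder config dict like the fragment quoted above, the suggestion would look roughly like this (entries besides the two recommended ones are placeholders, not the user's actual values):

```python
recorder_config = {
    'spinner': False,
    'model': 'large-v2',                  # main model for the final transcription
    'realtime_model_type': 'tiny.en',     # small model for the realtime preview
    'realtime_processing_pause': 0.5,     # avoid transcribing the buffer nonstop
    # 'use_main_model_for_realtime': True,  # only try this if the above doesn't help
}
```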
I have tried that and it looks like nothing changed. For example, here is the sentence "Transcribe one of those work meetings that you missed, or copy-paste the link to a YouTube documentary you're curious about. See what comes out!" I'm receiving the fullSentence event only after almost 3 seconds, which is too much. Are you sure that it's possible to receive a fullSentence transcription with large-v2 in a second?
I am absolutely sure, I'm under 300 milliseconds here for a large-v2 transcription. I'd recommend testing the faster_whisper library directly to find the issue; I feel this might be beyond the scope of RealtimeSTT.
Can you suggest how to debug this, please? I'm not a Python developer, and this application is running on GPU instances because my Mac doesn't have NVIDIA, so it's pretty tough for me to test; locally it would be slow. I'm using the browser client example. Also, large-v2 realtime transcription works ok; fullSentence requires 1-2 seconds after it to finalize the response. P.S. Can you also share your system specs? Maybe I'll try to set up the same one in the cloud.
That realtime transcription with large-v2 works indicates that the general transcription time seems to be fast enough. Hard to tell where the problem is in this setup. Please change server.py and add import logging and logging.basicConfig(level=logging.DEBUG) as the first lines. Also please add 'level': logging.DEBUG, to the recorder config. This should give you extended logging on the server for more insight. You might also want to hook into the on_recording_start, on_recording_stop and on_transcription_start callbacks and log your own timestamps, so you can see where the time is spent.
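A sketch of that debug setup for server.py; the callback names are the ones mentioned in this comment, and the timestamp logging is just one way to see where the time goes:

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG)  # must come before RealtimeSTT is used

from RealtimeSTT import AudioToTextRecorder

def stamp(label):
    # Print wall-clock timestamps so the gap between recording stop and
    # transcription start becomes visible next to the debug log.
    print(f"{label}: {time.time():.3f}")

recorder_config = {
    'level': logging.DEBUG,
    'on_recording_start': lambda *args: stamp("recording start"),
    'on_recording_stop': lambda *args: stamp("recording stop"),
    'on_transcription_start': lambda *args: stamp("transcription start"),
}

recorder = AudioToTextRecorder(**recorder_config)
```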
2024-09-25.20-25-08.mov: here is an example of how it's working, and it's much slower than in your example. I have 2 GPUs with 15 GB each.
Ok, that looks like it says cuda.is_available(): True but still does not use the GPU.
Are you sure that it's not using a GPU? I think that for large-v2 the CPU would be much slower, no?
I would say that if it used the GPU, it should be much faster. Verify whether GPU load goes up while transcribing and VRAM usage goes up while loading the model. It might be that you need to use ROCm for the pytorch install (--index-url https://download.pytorch.org/whl/rocm5.6 and +rocm instead of +cu121), but I have no Mac, so I don't know for sure...
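One hedged way to check this on an NVIDIA machine: query nvidia-smi before and after loading the model (faster-whisper's CTranslate2 memory does not show up in torch's allocator, so nvidia-smi is the more reliable signal):

```python
import subprocess

def used_vram_mib():
    # VRAM in use on the first GPU, in MiB; assumes nvidia-smi is on PATH.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0])

print("VRAM used before load:", used_vram_mib(), "MiB")

from RealtimeSTT import AudioToTextRecorder
recorder = AudioToTextRecorder(model="large-v2")

print("VRAM used after load:", used_vram_mib(), "MiB")
# If the number barely changes, the model most likely ended up on the CPU.
```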
Here is the debug console. It looks like there is some issue with ffmpeg, but it's installed on the machine. Can it affect performance? Also, nvidia-smi shows GPU usage after launching server.py. By the way, I'm testing on Debian remotely.
2024-09-25.20-53-24.mov: here is live usage. Volatile GPU-Util goes to 90-100% per transcription, so is that normal?
The ffmpeg messages are totally normal; this does not affect performance. Most probably the Tesla T4 just doesn't transcribe faster than this. Here is another user with this GPU who says:
Try using a distil model as I already suggested. Just by changing "large-v2" to "distil-large-v2" you should be able to get a 2x faster transcription, and if you can live with the transcription quality of "distil-medium.en" you might even get 4x or 6x speed. Maybe fine-tune the beam size a bit: if the transcription quality of distil-medium.en is not good enough, increasing the beam size slightly can help while not adding too much latency.
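A sketch of that swap; only the model name and beam size change relative to the earlier config, and the values here are examples rather than tested recommendations:

```python
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    model="distil-large-v2",     # roughly 2x faster than large-v2
    # model="distil-medium.en",  # faster still, lower transcription quality
    beam_size=5,                 # nudge upward if distil quality is not good enough
)
```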
Thanks for the advice. distil-large-v2 is not as accurate as large-v2, so I left it. I have swapped my T4 for an NVIDIA V100 and it's faster by 1-1.5 seconds; now that sentence is produced in 1.5-1.9 seconds. It seems that to get your results one needs to pay a lot for a super powerful machine. If I add an additional GPU and pass a devices array like [0,1], will I get better results, or does it not matter?
Thank you for providing the videos, that helped a lot in getting an idea of what was going on. I was using an RTX 2080 for a long time; the video on the front page of this repo shows the performance with this card, so you don't need to put in the money for a 4090 or so. But I can't tell which card works best for your money regarding faster whisper. Maybe you want to try some other CTranslate2 models like distil-large-v3 or something else. Additional GPUs passed in like [0,1] will not speed up a single transcription; this is only used to parallelize multiple transcriptions.
I'm stuck with this and getting frustrated. I've created instances on vast.ai with an RTX 3090 and an RTX 4090 and got the same result, 2-3 seconds for short sentences...
C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests>python simple_test.py
System: RTX 4090, Windows 11, Python 3.10.9, CUDA 12.1, cuDNN v8.9.7
I have no idea why it's like that. I've tried a lot of instances and cloud providers, and the result is always the same. The only difference is that you are running it locally while I'm in the cloud, passing data via websockets. Also, can you check what happened after your last commit?
Will look into that, but not today anymore, it's late here. Just use "pip install RealtimeSTT==0.2.41" for now if the current version makes problems, please. I recommend doing some tests with faster_whisper itself; I think you need to start measuring and find out what your transcription times with large-v2 really are.
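A minimal timing test against faster_whisper directly, as suggested; "audio.wav" is a placeholder file, and device/compute_type need to match your setup:

```python
import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

start = time.time()
segments, info = model.transcribe("audio.wav", beam_size=5)
# transcribe() returns a generator; iterating it is what actually runs the model.
text = " ".join(segment.text for segment in segments)
print(f"Transcribed in {time.time() - start:.2f}s: {text.strip()}")
```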
Tons of stuff under the hood, basically. Can you post a full debug log with import logging and logging.basicConfig(level=logging.DEBUG) as the first lines, and AudioToTextRecorder called with level=logging.DEBUG? We might see what goes wrong then.
Tested the new version, I can start server.py here and also see the "RealtimeSTT initialized" message:

```
(venv) C:\Dev\Audio\RealtimeSTT\RealtimeSTT\example_browserclient>python server.py
Starting server, please wait...
Initializing RealtimeSTT...
RealtimeSTT initialized
Server started. Press Ctrl+C to stop the server.
Client connected
Sentence: Hey there, this is a little test.
```

Please start the server with the full log and paste it here, so hopefully we can see better what's going wrong then.
The new version on Linux doesn't work as expected. I used 0.2.41 as you suggested and it's working fine. Will provide logs a little bit later.
2024-09-27.13-01-13.mov: here are the logs with the new version. I also don't see the warnings regarding ffmpeg anymore. 2024-09-27.12-50-28.mov: above is a video of the logs without websockets. So does it look like faster whisper works slower than it should?
Hm, which one is the video where it does not work? Because in both videos it looks as if it works in the logging. Here is a test of the browserclient on my system for comparison (sorry, my voice is very loud), so you can see how fast this should be in theory: Browserclient.Test.mp4
I don't think it's a faster whisper issue, but this is so hard to tell without knowing everything about your hardware and environment, and even then it might be difficult. You need to measure the transcription time to be sure. Maybe the browser client sends huge chunks, like recording over a longer time and then sending the chunk, idk. I'm no web developer.
The first video. There is no "RealtimeSTT initialized" in the console and also no printing of sentences; check the end of the video. The client receives the data, but the behavior of the server is weird after the new push. Regarding speed, your transcription also takes more than 1 second for short sentences. Mine is ~1.8, and in the second video's logs you can see that the faster whisper transcription takes 1.5 seconds for 'Hello' or 'How are you', which is far from ideal, unfortunately.
The server log indicates that everything is okay here. Maybe the stdout routing that transfers log messages from the spawned processes back to the main process somehow fails on your system. Hard to test here, I only have a Windows system. But it feels like it works in general but just doesn't print the results anymore.
I don't really see a reason why this would fail, but maybe you can try uncommenting the stdout thread start in the AudioToTextRecorder to see if that helps:

Please also remove or uncomment this in _transcription_worker; I feel this could be what messes up your main prints:
I would agree with the original comment that processing all of it introduces delay as the session time increases. Note: the logs do not show the complete section; it actually started at 00:00.000. I cannot retrieve the complete section or rerun another test at this time.
After a minute or two of new audio, the previous audio is irrelevant and does not need to be processed again each time. Even though transcription is quite fast on my machine, this would cause huge delays in long-running sessions. Main model: medium.
Slightly related issue (might open a new issue for this): if I use the main model for realtime transcription (small, medium or even tiny), it does not transcribe. When not using the main model for realtime, the text received is always 'realtime', never 'fullSentence'.
Currently, if realtime transcription continues for a long time, the sentence grows huge, so processing it takes a lot of time even when using CUDA. Is it possible to set something like maximum_audio_duration etc. to process smaller chunks of audio instead of all of it?