It's not practical to cap Whisper's RAM usage: the model needs what it needs. I'd also strongly advise against running Whisper on CPU; it isn't designed for that, and you won't get the speed you want. You really need a GPU. For your use case, honestly, I wouldn't recommend self-hosting Whisper at all. You'll probably get better results, more cheaply and easily, from Google Gemini 1.5 Flash with good prompting. I'd suggest prototyping with Zapier and trying different models through their APIs: the OpenAI Whisper API, the Google Gemini API, or the Deepgram API. Also note that Whisper transcribes audio only; it ignores the video track. In my humble opinion, you're much better off using an API than running the model yourself. If you really want to run Whisper on your own hardware, buy a Mac mini from Costco, set up whisper.cpp, and run it from there.
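If you go the API route, the upload step could look something like the sketch below, using the official openai Python package. The helper name and the hard-coded model/format choices are illustrative assumptions, not part of this discussion; check the provider's docs for current model names.

```python
def build_transcription_kwargs(language="en"):
    """Keyword arguments for a hosted transcription request.

    "whisper-1" and the WebVTT response format are assumptions for
    this sketch.
    """
    return {
        "model": "whisper-1",       # hosted Whisper model
        "response_format": "vtt",   # WebVTT closed-caption output
        "language": language,
    }

# Usage (needs network access and an OPENAI_API_KEY, so not run here):
# from openai import OpenAI
# client = OpenAI()
# with open("test.mp4", "rb") as f:
#     vtt = client.audio.transcriptions.create(file=f, **build_transcription_kwargs())
# print(vtt)
```

Requesting "vtt" directly means the API hands back a ready-to-serve caption file, which fits the captioning use case described below.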
Greetings,
I am new to Whisper and have a few questions. I am testing it on a few small files right now. I intend to build a web server (Ubuntu) where users can upload videos, and the server will use Whisper to create closed-caption files for people who are hard of hearing. Note that this web server has no GPU, only vCPUs, which may affect Whisper's speed and which command parameters make sense.
Here is my current Whisper command:
whisper test.mp4 --model medium --language en --fp16 False --output_format vtt
Question 1:
I am running Whisper on a small 4-minute MP4 video, 7 MB in size. I first ran it with "--model tiny", which was on track to take 30-60 minutes (I interrupted the process after about 20 minutes; it was very slow and I noticed inaccuracies). The server was using 500 MB of RAM. I then tried "--model large-v2", but RAM usage jumped to 5 GB, which is far too much, so I had to kill the process. Finally I ran "--model medium": the server is up to 3.8 GB of RAM, which could be doable, but it is taking an extremely long time; after half an hour it has transcribed only two lines of text.
Firstly, is there a way to put a cap on RAM usage, to prevent my web server from crashing, especially if someone uploads a 1 GB video in the future?
Question 2:
How much better would Whisper run if I used FFmpeg to extract the audio (e.g. as MP3) from the MP4 first, and then ran Whisper on just the audio? Or does it not matter whether I use the MP4 file as the input?
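For reference, the extraction step described above is a one-line FFmpeg job; downmixing to 16 kHz mono matches the format Whisper resamples to internally anyway. A sketch that just builds the command (the helper name is mine):

```python
import subprocess

def ffmpeg_extract_cmd(src, dst):
    """FFmpeg command that drops the video stream (-vn) and resamples
    to 16 kHz mono (-ac 1 -ar 16000), the format Whisper decodes to
    internally."""
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]

# Usage:
# subprocess.run(ffmpeg_extract_cmd("test.mp4", "test.wav"), check=True)
# then: whisper test.wav --model medium --language en --fp16 False --output_format vtt
```

Extracting first mainly saves disk I/O and lets the server discard the video track early; Whisper itself decodes via FFmpeg either way.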
Question 3:
How much more accurate is the "large-v2" model than, say, "medium" or "small"? I don't need perfect accuracy. If "large-v2" is 99.99% accurate, would "medium" be around 99.90% accurate, or much lower, like 95%?
Question 4:
The website will allow users to post general videos of all sorts, as they would on Facebook or Twitter. The closed captions aren't required to be 100% perfect, but they should be about as accurate as YouTube's. What model should I use to get that level of accuracy? I need a practical strategy that will work in the real world without crashing my server through excessive RAM usage, and it shouldn't take hours to transcribe a single 10-minute video. I will run Whisper from a cronjob that iterates through new video uploads.
What parameter options/command would you use to cap Whisper's RAM usage at, say, 2 GB (or 4 GB), and get a 10-minute video transcribed in under 15 minutes with reasonable accuracy?