Whisper support #5964
base: main
Conversation
Thank you for the PR! @huseinzol05 Currently our infrastructure support for encoder-decoder is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a good idea to hold off on Whisper until the underlying infra is ready. |
PRs for infrastructure are about to land.
We are starting to build models on top of this (starting with BART to keep it simple). Could you build this PR on top of #4942? cc @afeldman-nm |
Seconding what @robertgshaw2-neuralmagic said - #4942 provides the support for encoder attention & cross-attention KV cache which Whisper will need. I am planning to have BART working by EOD today or thereabouts, which can serve as an example of implementing an encoder/decoder model. Hoping to have all tests passing soon. Hopefully you can try building your implementation of Whisper on top of #4942; it would be great to know if you run into any issues. There are also additional changes to support scheduling & adding requests for encoder/decoder models; you can see an example of invoking BART below:
[WIP] BART model invocation example
Some additional example code (links are to files in #4942):
[WIP] BART model implementation
[WIP] BART e2e test: compare output logits against the HuggingFace implementation |
Sure! I will take a look at that branch |
@afeldman-nm , let me solve the trashed outputs first, after that I will upstream to #4942 |
Solved the trashed outputs and added CUDA graph support |
Streaming SRT format:
Screen.Recording.2024-07-02.at.4.04.26.PM.mov
Streaming JSON format:
Screen.Recording.2024-07-02.at.4.05.09.PM.mov |
@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper |
Hi @huseinzol05 this is great, I gave your blog a look. FYI: #4888 took a little longer than expected to land, but it has now landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have landed) were prerequisites for #4942, which completes end-to-end support for encoder/decoder models with the xFormers backend & also introduces the BART model into vLLM. #4942 is still WIP but hoping to complete it soon. |
Nice! Is there anything I can help with? |
Thanks for your efforts Husein, will this implementation support continuous batching? |
Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80G, but it was not successful. The error:
python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
Do you have an update to fix this, or can you provide your Python library versions? Thank you. |
Yes |
Below are my steps to run:
pip3.10 install git+https://github.com/mesolitica/vllm-whisper
python3.10 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100
wget https://github.com/mesolitica/malaya-speech/raw/master/speech/7021-79759-0004.wav
curl -X 'POST' 'http://localhost:8000/audio/transcriptions' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@7021-79759-0004.wav;type=audio/mpeg' \
  -F 'model=whisper' \
  -F 'response_format=json' \
  -F 'stream=true'
Output:
My pip freeze:
|
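For reference, a minimal Python client for the same endpoint (a sketch only; the endpoint path, port, and form fields are taken from the curl command above and not verified against the fork):

import requests

AUDIO = "7021-79759-0004.wav"

# Mirrors the curl request above: multipart upload to the fork's
# OpenAI-style transcription endpoint.
with open(AUDIO, "rb") as f:
    resp = requests.post(
        "http://localhost:8000/audio/transcriptions",
        files={"file": (AUDIO, f, "audio/mpeg")},
        data={
            "model": "whisper",
            "response_format": "json",
            # 'stream': 'true' would return incremental chunks as in the curl
            # example; omitted here so the response is a single JSON body.
        },
    )
resp.raise_for_status()
print(resp.json())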
@huseinzol05 I created a new Conda environment and was able to load and infer. I tried to use whisper_example.py from the examples folder:
|
@huseinzol05 your latest commit brought it to life, working like a charm. I don't know if, taking vLLM into account, it's technically possible to have token/word-level timestamps? |
Hi @huseinzol05 Thanks for the Whisper support and the example. Can you make long audio work smoothly when running the application? |
Long audio has all different kinds of strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated and I am not sure how well it fits with vLLM's other abstractions. I think Whisper for longform has enough exceptions to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and probably replaced in the semi-short term. I don't think that many of Whisper's particularities will be very relevant in the future. @afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. Not sure what is currently implemented, but option 2 below plus always using "<|notimestamps|>" as a forced token is probably vLLM's best fit. I think there are basically 4 options for long-form chunking:
In my (subjective) experience, although I might have dinged up some implementations/evaluations:
|
Thanks @MarktHart, I will incorporate this into the encoder/decoder RFC I am working on. "Basic" encoder/decoder model support should land soon; the RFC covers the significant follow-on work involved in maturing encoder/decoder support to a degree that is commensurate with decoder support (i.e. adding more encoder/decoder models like Whisper, feature compatibility with encoder/decoder, etc.). I have not studied the audio-length problem you are discussing in depth just yet; my guess is it will impact three key parts of the vLLM encoder/decoder model inference process:
Thoughts @MarktHart @huseinzol05 ? |
@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1 s segments, and once 30 s have accumulated it passes them to Whisper to decode. You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py
|
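A hedged sketch of that naive, non-overlapping window (this is not the fork's serving_whisper.py code; the transcribe() call is a hypothetical hand-off to the Whisper engine):

import torch
from torchaudio.io import StreamReader

SAMPLE_RATE = 16000      # Whisper's expected sample rate
CHUNK_SECONDS = 1        # streaming granularity, as described above
WINDOW_SECONDS = 30      # Whisper's 30 s receptive field

def iter_windows(path: str):
    """Yield non-overlapping ~30 s mono waveforms from an audio file."""
    streamer = StreamReader(path)
    streamer.add_basic_audio_stream(
        frames_per_chunk=SAMPLE_RATE * CHUNK_SECONDS,
        sample_rate=SAMPLE_RATE,
    )
    buffer, buffered = [], 0
    for (chunk,) in streamer.stream():
        mono = chunk.mean(dim=1)              # (frames, channels) -> (frames,)
        buffer.append(mono)
        buffered += mono.shape[0]
        if buffered >= WINDOW_SECONDS * SAMPLE_RATE:
            yield torch.cat(buffer)
            buffer, buffered = [], 0
    if buffer:
        yield torch.cat(buffer)               # trailing partial window

# for window in iter_windows("7021-79759-0004.wav"):
#     transcribe(window)                      # hypothetical call into the engine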
Thanks @huseinzol05 I have tried naive chunking; it has good speed but caused a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching. |
When I get free time, I will try to add an overlapping sliding window like the HuggingFace implementation, or VAD |
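For what it's worth, a rough sketch of a HuggingFace-style overlapping window (chunk and stride lengths are illustrative assumptions; merging the overlapping transcripts is left out):

import torch

SAMPLE_RATE = 16000
CHUNK_S = 30     # window fed to Whisper
STRIDE_S = 5     # overlap kept on each side, analogous to HF's stride_length_s

def overlapping_chunks(waveform: torch.Tensor):
    """Yield (start_sample, chunk) pairs that overlap by STRIDE_S seconds on each side."""
    chunk = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - 2 * STRIDE_S) * SAMPLE_RATE   # advance by the non-overlapping core
    total = waveform.shape[-1]
    for start in range(0, total, step):
        yield start, waveform[..., start:start + chunk]
        if start + chunk >= total:
            break

# Each chunk is transcribed independently; the overlapping stride regions are then
# reconciled (e.g. by aligning tokens in the shared region), which is what recovers
# the WER lost by naive non-overlapping chunking.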
Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this? |
Can you please add LoRA support for Whisper? |
If you check the source code, we first run a decoding step to predict the language token; the .en models probably do not have language tokens, which might explain the low quality |
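A small illustration of that prompt difference using HF transformers (an assumption for illustration; this is not the fork's code path):

from transformers import WhisperTokenizer

# Multilingual checkpoint: the decoder prompt carries a language token that is
# predicted (or forced) before transcription starts.
tok = WhisperTokenizer.from_pretrained("openai/whisper-small")
forced = tok.get_decoder_prompt_ids(language="en", task="transcribe")
print(tok.convert_ids_to_tokens([tid for _, tid in forced]))
# -> ['<|en|>', '<|transcribe|>', '<|notimestamps|>']

# English-only (*.en) checkpoints have no <|xx|> language tokens to predict, so a
# language-detection pass tuned for multilingual checkpoints may not transfer to them.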
Hello @huseinzol05 I successfully started your fork on my A100 80GB GPU. Are you sure about continuous batching? |
@Temirulan messed up in what sense? Totally gibberish? |
Hey, just checking in! Do you have any updates on the status of this pull request? Curious when it might be ready to merge. 😊 |
Please share the inference code showing how you do the concurrency. |
Any update on this? Seems promising. |
I hope this gets added soon... |
@huseinzol05 I guess a lot of work can be delayed or removed completely by limiting the input length to 30s and letting users decide how they want to chunk files longer than that, or at least leaving it for another PR after the main engine itself has been tested by users. |
Sorry, I do not have the capacity to develop this for now. Feel free to continue this fork to add VAD or overlapping; it should be quick. |
Initial support for Whisper: able to load and infer, but the outputs are trashed. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. Might be bugs related to weights or attention. A few hiccups:
Currently uses xops.memory_efficient_attention_forward like the T5 branch; this is not ideal because vLLM has its own attention backend.
FIX #180
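For context, a minimal sketch of the kind of direct xFormers call referred to above (shapes are illustrative whisper-large encoder dimensions, not taken from this PR):

import torch
import xformers.ops as xops

batch, seq_len, num_heads, head_dim = 1, 1500, 20, 64   # ~whisper-large encoder
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Forward-only memory-efficient attention: no KV cache and no paged attention,
# which is why routing this through vLLM's own attention backend is preferable.
out = xops.memory_efficient_attention_forward(q, k, v, scale=head_dim ** -0.5)
print(out.shape)   # (batch, seq_len, num_heads, head_dim)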