Whisper support #5964
base: main
Conversation
Thank you for the PR! @huseinzol05 Currently our infrastructure support for encoder-decoder is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a good idea to hold off on Whisper until the underlying infra is ready. |
PRs for infrastructure are about to land.
We are starting to build models on top of this (starting with BART to keep it simple). Could you build this PR on top of #4942? cc @afeldman-nm |
Seconding what @robertgshaw2-neuralmagic said - #4942 provides the support for encoder attention & cross-attention KV cache which Whisper will need. I am planning to have BART working by EOD today or thereabouts, which can serve as an example of implementing an encoder/decoder model. Hoping to have all tests passing soon. Hopefully you can try building your implementation of Whisper on top of #4942; it would be great to know if you run into any issues. There are also additional changes to support scheduling & adding requests for encoder/decoder models; you can see an example of invoking BART below:
[WIP] BART model invocation example
Some additional example code (links are to files in #4942):
[WIP] BART model implementation
[WIP] BART e2e test: compare output logits against the HuggingFace implementation |
Sure! I will take a look at that branch |
@afeldman-nm , let me solve the trashed outputs first, after that I will upstream to #4942 |
Solved the trashed outputs and added CUDA graph support |
Streaming SRT format:
Screen.Recording.2024-07-02.at.4.04.26.PM.mov
Streaming JSON format:
Screen.Recording.2024-07-02.at.4.05.09.PM.mov |
@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper |
Hi @huseinzol05 this is great, I gave your blog a look. FYI: #4888 took a little longer than expected to land, but it has now landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have landed) were prerequisites for #4942, which completes end-to-end support for encoder/decoder models with the xFormers backend & also introduces the BART model into vLLM. #4942 is still WIP but hoping to complete it soon. |
Nice! Is there anything I can help with? |
Thanks for your efforts Husein, will this implementation support continuous batching? |
Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80G, but it was not successful. The error:
python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
Do you have an update to fix this, or can you provide your Python library versions? Thank you. |
Yes |
Below are my steps to run:
pip3.10 install git+https://github.com/mesolitica/vllm-whisper
python3.10 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100
wget https://github.com/mesolitica/malaya-speech/raw/master/speech/7021-79759-0004.wav
curl -X 'POST' 'http://localhost:8000/audio/transcriptions' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@7021-79759-0004.wav;type=audio/mpeg' \
  -F 'model=whisper' \
  -F 'response_format=json' \
  -F 'stream=true'
Output:
My pip freeze:
|
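For reference, a minimal Python client for the same endpoint (a sketch only; the endpoint path, port, and form fields are taken from the curl command above and not verified against the fork):

import requests

AUDIO = "7021-79759-0004.wav"

# Mirrors the curl request above: multipart upload to the fork's
# OpenAI-style transcription endpoint.
with open(AUDIO, "rb") as f:
    resp = requests.post(
        "http://localhost:8000/audio/transcriptions",
        files={"file": (AUDIO, f, "audio/mpeg")},
        data={
            "model": "whisper",
            "response_format": "json",
            # 'stream': 'true' would return incremental chunks as in the curl
            # example; omitted here so the response is a single JSON body.
        },
    )
resp.raise_for_status()
print(resp.json())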
@huseinzol05 I created a new Conda environment and was able to load and infer. I tried to use whisper_example.py from the examples folder:
|
@huseinzol05 your latest commit brought it to life, working like a charm. I don't know if, taking vLLM into account, it's technically possible to have token/word-level timestamps? |
Hi @huseinzol05 Thanks for the Whisper support and the example. Can you make long audio work smoothly when running the application? |
Long audio has all different kinds of strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated and I am not sure how well it fits with vLLM's other abstractions. I think Whisper for longform has enough exceptions to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and probably replaced in the semi-short term. I don't think that many of Whisper's particularities will be very relevant in the future. @afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. Not sure what is currently implemented, but option 2 below plus always using "<|notimestamps|>" as a forced token is probably vLLM's best fit. I think there are basically 4 options for long-form chunking:
In my (subjective) experience, although I might have dinged up some implementations/evaluations:
|
Thanks @MarktHart, I will incorporate this into the encoder/decoder RFC I am working on. "Basic" encoder/decoder model support should land soon; the RFC covers the significant follow-on work involved in maturing encoder/decoder support to a degree that is commensurate with decoder support (i.e. adding more encoder/decoder models like Whisper, feature compatibility with encoder/decoder, etc.). I have not studied the audio-length problem you are discussing in depth just yet; my guess is it will impact three key parts of the vLLM encoder/decoder model inference process:
Thoughts @MarktHart @huseinzol05 ? |
@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1 s segments, and once 30 s have accumulated it passes them to Whisper to decode. You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py
|
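A hedged sketch of that naive, non-overlapping window (this is not the fork's serving_whisper.py code; the transcribe() call is a hypothetical hand-off to the Whisper engine):

import torch
from torchaudio.io import StreamReader

SAMPLE_RATE = 16000      # Whisper's expected sample rate
CHUNK_SECONDS = 1        # streaming granularity, as described above
WINDOW_SECONDS = 30      # Whisper's 30 s receptive field

def iter_windows(path: str):
    """Yield non-overlapping ~30 s mono waveforms from an audio file."""
    streamer = StreamReader(path)
    streamer.add_basic_audio_stream(
        frames_per_chunk=SAMPLE_RATE * CHUNK_SECONDS,
        sample_rate=SAMPLE_RATE,
    )
    buffer, buffered = [], 0
    for (chunk,) in streamer.stream():
        mono = chunk.mean(dim=1)              # (frames, channels) -> (frames,)
        buffer.append(mono)
        buffered += mono.shape[0]
        if buffered >= WINDOW_SECONDS * SAMPLE_RATE:
            yield torch.cat(buffer)
            buffer, buffered = [], 0
    if buffer:
        yield torch.cat(buffer)               # trailing partial window

# for window in iter_windows("7021-79759-0004.wav"):
#     transcribe(window)                      # hypothetical call into the engine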
Thanks @huseinzol05 I have tried naive chunking; it has good speed but caused a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching. |
When I get free time, I will try to add an overlapping sliding window like the HuggingFace implementation, or VAD |
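For what it's worth, a rough sketch of a HuggingFace-style overlapping window (chunk and stride lengths are illustrative assumptions; merging the overlapping transcripts is left out):

import torch

SAMPLE_RATE = 16000
CHUNK_S = 30     # window fed to Whisper
STRIDE_S = 5     # overlap kept on each side, analogous to HF's stride_length_s

def overlapping_chunks(waveform: torch.Tensor):
    """Yield (start_sample, chunk) pairs that overlap by STRIDE_S seconds on each side."""
    chunk = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - 2 * STRIDE_S) * SAMPLE_RATE   # advance by the non-overlapping core
    total = waveform.shape[-1]
    for start in range(0, total, step):
        yield start, waveform[..., start:start + chunk]
        if start + chunk >= total:
            break

# Each chunk is transcribed independently; the overlapping stride regions are then
# reconciled (e.g. by aligning tokens in the shared region), which is what recovers
# the WER lost by naive non-overlapping chunking.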
Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this? |
Can you please add LoRA support for Whisper? |
If you check the source code, we first run a decoding step to predict the language token; the .en models probably do not have language tokens, which might explain the low quality |
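A small illustration of that prompt difference using HF transformers (an assumption for illustration; this is not the fork's code path):

from transformers import WhisperTokenizer

# Multilingual checkpoint: the decoder prompt carries a language token that is
# predicted (or forced) before transcription starts.
tok = WhisperTokenizer.from_pretrained("openai/whisper-small")
forced = tok.get_decoder_prompt_ids(language="en", task="transcribe")
print(tok.convert_ids_to_tokens([tid for _, tid in forced]))
# -> ['<|en|>', '<|transcribe|>', '<|notimestamps|>']

# English-only (*.en) checkpoints have no <|xx|> language tokens to predict, so a
# language-detection pass tuned for multilingual checkpoints may not transfer to them.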
Hello @huseinzol05 I successfully started your fork on my A100 80GB GPU. Are you sure about continuous batching? |
@Temirulan messed up in what sense? Totally gibberish? |
Hey, just checking in! Do you have any updates on the status of this pull request? Curious when it might be ready to merge. 😊 |
Please share the inference code showing how you do the concurrency. |
Any update on this? Seems promising. |
I hope this gets added soon... |
@huseinzol05 I guess a lot of work can be delayed or removed completely by limiting the input length to 30s and letting users decide how they want to chunk files longer than that, or at least leaving it for another PR after the main engine itself has been tested by users. |
Sorry, I do not have the capacity to develop this for now. Feel free to continue this fork to add VAD or overlapping; it should be quick. |
Initial support for Whisper: able to load and infer, but the outputs are trashed. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. Might be bugs related to weights or attention. A few hiccups:
Currently uses xops.memory_efficient_attention_forward like the T5 branch; this is not ideal because vLLM has its own attention backend.
FIX #180
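For context, a minimal sketch of the kind of direct xFormers call referred to above (shapes are illustrative whisper-large encoder dimensions, not taken from this PR):

import torch
import xformers.ops as xops

batch, seq_len, num_heads, head_dim = 1, 1500, 20, 64   # ~whisper-large encoder
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Forward-only memory-efficient attention: no KV cache and no paged attention,
# which is why routing this through vLLM's own attention backend is preferable.
out = xops.memory_efficient_attention_forward(q, k, v, scale=head_dim ** -0.5)
print(out.shape)   # (batch, seq_len, num_heads, head_dim)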