
Whisper support #5964

Open
wants to merge 16 commits into main

Conversation


@huseinzol05 huseinzol05 commented Jun 28, 2024

Initial support for Whisper: the model loads and runs inference, but the outputs are garbage. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. There might be bugs related to the weights or the attention. A few hiccups:

  1. Still trying to figure out a KV cache for the encoder hidden states, or else each step will recompute the encoder hidden states.
  2. There is no non-causal attention for the encoder or for the cross-attention in the decoder. All attention implementations in vLLM seem to be causal, so I just use xops.memory_efficient_attention_forward like the T5 branch (see the sketch below); this is not ideal, given that vLLM has its own attention backends.
  3. Reuse the cross-attention KV cache from the first step for subsequent steps.
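For reference on point 2, a minimal sketch (not the exact PR code) of the non-causal fallback, assuming the standard xformers.ops API; passing no attention bias gives full (non-causal) attention, which is what the encoder and the decoder's cross-attention need:

import torch
import xformers.ops as xops

def non_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, scale: float) -> torch.Tensor:
    # Shapes follow the xFormers convention [batch, seq_len, num_heads, head_dim].
    # attn_bias=None -> every query attends to every key (no causal mask).
    return xops.memory_efficient_attention_forward(q, k, v, attn_bias=None, scale=scale)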

FIX #180

@ywang96
Member

ywang96 commented Jun 28, 2024

Thank you for the PR! @huseinzol05

Currently our infrastructure support for encoder-decoder models is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a good idea to hold off on Whisper until the underlying infra is ready.

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Jun 28, 2024

PRs for infrastructure are about to land

We are starting to build models on top of this (starting with BART to keep it simple)

Could you build this PR on top of #4942

cc @afeldman-nm

@afeldman-nm
Contributor

afeldman-nm commented Jun 28, 2024

PRs for infrastructure are about to land

We are starting to build models on top of this (starting with BART to keep it simple)

Could you build this PR on top of #4942

cc @afeldman-nm

Seconding what @robertgshaw2-neuralmagic said - #4942 provides the support for encoder attention & cross-attention KV cache which Whisper will need.

I am planning to have BART working by EOD today or thereabouts, which can serve as an example of implementing an encoder/decoder model. Hoping to have all tests passing soon.

Hopefully you can try building your implementation of Whisper on top of #4942 , it would be great to know if you run into any issues.

At the level of kernel invocation, Attention.forward() now has an attn_type argument which consumes one of three possible AttentionType enum values: ENCODER (encoder attention), DECODER (decoder self-attention), ENCODER_DECODER (encoder/decoder cross-attention):

https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/test_encoder_decoder_attn.py#L697-L702

The following new attn_metadata fields enable the attn_type=ENCODER and attn_type=ENCODER_DECODER scenarios:

https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/utils.py#L862-L869

Specifically, cross_block_tables and cross_slot_mapping hold the block tables and slot mappings for the cross-attention KV cache.
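For illustration only, a rough sketch of how a Whisper decoder layer's cross-attention might invoke this API; the import path, function name, and exact argument order here are assumptions based on the description above, not the final #4942 code:

import torch
from vllm.attention import Attention, AttentionType  # import path assumed; AttentionType introduced by #4942

def cross_attention(attn: Attention, q: torch.Tensor, encoder_k: torch.Tensor,
                    encoder_v: torch.Tensor, kv_cache: torch.Tensor, attn_metadata) -> torch.Tensor:
    # Queries come from the decoder; keys/values come from the encoder output and
    # live in the cross-attention KV cache addressed by cross_slot_mapping /
    # cross_block_tables in attn_metadata.
    return attn(q, encoder_k, encoder_v, kv_cache, attn_metadata,
                attn_type=AttentionType.ENCODER_DECODER)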

There are also additional changes to support scheduling & adding requests for encoder/decoder models; you can see an example of invoking BART below:

[WIP] BART model invocation example
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/examples/offline_inference_encoder_decoder.py

Some additional example code (links are to files in #4942 ):

[WIP] BART model implementation:
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/vllm/model_executor/models/bart.py

[WIP] BART e2e test: compare output logits against HuggingFace implementation
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/models/test_bart.py

@huseinzol05
Author

PRs for infrastructure are about to land

We are starting to build models on top of this (starting with BART to keep it simple)

Could you build this PR on top of #4942

cc @afeldman-nm

Sure! I will look at that branch.

@huseinzol05
Author

@afeldman-nm, let me solve the garbage outputs first; after that I will upstream to #4942

@huseinzol05
Author

Solved the garbage outputs and added CUDA graph support

@huseinzol05
Author

Streaming SRT format,

Screen.Recording.2024-07-02.at.4.04.26.PM.mov

Streaming JSON format,

Screen.Recording.2024-07-02.at.4.05.09.PM.mov

@huseinzol05
Author

@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper

@afeldman-nm
Contributor

@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper

Hi @huseinzol05 this is great, I gave your blog a look.

FYI:

#4888 took a little longer than expected to land but it has been landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have been landed) were prerequisites for #4942 , which completes end-to-end support for encoder/decoder models with the xFormers backend & also introduces the BART model into vLLM. #4942 is still WIP but hoping to complete it soon.

@huseinzol05
Author

@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper

Hi @huseinzol05 this is great, I gave your blog a look.

FYI:

#4888 took a little longer than expected to land but it has been landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have been landed) were prerequisites for #4942 , which completes end-to-end support for encoder/decoder models with the xFormers backend & also introduces the BART model into vLLM. #4942 is still WIP but hoping to complete it soon.

Nice! Is there anything you need that I can help with?

@MahmoudAshraf97
Contributor

Thanks for your efforts Husein, will this implementation support continuous batching?

@AlexBlack2202

Initial support for Whisper: the model loads and runs inference, but the outputs are garbage. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. There might be bugs related to the weights or the attention. A few hiccups:

  1. Still trying to figure out a KV cache for the encoder hidden states, or else each step will recompute the encoder hidden states.
  2. There is no non-causal attention for the encoder or for the cross-attention in the decoder. All attention implementations in vLLM seem to be causal, so I just use xops.memory_efficient_attention_forward like the T5 branch; this is not ideal, given that vLLM has its own attention backends.
  3. Reuse the cross-attention KV cache from the first step for subsequent steps.

Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80G, but it was not successful.

The error

python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

Do you have any update to fix this, or can you provide your Python library versions?

Thank you.

@huseinzol05
Author

Thanks for your efforts Husein, will this implementation support continuous batching?

Yes

@huseinzol05
Author

huseinzol05 commented Jul 12, 2024

Initial support for Whisper: the model loads and runs inference, but the outputs are garbage. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. There might be bugs related to the weights or the attention. A few hiccups:

  1. Still trying to figure out a KV cache for the encoder hidden states, or else each step will recompute the encoder hidden states.
  2. There is no non-causal attention for the encoder or for the cross-attention in the decoder. All attention implementations in vLLM seem to be causal, so I just use xops.memory_efficient_attention_forward like the T5 branch; this is not ideal, given that vLLM has its own attention backends.
  3. Reuse the cross-attention KV cache from the first step for subsequent steps.

Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80G, but it was not successful.

The error

python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

Do you have any update to fix this, or can you provide your Python library versions?

Thank you.

Below are my steps to run,

pip3.10 install git+https://github.com/mesolitica/vllm-whisper
python3.10 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100
wget https://github.com/mesolitica/malaya-speech/raw/master/speech/7021-79759-0004.wav
curl -X 'POST' 'http://localhost:8000/audio/transcriptions' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F '[email protected];type=audio/mpeg' \
-F 'model=whisper' \
-F 'response_format=json' \
-F 'stream=true'

output,

data: {"token": "<|en|><|0.0|>"}

data: {"token": " without"}

data: {"token": " going"}

data: {"token": " to"}

data: {"token": " any"}

...

my pip freeze,

accelerate==0.32.1
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
auto-gptq @ file:///home/ubuntu/AutoGPTQ/dist/auto_gptq-0.8.0.dev0%2Bcu1210-cp310-cp310-linux_x86_64.whl
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.30.0
comm==0.2.2
datasets==2.20.0
dbus-python==1.2.16
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
distro-info==0.23+ubuntu1.1
dnspython==2.6.1
email_validator==2.2.0
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.20.0
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.5.0
gekko==1.2.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
idna==2.8
interegular==0.3.3
ipykernel==6.29.5
ipython==8.21.0
ipython-genutils==0.2.0
ipywidgets==8.1.3
jedi==0.19.1
Jinja2==3.1.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.18.0
jupyter-server-proxy==3.2.1
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyterlab_pygments==0.3.0
jupyterlab_widgets==3.0.11
lark==1.1.9
llvmlite==0.43.0
lm-format-enforcer==0.10.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
notebook==6.4.12
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
openai==1.35.13
orjson==3.10.6
outlines==0.0.46
packaging==24.1
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
peft==0.11.1
pexpect==4.9.0
pillow==10.4.0
platformdirs==4.2.2
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
protobuf==5.27.2
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
PyGObject==3.36.0
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
qtconsole==5.5.2
QtPy==2.4.1
ray==2.32.0
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.2.0
rich==13.7.1
rouge==1.0.1
rpds-py==0.18.1
safetensors==0.4.3
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
simpervisor==1.0.0
six==1.14.0
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
terminado==0.18.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.19.1
torch==2.3.0
torchaudio==2.3.0
torchvision==0.18.0
tornado==6.4.1
tqdm==4.66.4
traitlets==5.9.0
transformers==4.42.3
triton==2.3.0
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
unattended-upgrades==0.1
urllib3==2.2.2
uvicorn==0.30.1
uvloop==0.19.0
vllm @ git+https://github.com/mesolitica/vllm-whisper@fa81def0aab015cf183b662ea8cb2d89ab1be428
vllm-flash-attn==2.5.9
watchfiles==0.22.0
wcwidth==0.2.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.4

@dkakaie

dkakaie commented Jul 28, 2024

@huseinzol05 I created a new Conda environment and was able to load and infer using
python -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100 --gpu_memory_utilization=0.80. Works smoothly

I tried to use whisper_example.py from the examples folder:

  1. Trying llm = LLM( model="openai/whisper-large-v3", max_num_seqs = 1, max_model_len = 448, gpu_memory_utilization = 0.4, dtype = 'bfloat16') succeeds if you add whisper_input_type="input_features"

  2. Generation, output_lang = llm.generate( { "prompt_token_ids": [50258], "multi_modal_data": AudioData(y), }, sampling_params=SamplingParams(max_tokens=1, temperature=0), ) fails with error "Multi-modal inputs are only supported by vision language models."

@huseinzol05
Author

@dkakaie my bad, Whisper should not use the multimodal interface; fixed the example in ebf1cbf

@dkakaie

dkakaie commented Jul 28, 2024

@huseinzol05 your latest commit brought it to life, working like a charm. I don't know if, given how vLLM works, it is technically possible to have token/word-level timestamps?

@Jiltseb

Jiltseb commented Jul 29, 2024

Hi @huseinzol05 Thanks for the whisper support and the example.

Can you make whisper_example.py also support long audio (> 30 seconds)? The example currently works up to the first 30 sec of a long audio.

Long audio works smoothly when running the application.

@MarktHart

Hi @huseinzol05 Thanks for the whisper support and the example.

Can you make whisper_example.py also support long audio (> 30 seconds)? The example currently works up to the first 30 sec of a long audio.

Long audio works smoothly when running the application.

Long audio has all kinds of different strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated and I am not sure how well it fits with vLLM's other abstractions. I think Whisper for long-form has enough exceptions relative to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and will probably be replaced in the semi-short term. I don't think many of Whisper's particularities will be very relevant in the future.

@afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. Not sure what is currently implemented, but option 2 below, plus always using "<|notimestamps|>" as a forced token, is probably vLLM's best fit.

I think there are basically 4 options for long form chunking:

  1. Decoding as described in the Whisper paper: use the last decoder tag to see where the model stopped and feed in the audio again from that point on.
  2. Sliding window without overlap.
  3. Sliding window with overlap.
  4. Use VAD to create chunks.

In my (subjective) experience, although I might have dinged up some implementations/evaluations:

  • Quality-wise it's 4 > 3 > 1 > 2, probably because 2 creates windows that start in the middle of a sentence/word. 3 and 4 have quite some extra complexity: stitching results back together for 3, and needing another model plus its options for 4.
  • Speed-wise it's 2/4 > 3 >>> 1, mostly because 2, 3 and 4 are parallelizable. Even when using previous context in the decoder (which matters less than you'd think), they are still a lot faster despite some parts needing to go sequential anyway.
  • For complexity of implementation it's probably 2 > 1 > 4 > 3; for 3 it mostly depends on how the overlap should be stitched back together (a toy sketch of the window boundaries for options 2 and 3 follows below).
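As a toy illustration (not from this PR) of options 2 and 3, computing window boundaries for a 30 s window with and without overlap:

def window_bounds(total_s: float, window_s: float = 30.0, overlap_s: float = 0.0):
    # overlap_s = 0 gives option 2 (no overlap); overlap_s > 0 gives option 3.
    step = window_s - overlap_s
    starts, t = [], 0.0
    while t < total_s:
        starts.append(t)
        t += step
    return [(s, min(s + window_s, total_s)) for s in starts]

print(window_bounds(75))                 # option 2: windows 0-30, 30-60, 60-75
print(window_bounds(75, overlap_s=5.0))  # option 3: windows 0-30, 25-55, 50-75

For option 3, the transcripts of adjacent windows then have to be stitched back together, which is where most of the extra complexity lives.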

@afeldman-nm
Contributor

afeldman-nm commented Aug 4, 2024

Hi @huseinzol05 Thanks for the whisper support and the example.
Can you make whisper_example.py also support long audio (> 30 seconds)? The example currently works up to the first 30 sec of a long audio.
Long audio works smoothly when running the application.

Long audio has all kinds of different strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated and I am not sure how well it fits with vLLM's other abstractions. I think Whisper for long-form has enough exceptions relative to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and will probably be replaced in the semi-short term. I don't think many of Whisper's particularities will be very relevant in the future.

@afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. Not sure what is currently implemented, but option 2 below, plus always using "<|notimestamps|>" as a forced token, is probably vLLM's best fit.

I think there are basically 4 options for long form chunking:

  1. Decoding as described in the Whisper paper: use the last decoder tag to see where the model stopped and feed in the audio again from that point on.
  2. Sliding window without overlap.
  3. Sliding window with overlap.
  4. Use VAD to create chunks.

In my (subjective) experience, although I might have dinged up some implementations/evaluations:

  • Quality-wise it's 4 > 3 > 1 > 2, probably because 2 creates windows that start in the middle of a sentence/word. 3 and 4 have quite some extra complexity: stitching results back together for 3, and needing another model plus its options for 4.
  • Speed-wise it's 2/4 > 3 >>> 1, mostly because 2, 3 and 4 are parallelizable. Even when using previous context in the decoder (which matters less than you'd think), they are still a lot faster despite some parts needing to go sequential anyway.
  • For complexity of implementation it's probably 2 > 1 > 4 > 3; for 3 it mostly depends on how the overlap should be stitched back together.

Thanks @MarktHart, I will incorporate this into the encoder/decoder RFC I am working on. "Basic" encoder/decoder model support should land soon; the RFC covers the significant follow-on work involved in maturing encoder/decoder support to a degree that is commensurate with decoder-only support (i.e. adding more encoder/decoder models like Whisper, feature compatibility with encoder/decoder, etc.).

I have not yet studied the audio-length problem you are discussing in depth; my guess is it will impact three key parts of the vLLM encoder/decoder model inference process:

  1. The semantics of submitting a request to vLLM (i.e. how does a single vLLM request map onto your "four basic options for long-form chunking")
  2. The information which a vLLM request must return to the caller in order to know where transcription left off
  3. The process of injecting control tokens (i.e. <|notimestamps|>, language choice, task, etc.) into the Whisper decoder input during the autoregressive decoding process (see the small illustration below)
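As a small illustration of point 3 (not part of this PR), the control tokens Whisper expects at the start of decoding can be resolved via the HuggingFace tokenizer rather than hard-coding ids:

from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
control_tokens = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
prompt_token_ids = tok.convert_tokens_to_ids(control_tokens)
# These ids would seed the decoder input; the autoregressive loop then produces
# the transcription tokens after them.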

Thoughts @MarktHart @huseinzol05 ?

CC @robertgshaw2-neuralmagic

@huseinzol05
Author

@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio

Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1-second segments, and once 30 seconds have accumulated it passes the chunk to Whisper to decode (a rough sketch follows below). You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py

  1. Feel free to add an overlapping sliding window.
  2. Feel free to parallelize the sliding windows, because the model supports continuous batching.

@Jiltseb

Jiltseb commented Aug 5, 2024

@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio

Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1-second segments, and once 30 seconds have accumulated it passes the chunk to Whisper to decode. You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py

  1. Feel free to add an overlapping sliding window.
  2. Feel free to parallelize the sliding windows, because the model supports continuous batching.

Thanks @huseinzol05, I have tried naive chunking; it has good speed but caused a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching.
Also eagerly waiting for the caching speed-up of the encoder outputs and cross-attention layers.
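For what it's worth, a hedged sketch of the VAD-based variant using Silero VAD (an assumption, not part of this PR; Silero is loaded via torch.hub as in its README), grouping detected speech into roughly 30-second chunks that could then be submitted concurrently, since the engine supports continuous batching:

import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SAMPLE_RATE = 16000
wav = read_audio("audio.wav", sampling_rate=SAMPLE_RATE)
segments = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)

chunks, current, start = [], [], None
for seg in segments:
    if start is None:
        start = seg["start"]
    current.append(wav[seg["start"]:seg["end"]])
    if seg["end"] - start > 30 * SAMPLE_RATE:  # flush once ~30 s of audio is covered
        chunks.append(torch.cat(current))
        current, start = [], None
if current:
    chunks.append(torch.cat(current))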

@huseinzol05
Author

@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio
Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1-second segments, and once 30 seconds have accumulated it passes the chunk to Whisper to decode. You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py

  1. Feel free to add an overlapping sliding window.
  2. Feel free to parallelize the sliding windows, because the model supports continuous batching.

Thanks @huseinzol05, I have tried naive chunking; it has good speed but caused a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching. Also eagerly waiting for the caching speed-up of the encoder outputs and cross-attention layers.

When I get some free time, I will try to add an overlapping sliding window like the HuggingFace implementation, or VAD.

@Jeevi10

Jeevi10 commented Aug 16, 2024

Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this?

@Jeevi10

Jeevi10 commented Aug 16, 2024

Can you please add LoRA support for Whisper?

@huseinzol05
Author

Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this?

If you check the source code, we first predict the language token; the .en checkpoints probably predicted a different language token, so the subsequent tokens get messed up.
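A hedged sketch of that first decode step, adapted from the example discussed earlier in this thread (how the audio is attached here is hypothetical; check whisper_example.py for the actual argument): decoding exactly one token from <|startoftranscript|> (id 50258) shows which language token the model picks, which is where the *.en checkpoints can already go wrong.

import numpy as np
from vllm import LLM, SamplingParams

audio = np.zeros(16000 * 5, dtype=np.float32)  # placeholder waveform; replace with real audio

llm = LLM(model="openai/whisper-large-v3", max_model_len=448,
          whisper_input_type="input_features", dtype="bfloat16")
request = {
    "prompt_token_ids": [50258],  # <|startoftranscript|>
    "whisper_data": audio,        # hypothetical key: however the fixed example attaches audio
}
out = llm.generate(request, sampling_params=SamplingParams(max_tokens=1, temperature=0))
print(out[0].outputs[0].text)     # e.g. "<|en|>"; a wrong language token derails the rest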

@Temirulan

Hello @huseinzol05
Thanks for your contribution.

I successfully started your fork on my A100 80GB GPU. Are you sure about continuous batching?
I noticed that if I query with several different audios, it mixes the output tokens between them.

@huseinzol05
Author

@Temirulan messed up in what way? Totally gibberish?

@hmellor hmellor mentioned this pull request Sep 24, 2024
@lebaudantoine

Hey, just checking in! Do you have any updates on the status of this pull request? Curious when it might be ready to merge. 😊

@huseinzol05
Author

@Temirulan messed up in what way? Totally gibberish?

Please share the inference code showing how you do the concurrency.

@cruzanstx

Any update on this? Seems promising.

@ArmykOliva

I hope this gets added soon...

@MahmoudAshraf97
Contributor

@huseinzol05 I guess a lot of work can be delayed or removed completely by limiting the input length to 30s and letting users decide how they want to chunk files longer than that, or at least leaving that for another PR after the main engine itself has been tested by users.
In other words, implement the model only and let users decide how to use it.

@huseinzol05
Author

Sorry, I do not have the capacity to develop this for now. Feel free to continue from this fork to add VAD or overlapping windows; it should be quick.
