Accept variable-length batch prompts for Whisper #1784
base: master
Conversation
Just got notice of this, cool! However, it makes me wonder... is it really necessary? For example, in the …
Here's my summary, but I'll leave it to you to analyze the relevant source code files within:
BBC SUMMARY OF WHISPERS2T PIPELINE HANDLING MULTIPLE AUDIO FILES
|
Forgot to attach an outline of how it handles a single audio file:
BBC OUTLINE OF WHISPERS2T PIPELINE FOR SINGLE AUDIO FILE
|
No speedup gains will be noticed unless continuous batching is implemented, which is different from regular batching. With regular batching the speedups eventually plateau because the longest sequence often runs alone until completion, so at large batch sizes the effective batch size drops to 1 once all the other sequences have finished. This is an analysis of how generation efficiency drops sharply with larger batch sizes, even if the GPU is not yet saturated.
Achieving 100% efficiency is already possible with the transformers backend regardless of the batch size |
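To make the plateau concrete, here is a small illustrative calculation (the sequence lengths are made up, not measured from any backend) using the same efficiency metric as the benchmark further down: with static batching, every decoder slot stays occupied until the longest sequence in the batch finishes.

# Illustrative only: hypothetical per-sequence token counts for one decoded batch.
sequence_lengths = [12, 35, 48, 210]    # tokens actually generated by each sequence (made up)
batch_size = len(sequence_lengths)

cycles = max(sequence_lengths)           # decode steps the whole batch runs for
useful_tokens = sum(sequence_lengths)    # tokens that are actually kept
efficiency = useful_tokens / (batch_size * cycles)

print(f"efficiency: {efficiency:.2f}")   # ~0.36: most slots sit idle or emit tokens that get discarded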
When I run WhisperS2T it has no problem fully saturating the CUDA cores. I'll look more into the distinction you're drawing, but my initial impression is that as long as you construct the pipeline elegantly before the data is sent to … |
Regarding your specific test results... can you link, in the private repo I created, the actual audio files and the script you used... and the overall processing time? I'd like to do a comparison if you don't mind. |
The CUDA cores will be saturated of course, with one tiny problem: it's outputting garbage that will be discarded anyway |
Would you mind sharing the audio files you tested like I mentioned? |
I tested with this |
75 segments, not files: 34 min / 30 s ~= 75 segments |
@MahmoudAshraf97 Can you provide the script you used for your benchmark please? |
This is a good starting point to reproduce using any inference engine:

%%timeit
# Assumes torch, model, features and prompt_id are already defined in the notebook.
batch_size = 4
encoder_batch_size = 4
total_cycles = 0
total_tokens = 0

# Encode the audio features in fixed-size batches
encoder_outputs = []
for i in range(0, len(features), encoder_batch_size):
    encoder_outputs.extend(
        model.encoder.get_audio_features(
            features[i : i + encoder_batch_size]
        ).unbind()
    )
torch.cuda.empty_cache()

# Decode in fixed-size batches and track how many decode steps produce useful tokens
for i in range(0, len(encoder_outputs), batch_size):
    inputs = list(encoder_outputs[i : i + batch_size])
    decoder_input_ids = prompt_id.repeat(len(inputs), 1)
    outputs = model.decoder.generate(
        decoder_input_ids, inputs, model.eot_id, max_new_tokens=124, num_beams=1
    )
    # Strip the 4 prompt tokens and any end-of-text tokens from each sequence
    filtered_outputs = [
        [token for token in output[0][4:] if token != model.eot_id]
        for output in outputs
    ]
    # The whole batch keeps decoding until its longest sequence finishes
    total_cycles += max(len(output) for output in filtered_outputs)
    total_tokens += sum(len(output) for output in filtered_outputs)
    torch.cuda.empty_cache()

print("efficiency: ", total_tokens / batch_size / total_cycles) |
Mathematically, both are identical |
I don't do "good starting points". When I ask for a script I expect a script, and out of respect, I do the same when someone I'm supposedly collaborating with asks me for a SCRIPT. |
My dear friend, we are in the same boat here,
|
I deleted all comments that were unduly inflammatory but left the ones that, while a little inflammatory, still directly pertain to this pull request. Next time, please lead with the "NDA" reason. Also, you didn't clarify whether this prevents you from sending the tensorrt code to the private repo like you promised. This is all voluntary, so tell me if you will NOT do it rather than saying you will and then not doing it. Thanks. |
How can I help? I would like to use this in my pipeline. |
Even after this PR is accepted, we will need to find a way to stop the generation once a single sequence finishes rather than wait for all of them to finish; this is easily done in … |
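Not part of this PR, but a minimal sketch of that idea in plain PyTorch, assuming a hypothetical decoder_step(encoder_outputs, input_ids) call that returns next-token logits of shape (batch, vocab): whenever a sequence emits the end-of-text token it is dropped from the active batch, so the remaining sequences keep decoding with a smaller effective batch size. A real implementation would also need to reuse and reorder the KV cache, which is omitted here.

import torch

def decode_with_early_exit(decoder_step, encoder_outputs, prompt_ids, eot_id, max_new_tokens=448):
    # encoder_outputs: list of per-segment encoder states; prompt_ids: 1-D tensor of prompt tokens.
    # decoder_step is a placeholder for whatever single-step decode call the backend exposes.
    active = list(range(len(encoder_outputs)))            # indices of sequences still decoding
    sequences = [prompt_ids.clone() for _ in encoder_outputs]

    for _ in range(max_new_tokens):
        if not active:
            break
        # Batch only the sequences that have not emitted EOT yet
        batch_enc = torch.stack([encoder_outputs[i] for i in active])
        batch_ids = torch.stack([sequences[i] for i in active])
        next_tokens = decoder_step(batch_enc, batch_ids).argmax(dim=-1)

        still_active = []
        for pos, idx in enumerate(active):
            token = next_tokens[pos]
            if token.item() == eot_id:
                continue                                   # finished: free this slot immediately
            sequences[idx] = torch.cat([sequences[idx], token.view(1)])
            still_active.append(idx)
        active = still_active

    return sequences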
I tested the code and it works for my use case as is! I'm curious what the speedup is between this and full IFB, but I don't have the time to dedicate to that big of a code change, unfortunately. Thank you for this, it's a huge win 🙇. |
Hi @MahmoudAshraf97, cool! How different is it from vllm-whisper with continuous batching? |
Hi Jilt, TRT-LLM just released continuous batching support for Whisper last week. I'm still improving my implementation using the transformers backend to have better KV cache utilization, and I will release it publicly as soon as it's ready. |
I have tried vllm-whisper with CB, but the results seem to suffer because they used 3-second audio chunks plus padding instead of relying on semantics to split the audio. |
This is a continuation of #1457. The final goal is to enable continuous batching for Whisper models, which brings large speedups at large batch sizes.