[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step #6338
Conversation
@comaniac addressed review comments. Looks much better now.
It's much better now!
Left some minor comments. The major question I have is whether it's beneficial to also include attention metadata updating in the CUDA kernel.
@comaniac addressed the second set of review comments, ready for final pass. Now working on getting all tests green.
LGTM. Thanks! Last batch of minor comments. If possible, it'd be great to have some e2e benchmarks (I could help with that if needed; just let me know).
I'll leave it to @cadedaniel to take a pass.
I will take a look at this tomorrow. One thing that needs validation is that the draft acceptance rate is still 100% when using the same draft and target model. I ran this PR locally and got <100%, which indicates some correctness issue.
That's a good way to evaluate. We should add this test to the CI in this PR.
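A minimal sketch of this kind of check, assuming greedy sampling and an identical draft/target model (the model choice and `num_speculative_tokens` are illustrative assumptions; a real test would also read the engine's spec-decode acceptance-rate metrics rather than only comparing outputs):

```python
# Sketch: with greedy sampling, speculative decoding must not change outputs,
# and with an identical draft and target model every draft token should be
# accepted (acceptance rate == 100%).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy sampling

baseline = LLM(model="facebook/opt-125m").generate(prompts, params)
spec = LLM(model="facebook/opt-125m",
           speculative_model="facebook/opt-125m",  # same draft and target
           num_speculative_tokens=5).generate(prompts, params)

assert [o.outputs[0].text for o in baseline] == \
       [o.outputs[0].text for o in spec]
```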
@cadedaniel @comaniac addressed the acceptance rate issue; it is now 100% for both the GPU and non-GPU prepare_inputs paths.
Addressing other review comments now.
Addressed review comments.
Looks good, all comments are nits/docs. The missing piece is a testing story: can we guarantee that this codepath runs in CI, and fail a test if it doesn't? E.g., in the test for this codepath, assert that the GPU backend is used at least once. This is important because otherwise we could silently lose coverage of this codepath if a dependency changes in CI.
It would be great to have an explicit case for when the GPU path is disabled as well, so both paths are tested regardless of CI conditions.
If these are too hard, let's add a big docstring about it and revisit.
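A minimal sketch of what such a CI guard could look like, assuming a hypothetical `_gpu_advance_step` method and fixtures standing in for the real draft_model_runner internals:

```python
# Sketch: spy on the GPU codepath and fail loudly if it never runs, so a CI
# dependency change cannot silently disable coverage of this path.
from unittest.mock import patch

def test_gpu_advance_step_path_is_exercised(draft_runner, model_input):
    with patch.object(draft_runner, "_gpu_advance_step",
                      wraps=draft_runner._gpu_advance_step) as spy:
        draft_runner.execute_model(model_input)
    assert spy.call_count > 0, "GPU advance_step path was not exercised"
```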
# Update query lengths. Note that we update only queries and not seqs,
# since tensors may be padded due to captured cuda graph batch size
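A rough illustration of the comment above (all names and sizes illustrative): with captured CUDA graphs, tensors are padded to the graph's batch size, so only the first `num_queries` entries are live and only those may be advanced.

```python
import torch

num_queries = 3                              # live requests this step
seq_lens = torch.tensor([10, 7, 12, 0, 0])   # padded to graph batch size 5
input_positions = torch.zeros_like(seq_lens)

# Next position for each live query is its current sequence length;
# the padded tail (entries 3 and 4) must stay untouched.
input_positions[:num_queries] = seq_lens[:num_queries]
seq_lens[:num_queries] += 1
```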
FYI @LiuXiaoxuanPKU, want to make sure you're aware of this for the CUDA graph <> MQA integration.
Added a test and fixed the fallback.
btw can you rebase or merge the latest main to make sure everything is up to date?
rebased
Added Cody's mock test.
Great contribution @alexm-neuralmagic! Thanks also to @comaniac for all the help.
I'll merge this first and make a small patch for the variable naming.
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (vllm-project#6338) Signed-off-by: Alvant <[email protected]>
This PR moves the implementation of prepare_inputs(..) for advance_step(..) (inside draft_model_runner.py) to the GPU. It adds a new GPU kernel that modifies the prepare_inputs tensors directly on the GPU, and also improves the sampler inside draft_model_runner to reduce unnecessary GPU<=>CPU transfer overheads. Here is a quick performance check on an A100 GPU: for 50th-percentile latency, current main gets 500ms; with this PR's sampler improvements it drops to 477ms; and with the sampler improvements plus prepare_inputs lowered to GPU it reaches 329ms, a 65% improvement vs. main.
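As a rough illustration of what the new kernel does, here is a minimal sketch of the per-step tensor updates written with torch ops. The PR itself implements this as a fused CUDA kernel; the tensor names follow vLLM conventions but are assumptions here, not the PR's actual signatures.

```python
import torch

def advance_step_sketch(sampled_token_ids: torch.Tensor,  # [num_queries]
                        input_tokens: torch.Tensor,       # [num_queries]
                        input_positions: torch.Tensor,    # [num_queries]
                        seq_lens: torch.Tensor,           # [num_queries]
                        slot_mapping: torch.Tensor,       # [num_queries]
                        block_tables: torch.Tensor,       # [num_queries, max_blocks]
                        block_size: int) -> None:
    # Next step's input token is the token just sampled (stays on GPU,
    # avoiding a GPU->CPU->GPU round trip).
    input_tokens.copy_(sampled_token_ids)
    # The next position is the current sequence length.
    input_positions.copy_(seq_lens)
    # Grow each sequence by one token.
    seq_lens.add_(1)
    # Recompute the KV-cache slot for the new position from the block table.
    block_idx = input_positions // block_size
    block_offset = input_positions % block_size
    block_numbers = block_tables.gather(1, block_idx.unsqueeze(1)).squeeze(1)
    slot_mapping.copy_(block_numbers * block_size + block_offset)
```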
TODOs
draft_model_runner: new - prepare_inputs on GPU + sampler improvements
draft_model_runner: new - prepare_inputs on CPU + sampler improvements
draft_model_runner: main - prepare_inputs on CPU