[SpecDec] Remove Batch Expansion (2/3) #9298
LGTM. What happens if we enable CUDA graph in this PR?
Yeah, I know. I mean: what will users see in this case? We should provide a proper error message.
It will fall back to batch expansion if CUDA graph is enabled, as shown here.
Great! Thanks for your work, @LiuXiaoxuanPKU. Below I've shared the details of my experiment, which is quite similar to yours.
[Tables: scoring time; avg latency]
@wooyeonlee0 For the last column, MQA with CUDA graph, we implemented a quick version (without handling edge cases) to test the performance. Given the numbers, instead of adding CUDA graph support for MQA, which might introduce memory overhead because of the change in bucketing strategy, it might be easier to switch between the MQA scorer and the batch expansion scorer. For example, when batch_size < 8, use the batch expansion scorer with CUDA graph; when batch_size > 8, use the MQA scorer without CUDA graph. The switch should be relatively easy to implement without introducing overhead. Any thoughts, comments, or discussion are appreciated!
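The switch proposed above could look roughly like the sketch below. The names `ScorerKind` and `choose_scorer` are hypothetical and not part of vLLM. The comment only specifies batch_size < 8 and batch_size > 8, so sending batch_size == 8 to the MQA scorer is an assumption.

```python
from enum import Enum


class ScorerKind(Enum):
    # Hypothetical labels for the two scoring paths discussed in this thread.
    BATCH_EXPANSION = "batch_expansion"  # intended to run with CUDA graph
    MQA = "mqa"                          # intended to run eagerly, without CUDA graph


def choose_scorer(batch_size: int, threshold: int = 8) -> ScorerKind:
    """Pick a scorer from the batch size.

    Small batches go to batch expansion (with CUDA graph); larger batches
    go to the MQA scorer (without CUDA graph). The behaviour at exactly
    `threshold` is an assumption, since the thread leaves it unspecified.
    """
    if batch_size < threshold:
        return ScorerKind.BATCH_EXPANSION
    return ScorerKind.MQA


if __name__ == "__main__":
    for bs in (1, 4, 8, 32):
        print(bs, choose_scorer(bs).value)
```

The threshold of 8 comes from the example in the comment; in practice it would likely need to be tuned per model and GPU.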
@LiuXiaoxuanPKU Thanks for the quick response! :)
Does the memory overhead come from multiple CUDA graphs for each different K value? I think we can use the variable-K MQA scorer in this PR only when we want to use the dynamic speculative decoding feature.
Thank you very much for your work @LiuXiaoxuanPKU 😊
GPU: 1 A10 (tp=1)
When I did not add the --enforce-eager parameter (automatically using the batch expansion scorer):
When I added the --enforce-eager parameter (automatically using the MQA scorer):
With CUDA Graphs enabled, average_time_per_proposal_tok_ms is only 5.56 ms, whereas with CUDA Graphs disabled it increases to 13.69 ms :(
In addition, I separately tested the inference performance of the draft model with and without CUDA Graphs enabled, as shown below:
Could you explain why enabling or disabling CUDA Graphs has such a significant impact on the draft model in my environment? Are there any solutions to resolve this issue?
Signed-off-by: Alvant <[email protected]>
Signed-off-by: Amit Garg <[email protected]>
Yeah, it totally makes sense. When the draft model is small, we need CUDA graph to achieve good performance. CUDA graph support for the draft model should always be on.
The tricky part is that with ngram, it is very likely that requests within the same batch have different propose lengths (0 or k). If there is a match for a given request, the propose length will be k; otherwise it will be 0. In this case, we cannot assume the same k for all requests in the batch.
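A minimal sketch of the situation described above: a toy ngram lookup that proposes either k tokens or none, plus the cumulative offsets that describe the resulting ragged batch. `ngram_propose` and `cumulative_offsets` are hypothetical helpers written for illustration, not vLLM code. The offsets mirror the cumulative-sequence-length layout that variable-length attention kernels such as flash_attn_varlen_func accept, but the query lengths a real scorer passes may differ from the raw propose lengths used here.

```python
from itertools import accumulate


def ngram_propose(tokens, n, k):
    """Hypothetical ngram proposer.

    If the last n tokens also appear earlier in the sequence and are
    followed by at least k tokens, return those k tokens. Otherwise
    return an empty list, so the propose length is either k or 0.
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan earlier windows, most recent first, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            follow = tokens[start + n:start + n + k]
            if len(follow) == k:
                return follow
    return []


def cumulative_offsets(lengths):
    """Prefix sums with a leading 0: request i owns [offsets[i], offsets[i + 1])."""
    return [0] + list(accumulate(lengths))


batch = [
    [1, 2, 3, 4, 1, 2],  # suffix [1, 2] occurred earlier, so this request matches
    [5, 6, 7, 8],        # suffix [7, 8] never occurred earlier, so no match
]
proposals = [ngram_propose(seq, n=2, k=2) for seq in batch]
propose_lens = [len(p) for p in proposals]
offsets = cumulative_offsets(propose_lens)

print(proposals)     # [[3, 4], []]
print(propose_lens)  # [2, 0]
print(offsets)       # [0, 2, 2]
```

Because the two requests end up with lengths 2 and 0, a scorer that assumes one shared k for the whole batch cannot represent this batch directly, while a packed layout with offsets can.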
@LiuXiaoxuanPKU Thanks for the detailed answer :) I didn't know that ngram has that kind of requirement..! By the way, I have one more question about generation outputs. While exploring this, I found that batching can also introduce this kind of issue due to floating-point errors. (#966)
Signed-off-by: Sumit Dubey <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Follow up of #8839.
We will use flash_attn_varlen_func for the MQA scorer. Therefore, we can support different propose lengths for different requests in the batch, which is essential for ngram and dynamic speculative decoding.
The following are some preliminary benchmark numbers for the MQA scorer compared with batch expansion, with and without CUDA graph.