Some fixes for custom allreduce kernels #2760
Conversation
@WoosukKwon Did any of you manage to reproduce custom all reduce getting stuck or generating garbage output?
To reproduce, I ran vLLM on 4x A100 80G SXM with CodeLlama 70B and sent some requests like this:
It usually gets stuck before request 50 for me.
Did not manage to reproduce when
@WoosukKwon This PR is ready for review.
@hanzhi713 Sorry for the delays in the review. I will review the PR this weekend and make sure this is included in v0.3.4.
Actually, @hanzhi713, have you tested this PR with a MoE model like Mixtral?
@tdene Hmm, maybe the buffer isn't placed on the correct CUDA device. Let me check tomorrow.
@tdene I didn't manage to reproduce this error.
@tdene I merged this branch with the latest main. Can you rerun your test? Also, I find it strange to see
@zhuohan123 Looks like @WoosukKwon is too busy. Can you help me get a different reviewer for this PR?
@hanzhi713 Apologies for the very late review. I had no bandwidth recently.
The PR looks good to me, though I still don't understand the cause of the bug. I left some minor comments; please take a look at them.
# note: num dev can be larger than world_size if we're only using
# first few GPUs
This is the case when more than one node (host) is used for TP, right?
No. This check is for the case where users have N GPUs but are only using the first M GPUs, where M < N.
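For illustration only, here is a host-side CUDA sketch of this kind of check. The code this PR actually touches is Python, and the function and variable names below are made up:

#include <cuda_runtime.h>

// Hypothetical sketch: the number of visible devices can exceed the
// tensor-parallel world size when only the first few GPUs are used, so the
// check compares against world_size instead of assuming num_dev == world_size.
bool enough_gpus_for_tp(int world_size) {
    int num_dev = 0;
    cudaGetDeviceCount(&num_dev);  // counts every GPU visible to this process
    // num_dev > world_size is fine: the TP ranks occupy the first world_size
    // GPUs. Only num_dev < world_size is a problem.
    return num_dev >= world_size;
}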
@hanzhi713 Thanks! Let me do some extra tests on the PR and merge.
@hanzhi713 I will merge this, as I found it works well on my 4x A100-80GB machine. Thanks for the fix!
@hanzhi713 Actually, we recently found that TRT-LLM's custom all reduce kernel is extremely simple. Do you have an idea why it can be much simpler than this implementation? What do you think about using TRT-LLM's kernel?
It looks to me like the implementation has about the same complexity as mine. What makes you think it looks simpler?
In my test with 4 non-NVLink-capable GPUs, custom allreduce gives a 20% performance improvement at batch size 1.
@hanzhi713 Can you make the 4 non-NVLink-capable GPU case available as an option?
@garycaokai Curious about your setup. What GPUs are you using? Are they all connected to a PCIe switch, or are they connected to the CPU directly? Yes, I can provide that as an option if I find some time to work on this.
@garycaokai Also, how did you measure the performance improvement? Is it a latency benchmark? And what is the configuration for the benchmark?
(GPU/NIC topology matrix output truncated; legend: X = Self; NIC legend: NIC0: mlx5_0)
I see, this makes a lot of sense. Usually A30 machines have GPUs connected to the CPU directly, and CPUs are often terrible PCIe switches. My implementation relies on PCIe P2P. However, in your case a PCIe switch connects each group of 4 GPUs. Given the much better switching performance, my implementation may work and provide performance improvements. Let me see if I can find machines with this topology.
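As a side note, one way to check whether PCIe P2P is actually available between GPU pairs on a machine like this is to ask the CUDA runtime directly. A small hypothetical probe, not part of this PR:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical probe: report, for every ordered GPU pair, whether peer-to-peer
// access is available. The custom allreduce path relies on P2P when NVLink is
// absent.
int main() {
    int num_dev = 0;
    cudaGetDeviceCount(&num_dev);
    for (int i = 0; i < num_dev; ++i) {
        for (int j = 0; j < num_dev; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            printf("GPU %d -> GPU %d: P2P %s\n", i, j,
                   can_access ? "available" : "unavailable");
        }
    }
    return 0;
}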
Recently, there have been some reports of stuck generation or garbage text when custom all reduce is enabled. While I didn't manage to reproduce any issues on A30 and A100, I did find some potentially unsafe synchronizations, which I attempt to fix here.
A __threadfence_system or a release-acquire pattern is needed to guarantee the visibility of other devices' writes on the current device; this is missing from the current implementation (a sketch of this pattern follows the issue list below).
Related issues:
#2788 (garbage output when upgrading vllm from 0.2.7 -> 0.3.0)
#2742 (garbage output. Solved when disable_custom_all_reduce=True)
#2731 (one person reported that disable_custom_all_reduce=True solves the generation hang)
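To make the ordering problem concrete, here is a minimal CUDA sketch of the release-acquire pattern described above. It is not vLLM's actual kernel: it assumes a single thread block per device, with data and flag pointing to memory visible to both devices (e.g. peer-mapped buffers). Without the fences, the consumer can observe the flag before the producer's payload writes become visible across devices.

// Hypothetical single-block sketch, not vLLM's actual kernels.
__global__ void producer(volatile float* data, volatile int* flag, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] = 1.0f;              // payload writes
    __syncthreads();                 // all payload writes in this block are done
    if (threadIdx.x == 0) {
        __threadfence_system();      // "release": make the payload visible to other devices
        *flag = 1;                   // then publish the flag
    }
}

__global__ void consumer(volatile float* data, volatile int* flag,
                         float* out, int n) {
    if (threadIdx.x == 0) {
        while (*flag == 0) {}        // spin until the producer publishes
        __threadfence_system();      // "acquire": order payload reads after the flag read
    }
    __syncthreads();                 // other threads wait for thread 0
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = data[i];            // payload reads are now safe
}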
In this PR, I made the following changes:
Add __threadfence_system to guarantee visibility of other devices' writes when using 2-stage or half-butterfly allreduce. Note that this adds a few microseconds of overhead.
Disable custom allreduce by default by setting the argument to True. Users can explicitly opt in by setting it to False.