Some fixes for custom allreduce kernels #2760

Merged: 25 commits into vllm-project:main on Mar 22, 2024

Conversation

hanzhi713
Contributor

@hanzhi713 hanzhi713 commented Feb 5, 2024

Recently, there have been some reports of stuck generation or garbage text when custom allreduce is enabled. While I didn't manage to reproduce any issues on A30 and A100, I did find some potentially unsafe synchronization, which I attempt to fix here.

  1. When using the signal flag, GPUs write to different bytes of the same 8-byte signal. Although the writes are strong writes, they are not considered morally strong under the CUDA memory model because they do not overlap completely. Hence, they still constitute data races.
  2. When using 2-stage allreduce or half-butterfly allreduce, a __threadfence_system or a release-acquire pattern is needed to guarantee the visibility of other devices' writes on the current device; this is missing from the current implementation. (A hedged sketch of such a signaling pattern is shown right after this list.)
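
For illustration, here is a minimal sketch of such a signaling pattern, assuming hypothetical names (Signal, cross_device_barrier). It is not the kernel in csrc/custom_all_reduce.cuh, only the shape of the fix: each rank writes its own full, aligned uint32, and system-scope fences provide the release/acquire ordering.

#include <cstdint>

// One full, naturally aligned 32-bit word per writing rank, so writes from
// different GPUs never partially overlap (i.e., they remain morally strong).
struct Signal {
  volatile uint32_t flags[8];
};

__device__ void cross_device_barrier(Signal* self, Signal* peers[],
                                     int rank, int world_size, uint32_t flag) {
  __syncthreads();           // all threads of this block finished their data writes
  __threadfence_system();    // release: order those writes before the flag write,
                             // at system scope (visible to other devices)
  if (threadIdx.x < world_size) {
    peers[threadIdx.x]->flags[rank] = flag;        // raise this rank's flag on peer threadIdx.x
    while (self->flags[threadIdx.x] != flag) { }   // spin until peer threadIdx.x raised its flag here
  }
  __threadfence_system();    // acquire: data the peers wrote before their flags is now visible
  __syncthreads();           // no thread reads peer buffers until every spin has finished
}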

Related issues:
#2788 (garbage output when upgrading vllm from 0.2.7 -> 0.3.0)
#2742 (garbage output. Solved when disable_custom_all_reduce=True)
#2731 (one person reported that disable_custom_all_reduce=True solves generation hanging)

In this PR, I made the following changes:

  1. Use 8 uint32s instead of 8 bytes per signal per device, so that each device writes a full, naturally aligned word.
  2. Simplify synchronization by only syncing blocks of the same index across GPUs. Some changes ensure that each thread's reads only depend on writes from the thread with the same id on other devices.
  3. Add a __threadfence_system to guarantee visibility of other devices' writes when using 2-stage or half-butterfly allreduce. Note that this adds a few microseconds of overhead.
  4. Remove support for more than two PCIe-only GPUs because the performance improvement is small.
  5. Add additional P2P checks to avoid buggy driver/hardware P2P support (a hedged sketch of such a check follows this list). Might fix "Mixtral GPTQ with TP=2 not generating output" (#2728).
  6. Add a check for the device count when running the P2P test. Should fix "Distributed inference on multi machine (error Invalid peer device id)" (#2795).
  7. Disable custom allreduce by default by setting disable_custom_all_reduce to True. Users can explicitly opt in by setting it to False.
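
As an illustration of the checks in items 5 and 6, here is a hedged host-side sketch using only standard CUDA runtime calls; the function name is hypothetical and this is not vLLM's actual helper.

#include <cuda_runtime.h>

// Hypothetical helper: return true only if P2P access is reported between
// every pair of ranks actually in use. A sketch, not vLLM's code.
bool p2p_usable(int world_size) {
  int device_count = 0;
  if (cudaGetDeviceCount(&device_count) != cudaSuccess) return false;
  // The visible device count can exceed world_size when only the first few
  // GPUs are used; never probe a peer id beyond what CUDA reports, which
  // would fail with an invalid-peer-device error.
  if (world_size > device_count) return false;
  for (int i = 0; i < world_size; ++i) {
    for (int j = 0; j < world_size; ++j) {
      if (i == j) continue;
      int can_access = 0;
      if (cudaDeviceCanAccessPeer(&can_access, i, j) != cudaSuccess || !can_access) {
        return false;  // fall back to NCCL instead of the custom kernel
      }
    }
  }
  return true;
}

A capability report alone does not catch every buggy driver/hardware combination, which is why item 5 speaks of additional checks; the sketch only shows the basic gate.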

@hanzhi713 hanzhi713 changed the title Safer sync Fix unsafe synchronization for custom allreduce kernels Feb 7, 2024
@hanzhi713 hanzhi713 changed the title Fix unsafe synchronization for custom allreduce kernels [WIP] Fix unsafe synchronization for custom allreduce kernels Feb 7, 2024
@hanzhi713 hanzhi713 changed the title [WIP] Fix unsafe synchronization for custom allreduce kernels Fix unsafe synchronization for custom allreduce kernels Feb 8, 2024
@hanzhi713
Contributor Author

hanzhi713 commented Feb 8, 2024

@WoosukKwon Did any of you manage to reproduce custom allreduce getting stuck or generating garbage output?

@hanzhi713 hanzhi713 changed the title Fix unsafe synchronization for custom allreduce kernels [WIP] Fix unsafe synchronization for custom allreduce kernels Feb 9, 2024
@NikolaBorisov
Contributor

To reproduce, I run vLLM on 4x A100 80G SXM with CodeLlama 70B. I send some requests like this:

for i in {0..100}; do curl "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "model": "codellama/CodeLlama-70b-Instruct-hf",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens":100
  }' & sleep 2; done

It usually gets stuck before request 50 for me.

@hanzhi713
Contributor Author


I did not manage to reproduce this when enforce_eager=True. Custom allreduce is running fine, I think.

@hanzhi713
Contributor Author

@WoosukKwon This PR is ready for review.

@hanzhi713 hanzhi713 changed the title [WIP] Fix unsafe synchronization for custom allreduce kernels Some fixes for custom allreduce kernels Feb 15, 2024
@WoosukKwon
Collaborator

@hanzhi713 Sorry for the delays in the review. I will review the PR this weekend and make sure this is included in v0.3.4.

@tdene

tdene commented Mar 1, 2024

Actually.

@hanzhi713 have you tested this PR with a MoE model like Mixtral?
When using this PR merged on top of 5255d99, I'm seeing:

custom_all_reduce.py:239] Registering 2 cuda graph addresses
[...]
File "/workspace/vllm/vllm/worker/worker.py", line 160, in warm_up_model 
  self.model_runner.capture_model(self.gpu_cache)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
   return func(*args, **kwargs)
File "/workspace/vllm/vllm/worker/model_runner.py", line 725, in capture_model
  graph_runner.capture(
[...]
File "/workspace/vllm/vllm/model_executor/models/mixtral.py", line 130, in forward
  final_hidden_states = fused_moe(hidden_states,
File "/workspace/vllm/vllm/model_executor/layers/fused_moe.py", line 276, in fused_moe
   gating_output.float(),  # TODO(woosuk): Optimize this.
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@hanzhi713
Contributor Author

hanzhi713 commented Mar 3, 2024

@tdene Hmm maybe the buffer isn't placed on the correct cuda device. Let me check tomorrow.

@hanzhi713
Contributor Author

@tdene I didn't manage to reproduce this error

@hanzhi713
Contributor Author

@tdene I merged this branch with the latest main. Can you rerun your test? Also, I find it strange to see Registering 2 cuda graph addresses because normally with cuda graph enabled, we should see at least a few hundred addresses, if not thousands.

@hanzhi713
Contributor Author

@zhuohan123 Looks like @WoosukKwon is too busy. Can you help me get a different reviewer for this PR?

Collaborator

@WoosukKwon WoosukKwon left a comment


@hanzhi713 Apologies for the very late review. I had no bandwidth recently.

The PR looks good to me, though I still don't understand the cause of the bug. I left some minor comments. Please take a look at them.

vllm/config.py (outdated review comments, resolved)
vllm/engine/arg_utils.py (outdated review comments, resolved)
Comment on lines +145 to +146
# note: num dev can be larger than world_size if we're only using
# first few GPUs
Collaborator


This is the case when more than one node (host) is used for TP, right?

Contributor Author


No. This check is for the case where users have N GPUs but are only using the first M GPUs, where M < N.
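
To make this concrete, a minimal host-side sketch with illustrative values only (not vLLM's code):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int num_dev = 0;
  cudaGetDeviceCount(&num_dev);  // e.g. 8 on an 8-GPU host
  int world_size = 2;            // e.g. TP over only the first 2 GPUs
  // num_dev > world_size is legal and common; the P2P probe must be limited
  // to ranks [0, world_size), not [0, num_dev).
  printf("visible devices: %d, world size: %d\n", num_dev, world_size);
  return 0;
}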

csrc/custom_all_reduce_test.cu (outdated review comments, resolved)
@WoosukKwon
Collaborator

WoosukKwon commented Mar 21, 2024

@hanzhi713 Thanks! Let me do extra tests on the PR and merge.

@WoosukKwon
Collaborator

@hanzhi713 I will merge this as I found this worked well on my 4 A100-80GB machine. Thanks for the fix!

@WoosukKwon WoosukKwon merged commit f721096 into vllm-project:main Mar 22, 2024
31 checks passed
@WoosukKwon
Collaborator

@hanzhi713 Actually, we recently found that TRT-LLM's custom all reduce kernel is extremely simple. Do you have an idea why it can be much simpler than this implementation? What do you think about using TRT-LLM's kernel?

@hanzhi713
Contributor Author


It looks to me like the implementation has about the same complexity as mine. What makes you think it looks simpler?

@garycaokai

In my test with 4 non-NVLink-capable GPUs, custom allreduce has a 20% performance improvement at batch size 1, despite this comment in the code:
// for 4 or more non NVLink-capable GPUs, custom allreduce provides little
// performance improvement over NCCL.

@garycaokai

@hanzhi713 can you make the 4 non-NVLink-capable GPU case available as an option?

@hanzhi713
Contributor Author

@garycaokai Curious about your setup. What GPUs are you using? Are they all connected to a PCIe switch or are they connected to CPU directly?

Yes I can provide that as an option if I find some time to work on this.

@hanzhi713
Contributor Author

@garycaokai Also, how did you measure the performance improvement? Is it a latency benchmark? And what is the configuration for the benchmark?

@garycaokai

@garycaokai Curious about your setup. What GPUs are you using? Are they all connected to a PCIe switch or are they connected to CPU directly?

Yes I can provide that as an option if I find some time to work on this.
72B int4 model, tp=4, batch size 1: decode speed goes from 20 token/s to 26 token/s.
8x A30 GPUs.
#nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX PIX PIX SYS SYS SYS SYS NODE NODE 0-11,48-59 0 N/A
GPU1 PIX X PIX PIX SYS SYS SYS SYS NODE NODE 0-11,48-59 0 N/A
GPU2 PIX PIX X PIX SYS SYS SYS SYS NODE NODE 0-11,48-59 0 N/A
GPU3 PIX PIX PIX X SYS SYS SYS SYS NODE NODE 0-11,48-59 0 N/A
GPU4 SYS SYS SYS SYS X PIX PIX PIX SYS SYS 24-35,72-83 2 N/A
GPU5 SYS SYS SYS SYS PIX X PIX PIX SYS SYS 24-35,72-83 2 N/A
GPU6 SYS SYS SYS SYS PIX PIX X PIX SYS SYS 24-35,72-83 2 N/A
GPU7 SYS SYS SYS SYS PIX PIX PIX X SYS SYS 24-35,72-83 2 N/A
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PIX
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1

@hanzhi713
Contributor Author

hanzhi713 commented Apr 7, 2024

I see, this makes a lot of sense. Usually A30 machines have GPUs connected to the CPU directly, and CPUs are often terrible PCIe switches. My implementation relies on PCIe P2P. However, in your case a PCIe switch connects each group of 4 GPUs. Given the much better switching performance, my implementation may work and provide performance improvements.

Let me see if I can find machines of this topology
