
Report of increased memory overhead during cudagraph capture with nccl >= 2.19 #1234

Closed
youkaichao opened this issue Mar 24, 2024 · 15 comments

@youkaichao

Hi, I would like to report a memory issue with nccl. A reproducible example is attached below:

On a GCP g2-standard-24 instance (with 2 L4 GPUs):

docker pull us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce
docker run --gpus all --shm-size=2g -it us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:a3c2340ae36ce8ee782691d30111377eaf7ae6ce -- /bin/bash

# inside docker
cd /vllm-workspace/tests/distributed
export NCCL_DEBUG=TRACE
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s --forked test_basic_distributed_correctness.py

Note that the code manually links against a pre-downloaded nccl==2.18.3. There is also an nccl==2.19.3 available inside the image; its path is /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 .

By adding the following line to /vllm-workspace/vllm/model_executor/parallel_utils/pynccl.py, right before nccl = ctypes.CDLL(so_file):

so_file = "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2"

we can force the program to use nccl 2.19.3, and we then get an OOM error.
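
For reference, this is roughly how the override takes effect inside pynccl.py (a minimal sketch based on the description above; the rest of pynccl.py is omitted):

import ctypes

# Hard-code the path to the nccl 2.19.3 library shipped inside the image,
# overriding the pre-downloaded nccl 2.18.3 detected earlier.
so_file = "/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2"

# From here on, every nccl symbol resolves against the hard-coded library.
nccl = ctypes.CDLL(so_file)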

The background:

In distributed inference, https://github.com/vllm-project/vllm uses nccl together with cudagraph. We capture about 30 graphs with different batch sizes. The memory overhead when we use pytorch 2.1.2 (with nccl==2.18.3) is nearly zero (about 10MB per graph, and sometimes zero); however, when we upgrade to pytorch 2.2.0 (with nccl==2.19.3), the memory overhead is more than 100MB per graph.
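
As a rough illustration of how we measure the per-graph overhead (a minimal sketch using torch.cuda.mem_get_info, the same device-level counter used in the reproducer later in this thread; the captured work is only a placeholder):

import torch

static_x = torch.zeros((1, 1024), device="cuda")   # placeholder work buffer

def gpu_used_mb() -> float:
    # Device-wide used memory in MB, as reported by the CUDA driver.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024 / 1024

before = gpu_used_mb()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_x.add_(1.0)   # stands in for one captured decoding step
print(f"capture overhead: {gpu_used_mb() - before:.1f} MB")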

We spent more than a week (honestly, more time than one would feel comfortable with) investigating the issue. We initially thought it was related to pytorch, but eventually found that the problem comes from the nccl library.

For more code on measuring the memory overhead, please check vllm-project/vllm#3442 (comment) .

It would be very helpful if the nccl team could point out the root cause of the memory overhead, and any potential knobs to control it (e.g. via environment variables). The above problem happens with both nccl==2.19.3 and nccl==2.20.5.

Thank you for your time.

@sjeaugey
Member

Thanks for the report. @jbachan did we change something in 2.19 regarding the way we store graph-related data for NCCL?

@youkaichao how many graphs do you expect to create? An order of magnitude is fine, it doesn't have to be precise; in my mind, apps would create a couple of graphs (fewer than 10 for sure), so even 100MB per graph doesn't seem like a big problem. If the expectation is that we should allow for tens or hundreds of graphs, it could influence how we fix the issue and design things in the future.

@youkaichao
Author

how many graphs do you expect to create?

About 30, each for a different batch size (8, 16, 24, ..., 256).

Technically, we capture the graph with the largest batch size first, followed by smaller batch sizes. We find that cudagraphs with smaller batch sizes can share the memory buffers of the larger ones, so even with more than 30 cudagraphs, the memory overhead prior to nccl 2.19 is low.
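
A minimal sketch of that capture order (descending batch sizes sharing one graph memory pool; compute() is a placeholder standing in for the real model forward pass):

import torch

static_x = torch.zeros((256, 1024), device="cuda")   # placeholder input buffer

def compute(bs: int) -> None:
    static_x[:bs].add_(1.0)   # stands in for the real forward pass

batch_sizes = list(range(8, 257, 8))   # 8, 16, ..., 256 -> about 30 graphs

graphs = {}
pool = None   # shared graph memory pool, created by the first (largest) capture
# Capture from the largest batch size down, so later (smaller) captures can
# reuse buffers already reserved by the earlier (larger) ones.
for bs in sorted(batch_sizes, reverse=True):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool):
        compute(bs)
    graphs[bs] = g
    pool = g.pool()   # hand the shared pool to the next capture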

@youkaichao
Author

@sjeaugey hi, any update on this? 👀

@youkaichao
Author

@sjeaugey hi, any update on this? 👀

@youkaichao
Author

@sjeaugey we finally found that this problem is related to virtual memory usage, and solved it in vllm-project/vllm#5091 .

It would be great if you could share why virtual memory is allocated during graph capture, what it is used for, and what the performance impact of turning it off is 🙏

@sjeaugey
Member

Thanks for the feedback. I'm not sure I understand how you solved it though. Was it a proper fix or a workaround?

From what I understand, you're saying that enabling CUMEM increases the memory usage when using CUDA graphs. Is that accurate? Did you confirm the problem went away with NCCL_CUMEM_ENABLE=0?

@youkaichao
Author

Yes, the fix is to set NCCL_CUMEM_ENABLE=0.

@youkaichao
Author

I would say this is only a workaround. I don't know why NCCL uses more memory with cudagraphs when NCCL_CUMEM_ENABLE is on, which is the default in nccl 2.19.
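
For anyone hitting the same issue, the variable has to be set before any NCCL communicator is created, e.g. at the top of the launcher script (a minimal sketch):

import os

# Workaround: disable NCCL's cuMem-based allocations (enabled by default
# since nccl 2.19); this avoids the extra per-operation memory seen during
# cudagraph capture. It must be set before any NCCL communicator is created.
os.environ["NCCL_CUMEM_ENABLE"] = "0"

# ... then initialize the process group / NCCL communicators as usual.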

@youkaichao
Author

Here is a minimal reproducible example:

import torch
import torch.distributed as dist
from contextlib import contextmanager

@contextmanager
def graph_capture(pool=None, stream=None, capture_error_mode: str = "global", dump_path=None):
    g = torch.cuda.CUDAGraph()
    if dump_path is not None:
        g.enable_debug_mode()
    with torch.cuda.graph(cuda_graph=g, pool=pool, stream=stream, capture_error_mode=capture_error_mode):
        yield g
    if dump_path is not None:
        g.debug_dump(dump_path)

dist.init_process_group(backend="gloo", init_method="env://")
rank = dist.get_rank()
torch.cuda.set_device(rank)

from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=dist.group.WORLD, device=rank)
pynccl.disabled = False

MAX_BATCHSIZE = 4

# Placeholder input used for capture
static_a = torch.zeros((MAX_BATCHSIZE, 1024), device="cuda")

def compute(batchsize):
    pynccl.all_reduce(static_a[:batchsize], stream=torch.cuda.current_stream())

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(1, MAX_BATCHSIZE + 1):
        compute(i)
torch.cuda.current_stream().wait_stream(s)

def report_memory(prefix):
    free, total = torch.cuda.mem_get_info()
    used = total - free
    print(f"{prefix}: Used: {used / 1024 / 1024} MB, Free: {free / 1024 / 1024} MB, Total: {total / 1024 / 1024} MB")

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
report_memory("Before capture")
graphs = [0] # 0 is a placeholder for 0 batchsize
memory_pool = None
for i in range(1, MAX_BATCHSIZE + 1):
    with graph_capture(pool=memory_pool) as g:
        compute(i)
    graphs.append(g)
    memory_pool = g.pool()
    report_memory(f"After capture batchsize {i}")
# Run the graph
static_a[:2] += 1
graphs[2].replay()
torch.cuda.current_stream().synchronize()
print(static_a[:2])

I call allreduce on a static buffer and capture only this allreduce operation in a cudagraph. Ideally it should not cost any additional memory.

When I run it with the default settings:

INFO 05-30 18:33:53 utils.py:619] Found nccl from library libnccl.so.2
INFO 05-30 18:33:53 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 05-30 18:33:53 utils.py:619] Found nccl from library libnccl.so.2
INFO 05-30 18:33:53 pynccl.py:65] vLLM is using nccl==2.20.5
Before capture: Used: 1373.375 MB, Free: 79677.25 MB, Total: 81050.625 MB
Before capture: Used: 1373.375 MB, Free: 79677.25 MB, Total: 81050.625 MB
After capture batchsize 1: Used: 1379.375 MB, Free: 79671.25 MB, Total: 81050.625 MB
After capture batchsize 1: Used: 1379.375 MB, Free: 79671.25 MB, Total: 81050.625 MB
After capture batchsize 2: Used: 1381.375 MB, Free: 79669.25 MB, Total: 81050.625 MB
After capture batchsize 2: Used: 1381.375 MB, Free: 79669.25 MB, Total: 81050.625 MB
After capture batchsize 3: Used: 1383.375 MB, Free: 79667.25 MB, Total: 81050.625 MB
After capture batchsize 3: Used: 1383.375 MB, Free: 79667.25 MB, Total: 81050.625 MB
After capture batchsize 4: Used: 1385.375 MB, Free: 79665.25 MB, Total: 81050.625 MB
After capture batchsize 4: Used: 1385.375 MB, Free: 79665.25 MB, Total: 81050.625 MB
tensor([[2., 2., 2.,  ..., 2., 2., 2.],
        [2., 2., 2.,  ..., 2., 2., 2.]], device='cuda:0')
tensor([[2., 2., 2.,  ..., 2., 2., 2.],
        [2., 2., 2.,  ..., 2., 2., 2.]], device='cuda:1')

Every cudagraph takes 2 MB of additional memory.

If I run it with export NCCL_CUMEM_ENABLE=0:

INFO 05-30 18:30:52 utils.py:619] Found nccl from library libnccl.so.2
INFO 05-30 18:30:52 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 05-30 18:30:52 utils.py:619] Found nccl from library libnccl.so.2
INFO 05-30 18:30:52 pynccl.py:65] vLLM is using nccl==2.20.5
Before capture: Used: 1181.375 MB, Free: 79869.25 MB, Total: 81050.625 MB
Before capture: Used: 1181.375 MB, Free: 79869.25 MB, Total: 81050.625 MB
After capture batchsize 1: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 1: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 2: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 2: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 3: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 3: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 4: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
After capture batchsize 4: Used: 1185.375 MB, Free: 79865.25 MB, Total: 81050.625 MB
tensor([[2., 2., 2.,  ..., 2., 2., 2.],
        [2., 2., 2.,  ..., 2., 2., 2.]], device='cuda:1')
tensor([[2., 2., 2.,  ..., 2., 2., 2.],
        [2., 2., 2.,  ..., 2., 2., 2.]], device='cuda:0')

the memory cost does not increase as I capture more graphs.

@youkaichao
Author

@sjeaugey I just tried nccl 2.21.5, and the problem still exists. I wonder if this is because cuMemCreate is captured in the cuda graph? I don't see any documentation explaining the behavior of cuMemCreate under graph capture and graph execution.

@sjeaugey
Member

sjeaugey commented Jun 3, 2024

For each CUDA graph capture, we allocate some memory on the GPU to store the information related to that persistent operation:
https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L1094

With CUMEM enabled, each allocation has to be aligned to the mem granularity, i.e. 2MB, so it is not surprising you see 2MB allocated per graph.
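
In other words, every request is rounded up to the allocation granularity, so even a few kilobytes of per-capture bookkeeping reserves a full 2MB chunk (a toy illustration of the rounding, using the 2MB granularity mentioned above):

GRANULARITY = 2 * 1024 * 1024   # cuMem allocation granularity: 2 MB

def reserved_bytes(requested: int) -> int:
    # Size actually reserved when every allocation must be granularity-aligned.
    return -(-requested // GRANULARITY) * GRANULARITY

print(reserved_bytes(16 * 1024) // (1024 * 1024))   # a 16 KB request -> 2 (MB)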

I guess we'd need to add a sub-allocator to the CUMEM code to avoid allocating 2MB for each CUDA graph. CC @AddyLaddy

@youkaichao
Author

@sjeaugey so it is not 2MB per cuda graph, it is 2MB per allreduce operation per cuda graph.

In total, it costs 2MB * (# allreduce ops per graph) * (# graphs), which accumulates to GBs in our case.
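
As a concrete (hypothetical) illustration of that scaling, with 30 captured graphs and, say, 64 allreduce calls per graph:

MB = 1024 * 1024
num_graphs = 30            # batch sizes 8, 16, ..., 256
allreduce_per_graph = 64   # hypothetical; depends on the model's layer count
overhead = 2 * MB * allreduce_per_graph * num_graphs
print(overhead / (1024 ** 3), "GiB")   # -> 3.75 GiB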

In my minimal reproducible example, when I add multiple allreduce operations, e.g.

def compute(batchsize):
    pynccl.all_reduce(static_a[:batchsize], stream=torch.cuda.current_stream())
    pynccl.all_reduce(static_a[:batchsize], stream=torch.cuda.current_stream())

I do see the memory overhead grow proportionally (with 2 allreduce operations, each graph takes 4MB more memory):

Before capture: Used: 491.6875 MB, Free: 32002.4375 MB, Total: 32494.125 MB
Before capture: Used: 491.6875 MB, Free: 32002.4375 MB, Total: 32494.125 MB
After capture batchsize 1: Used: 499.6875 MB, Free: 31994.4375 MB, Total: 32494.125 MB
After capture batchsize 1: Used: 499.6875 MB, Free: 31994.4375 MB, Total: 32494.125 MB
After capture batchsize 2: Used: 503.6875 MB, Free: 31990.4375 MB, Total: 32494.125 MB
After capture batchsize 2: Used: 503.6875 MB, Free: 31990.4375 MB, Total: 32494.125 MB
After capture batchsize 3: Used: 507.6875 MB, Free: 31986.4375 MB, Total: 32494.125 MB
After capture batchsize 3: Used: 507.6875 MB, Free: 31986.4375 MB, Total: 32494.125 MB
After capture batchsize 4: Used: 511.6875 MB, Free: 31982.4375 MB, Total: 32494.125 MB
After capture batchsize 4: Used: 511.6875 MB, Free: 31982.4375 MB, Total: 32494.125 MB
tensor([[4., 4., 4.,  ..., 4., 4., 4.],
        [4., 4., 4.,  ..., 4., 4., 4.]], device='cuda:0')
tensor([[4., 4., 4.,  ..., 4., 4., 4.],
        [4., 4., 4.,  ..., 4., 4., 4.]], device='cuda:1')

Hope it can be fixed soon 🙏

@sjeaugey
Member

sjeaugey commented Jun 4, 2024

You're right, it's 2MB per operation within the graph. My previous comment still applies though; we'd need to implement a sub-allocator for CUMEM operations to reduce the memory usage with CUMEM.
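
To illustrate the idea (a toy sketch of a slab sub-allocator, not NCCL's code): instead of reserving a full granularity-sized chunk per request, a sub-allocator would reserve one 2MB slab and carve small aligned regions out of it, only growing by another slab when the current one is full.

class ToySlabSuballocator:
    # Toy sketch only: hand out small aligned regions from shared 2 MB slabs
    # instead of reserving one full 2 MB chunk per request.
    SLAB = 2 * 1024 * 1024   # cuMem granularity
    ALIGN = 256              # per-request alignment (illustrative)

    def __init__(self):
        self.used = []       # bytes used in each slab

    def alloc(self, size: int):
        size = -(-size // self.ALIGN) * self.ALIGN
        for i, u in enumerate(self.used):
            if u + size <= self.SLAB:
                self.used[i] = u + size
                return i, u                   # (slab index, offset within slab)
        self.used.append(size)                # memory only grows one slab at a time
        return len(self.used) - 1, 0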

@youkaichao
Author

How do the cumem-related APIs behave under graph capture? The documentation does not say anything about it.

@davidthomas426

Is this right?

  • When using cumem ops, graph capture ignores them -- instead, by using the lower-level cumem ops, you are allocating directly each time you run a cuda graph capture. I'm not sure how cuMemRelease is handled with graph capture though...
  • When not using cumem ops, cudaMallocAsync and cudaFreeAsync are basically used instead. These use a stream-ordered suballocator with special behavior during graph capture, so that buffers can get reused for non-overlapping operations and are eligible to become part of the shared memory pool across cuda graphs.

@sjeaugey Could you clarify the behavior of cumem ops with cuda graphs, especially what happens when pointers are captured that are then released with cuMemRelease? Does this become a use-after-free? If not, how?
