
[Bug]: Producer process has been terminated before all shared CUDA tensors released (v0.5.0.post1, v0.4.3) #6025

Open
yaronr opened this issue Jul 1, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

yaronr commented Jul 1, 2024

Your current environment

Docker image: vllm/vllm-openai:v0.4.3 and v0.5.0.post1

Params:

```
--model=microsoft/Phi-3-medium-4k-instruct
--tensor-parallel-size=2
--disable-log-requests
--trust-remote-code
--max-model-len=2048
--gpu-memory-utilization=0.9
```

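For reference, a minimal offline-engine equivalent of these server flags, assuming the `vllm.LLM` Python API (`--disable-log-requests` has no offline counterpart):

```python
# Hedged sketch: rough Python-API equivalent of the server flags above.
# Parameter names mirror the CLI flags; this is illustrative, not the
# reporter's actual code.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",
    tensor_parallel_size=2,       # the setting that spawns the extra worker process
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)
```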
The container freezes (does nothing) after printing the following exception in the log.

🐛 Describe the bug

Original exception was:

```
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```
yaronr added the bug label on Jul 1, 2024
youkaichao (Member) commented:

Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out what is happening here?
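For readers who don't want to dig through the guide, a minimal sketch of the kind of diagnostics it describes; the environment variable names are taken from the vLLM docs as of this writing, so double-check against the current guide:

```python
# Hedged sketch: debug settings from the linked guide, set before vLLM
# is imported or the server is started.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # verbose engine logging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # synchronous CUDA kernels, clearer stack traces
os.environ["NCCL_DEBUG"] = "TRACE"          # detailed NCCL communication logs
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # trace every function call (very slow; hang debugging only)
```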

sayakpaul (Contributor) commented:

@youkaichao I am seeing the same log message as well. What is the general recommendation to remedy it, if it's a critical issue? In my case, the program runs fine despite that message.

youkaichao (Member) commented:

> In my case, the program runs fine despite that message.

Then it's just a warning you can ignore.
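Background on where the warning comes from: PyTorch shares CUDA tensors between processes via CUDA IPC, and `CudaIPCTypes.cpp` emits this warning when the producer process exits while a consumer still holds a shared tensor. A minimal illustrative sketch of that mechanism (not taken from the vLLM codebase):

```python
# Hedged sketch: CUDA tensor sharing across processes via torch.multiprocessing.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()                      # arrives via CUDA IPC, not a device copy
    print(t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")     # required for CUDA with multiprocessing
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(torch.ones(4, device="cuda"))  # tensor is shared, not copied
    p.join()  # waiting keeps the producer alive until the consumer is done;
              # if the producer died first (e.g. a crash or abrupt shutdown),
              # PyTorch would emit the CudaIPCTypes.cpp warning above
```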

OsaCode commented Nov 5, 2024

Same problem here. Running:

```python
from vllm import LLM  # import added; the original snippet omitted it

model_name = "allenai/Molmo-7B-D-0924"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
)
```

Getting:

```
INFO 11-05 15:35:36 config.py:1704] Downcasting torch.float32 to torch.bfloat16.
INFO 11-05 15:35:41 config.py:944] Defaulting to use mp for distributed inference
INFO 11-05 15:35:41 llm_engine.py:242] Initializing an LLM engine (v0.6.3.post2.dev127+g2adb4409) with config: model='allenai/Molmo-7B-D-0924', speculative_config=None, tokenizer='allenai/Molmo-7B-D-0924', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=allenai/Molmo-7B-D-0924, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
WARNING 11-05 15:35:42 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 15:35:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank0]:[W1105 15:35:43.442228048 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1105 15:35:43.442235220 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

On an instance with two NVIDIA L4 GPUs.
It kills my kernel.
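The two `Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'` lines point at vLLM's custom all-reduce kernel; the IPC warnings are just fallout from the workers dying. One workaround worth trying (an assumption, not a confirmed fix) is to disable that kernel via the `disable_custom_all_reduce` flag visible in the logged engine config:

```python
# Hedged sketch: fall back to NCCL for all-reduce by disabling vLLM's
# custom all-reduce kernel, the component raising the error above.
from vllm import LLM

llm = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # skip the custom_all_reduce.cuh path
)
```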

OsaCode commented Nov 5, 2024

I upgraded vLLM after reading #9774, and that fixed this issue, although I still crash for a different reason.
