
[Bug]: Producer process has been terminated before all shared CUDA tensors released (v0.5.0.post1, v0.4.3) #6025

Open
yaronr opened this issue Jul 1, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

yaronr commented Jul 1, 2024

Your current environment

Docker image: vllm/vllm-openai:v0.4.3 and v0.5.0.post1

Params:

```
--model=microsoft/Phi-3-medium-4k-instruct
--tensor-parallel-size=2
--disable-log-requests
--trust-remote-code
--max-model-len=2048
--gpu-memory-utilization=0.9
```

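For reference, a minimal offline-engine equivalent of these server flags, assuming the `vllm.LLM` Python API (`--disable-log-requests` has no offline counterpart):

```python
# Hedged sketch: rough Python-API equivalent of the server flags above.
# Parameter names mirror the CLI flags; this is illustrative, not the
# reporter's actual code.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",
    tensor_parallel_size=2,       # the setting that spawns the extra worker process
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)
```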
The container freezes (does nothing) after printing the following exception in the log.

🐛 Describe the bug

Original exception was:

```
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```
yaronr added the bug label on Jul 1, 2024
youkaichao (Member) commented:

Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out what is happening here?
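For readers who don't want to dig through the guide, a minimal sketch of the kind of diagnostics it describes; the environment variable names are taken from the vLLM docs as of this writing, so double-check against the current guide:

```python
# Hedged sketch: debug settings from the linked guide, set before vLLM
# is imported or the server is started.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # verbose engine logging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # synchronous CUDA kernels, clearer stack traces
os.environ["NCCL_DEBUG"] = "TRACE"          # detailed NCCL communication logs
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # trace every function call (very slow; hang debugging only)
```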

sayakpaul (Contributor) commented:

@youkaichao I am seeing the same log message as well. What is the general recommendation to remedy it, if it's a critical issue? In my case, the program runs fine despite that message.

youkaichao (Member) commented:

> In my case, the program runs fine despite that message.

Then it's just a warning you can ignore.
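Background on where the warning comes from: PyTorch shares CUDA tensors between processes via CUDA IPC, and `CudaIPCTypes.cpp` emits this warning when the producer process exits while a consumer still holds a shared tensor. A minimal illustrative sketch of that mechanism (not taken from the vLLM codebase):

```python
# Hedged sketch: CUDA tensor sharing across processes via torch.multiprocessing.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()                      # arrives via CUDA IPC, not a device copy
    print(t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")     # required for CUDA with multiprocessing
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(torch.ones(4, device="cuda"))  # tensor is shared, not copied
    p.join()  # waiting keeps the producer alive until the consumer is done;
              # if the producer died first (e.g. a crash or abrupt shutdown),
              # PyTorch would emit the CudaIPCTypes.cpp warning above
```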

OsaCode commented Nov 5, 2024

Same problem here. Running:

```python
from vllm import LLM  # import added; the original snippet omitted it

model_name = "allenai/Molmo-7B-D-0924"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
)
```

Getting:

```
INFO 11-05 15:35:36 config.py:1704] Downcasting torch.float32 to torch.bfloat16.
INFO 11-05 15:35:41 config.py:944] Defaulting to use mp for distributed inference
INFO 11-05 15:35:41 llm_engine.py:242] Initializing an LLM engine (v0.6.3.post2.dev127+g2adb4409) with config: model='allenai/Molmo-7B-D-0924', speculative_config=None, tokenizer='allenai/Molmo-7B-D-0924', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=allenai/Molmo-7B-D-0924, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
WARNING 11-05 15:35:42 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 15:35:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank0]:[W1105 15:35:43.442228048 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1105 15:35:43.442235220 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

On an instance with two NVIDIA L4 GPUs.
It kills my kernel.
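The two `Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'` lines point at vLLM's custom all-reduce kernel; the IPC warnings are just fallout from the workers dying. One workaround worth trying (an assumption, not a confirmed fix) is to disable that kernel via the `disable_custom_all_reduce` flag visible in the logged engine config:

```python
# Hedged sketch: fall back to NCCL for all-reduce by disabling vLLM's
# custom all-reduce kernel, the component raising the error above.
from vllm import LLM

llm = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # skip the custom_all_reduce.cuh path
)
```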

OsaCode commented Nov 5, 2024

I upgraded vLLM after reading #9774, and that fixed this issue, although I still crash for a different reason.
