TCPStore is not available #3334
I have the same problem |
Can you try enforce_eager=True? In general, your environment doesn't seem to play well with the cupy NCCL backend. |
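As context for the suggestion above, a minimal sketch of passing enforce_eager through vLLM's offline LLM API; the model name and parallel size are placeholders, not taken from this report:

from vllm import LLM, SamplingParams

# Sketch: enforce_eager=True disables CUDA graph capture, so the cupy-based
# NCCL path used for graph mode is not exercised.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    tensor_parallel_size=2,     # placeholder; >1 enables the distributed path
    enforce_eager=True,         # fall back to eager execution
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)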
I think that's the version in which the cupy backend was introduced. It is basically there to enable CUDA graphs (because they didn't work with the default NCCL backend). |
Do you have any other errors in the logs? For some reason, your env cannot initialize the cupy backend, but it is difficult to know from the information you just posted. One way to debug is to run this yourself and see why it fails:

from vllm.model_executor.parallel_utils import cupy_utils

cupy_utils.init_process_group(
    world_size=1,
    rank=0,
    host="localhost",
    port=<choose port>,
)
|
I can reproduce this when I use a spawned process to run vllm. There is another problem: this code catches any exception without logging it: vllm/vllm/executor/ray_gpu_executor.py, line 338 in 253a980
|
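As an illustration of what the previous comment is pointing at, here is a hedged sketch of logging the exception instead of silently swallowing it; the wrapper name and logger are hypothetical, not the actual vllm source:

import logging

logger = logging.getLogger("vllm.executor")

def call_with_logging(init_fn):
    # Illustrative only: surface the real failure (e.g. the cupy/TCPStore
    # error) instead of hiding it behind a bare except block.
    try:
        return init_fn()
    except Exception:
        logger.exception("cupy NCCL backend initialization failed")
        raise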
Same issue here with the latest vllm version (0.3.3). |
@Z-Diviner can you give me a copyable docker run command? I can try to repro in my local env. |
Hello, I have the same issue with version 0.3.3 running on a Ray cluster in Kubernetes. Everything works fine on a single node with multiple GPUs and --tensor-parallel-size enabled, but running the same config with an additional worker node results in "TCPStore is not available". I'll be happy to provide any information to help resolve this issue. Thank you |
@CodeScriptum it'd be great if you could provide me with a way to reproduce the issue! I'd like to make it clear that this doesn't seem like a vllm or Ray issue; it must be that cupy is somehow not working in your environment. I can try to help debug in this case, but I'd need to reproduce it, since I think it only happens in some environments. |
Hello, I'm sorry, I just saw this. Below is the command for my docker run:
|
I think you can try running inference with multiprocessing and a tensor-parallel-size greater than 1; this reproduces the error. The error occurs when the vllm version is greater than 0.3.0, so it should be due to cupy. The specific reasons for this can be found below.
|
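The reproduction described above can be sketched roughly as follows, assuming the engine is built inside a spawned child process; the model name and parallel size are placeholders:

import multiprocessing as mp

def run_vllm():
    # Building the engine in a spawned child process with tensor parallelism > 1
    # is the setup reported to trigger the cupy/TCPStore failure on versions > 0.3.0.
    from vllm import LLM
    llm = LLM(
        model="facebook/opt-125m",  # placeholder model
        tensor_parallel_size=2,     # must be > 1 to hit the distributed code path
    )
    print(llm.generate(["ping"])[0].outputs[0].text)

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    p = mp.Process(target=run_vllm)
    p.start()
    p.join()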
Btw, there's also an effort to remove cupy from the dependencies (#3625). I may not have time to tackle this for the next few days. |
I have the same issue. When I set enforce_eager=True, I get a new error:

INFO 03-26 13:20:22 llm_engine.py:87] Initializing an LLM engine with config: model='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Dell-Dev-U:2803813:2803813 [0] init.cc:1270 NCCL WARN Invalid config blocking attribute value -2147483648 |
For people who encountered the problem, please try this docker image. We plan to remove cupy from the dependencies. |
Hi, I confirm that the new backend is working well. I did some quick tests using v0.4.0 built from source on a Ray cluster with head and worker nodes, and I haven't noticed any issues so far with distributed inference. I will give more feedback if I encounter any instability. Thank you for the great work. |
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you! |
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you! |
Hello, when I use vllm 0.3.2 and deepseek-coder-33b-instruct to start the service through Docker, the following error is reported. What is going on?