
vLLM Distributed Inference stuck when using multi-GPU #2466

Closed
RathoreShubh opened this issue Jan 17, 2024 · 11 comments

Comments

@RathoreShubh

I am trying to run an inference server on multiple GPUs (4× NVIDIA GeForce RTX 3090) with this command:

python -u -m vllm.entrypoints.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 4

This works fine with --tensor-parallel-size=1, but with --tensor-parallel-size > 1 it gets stuck on startup.

Thanks
(Screenshot attached: startup hang, 2024-01-17)

@RhizoNymph

This is happening to me too, on 2× 3090.

@s-natsubori

Try these parameters:
--gpu-memory-utilization 0.7–0.9
--max-model-len 8192
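As a rough sketch of why capping --max-model-len matters: vLLM reserves KV-cache memory that grows with the context length, so a smaller limit leaves more headroom per GPU. The estimate below assumes a Mistral-7B-style configuration (32 layers, 8 KV heads, head dim 128, fp16); these numbers are illustrative back-of-envelope figures, not vLLM's internal memory accounting.

```python
# Illustrative KV-cache size estimate for one sequence.
# All parameter values are assumptions for a Mistral-7B-like model.
def kv_cache_bytes(max_model_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    # Factor of 2 accounts for both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * max_model_len

gb = kv_cache_bytes(8192) / 1024**3
print(f"{gb:.2f} GiB")  # → 1.00 GiB per full-length sequence
```

Halving --max-model-len roughly halves this reservation, which can be the difference between fitting and OOM-ing (or stalling) on a 24 GB card.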

@Double-bear

Try these parameters: --gpu-memory-utilization 0.7–0.9 --max-model-len 8192

Hello, I have tried the method you provided, but it has no effect.

@RhizoNymph

No effect here either

@BilalKHA95

Did you find a solution? I have the same issue.

@shubham-bnxt

@BilalKHA95 try this

export NCCL_P2P_DISABLE=1

This worked for me.
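Why this can help, as a sketch: consumer GPUs such as the RTX 3090 reportedly lack P2P support over PCIe, and NCCL can hang at startup while trying to establish peer-to-peer transport between the tensor-parallel workers. Exporting NCCL_P2P_DISABLE=1 before launching works because child processes inherit the variable; the snippet below only demonstrates that inheritance mechanism and is not vLLM code.

```python
import os
import subprocess
import sys

# Setting the variable in the parent has the same effect as
# `export NCCL_P2P_DISABLE=1` in the shell: children (e.g. vLLM's
# tensor-parallel workers) inherit it, and NCCL then skips the
# peer-to-peer transport that can hang on GPUs without P2P support.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Verify that a child process sees the variable.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['NCCL_P2P_DISABLE'])"],
    capture_output=True, text=True, check=True,
)
print(child.stdout.strip())  # → 1
```

The equivalent shell form is simply to run the `export` line above in the same shell session before starting the api_server.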

@BilalKHA95

@BilalKHA95 try this

export NCCL_P2P_DISABLE=1

This worked for me.

Thanks!!! It's working now: this env variable plus updating the CUDA toolkit to 12.3.

@Palmik

Palmik commented Mar 16, 2024

export NCCL_P2P_DISABLE=1

This also solved this issue for me.

@emersonium

@BilalKHA95 try this
export NCCL_P2P_DISABLE=1
This worked for me.

Thanks!!! It's working now: this env variable plus updating the CUDA toolkit to 12.3.

Hi!
Does this result in higher tokens/second for you (for a small model like mistralai/Mistral-7B-Instruct-v0.2 with --tensor-parallel-size 4)? Thanks!

@SuperBruceJia

This didn't work for me:

export NCCL_P2P_DISABLE=1

Are there any solutions?

Thank you guys very much in advance!

Best regards,

Shuyue
June 9th, 2024

@DarkLight1337
Member

We have added documentation for this situation in #5430. Please take a look.

10 participants