Tensor parallelism on ray cluster #1566
Solved: my particular issue (not necessarily the OP's) was that the Ray cluster running locally on the single node (because we're not doing distributed inference) didn't have enough memory.
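For anyone hitting the same symptom, here is a minimal sketch of starting the local Ray instance yourself with an explicit memory budget before vLLM auto-initializes it; the sizes and model name are illustrative assumptions, not taken from the comment above.

```python
import ray
from vllm import LLM

# Start the local Ray instance ourselves with an explicit object-store budget
# before vLLM calls ray.init() internally; vLLM will reuse this instance.
ray.init(object_store_memory=8 * 1024**3)  # 8 GiB; adjust to the node

# Tensor parallelism across 2 local GPUs; the model here is just a placeholder.
llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
print(llm.generate(["Hello"])[0].outputs[0].text)
```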
I don't know Ray well enough to understand what this does, lol. I am unaffiliated with the OP, but I believe we are having the same issue. We're using Kubernetes to deploy a model on a single g4.12xlarge instance (4 GPUs). We cannot use a newer model class for various reasons. To troubleshoot, I've chosen a small model that runs easily on a single GPU.
This is overkill, but as you can see we're making 4 GPUs available to the container despite only running on one of them. I've also confirmed, by shelling into the container and running PyTorch commands, that it does have 4 GPUs accessible. When the flag is set to 2 or more, we get a long stack trace; the relevant portion is shown below. Do I have to manually start the Ray cluster, or set other environment variables, so that it is up and healthy when the Docker container starts? Or does…
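For reference, this is the kind of quick PyTorch check described above for confirming GPU visibility inside the container (a generic snippet, not the poster's exact commands):

```python
import torch

# Confirm the container actually sees all four GPUs before involving vLLM/Ray.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```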
Same, any solution please?
Hit the same issue.
#1058 (comment) could be related.
Here is the finding for my case: when I submit a remote job, it claims GPUs. For the following code, it takes 1 GPU.
When vLLM runs tensor parallelism, it creates its own GPU workers through Ray. However, the GPUs are unavailable (already claimed by the remote job), so the job eventually times out. Is there a way to use the GPUs assigned to the remote job?
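The snippet referenced above isn't preserved in the thread; here is a minimal sketch of the pattern being described, assuming a task decorated with num_gpus=1 that then constructs a tensor-parallel vLLM engine (the model name is illustrative):

```python
import ray
from vllm import LLM

ray.init(address="auto")  # attach to the existing Ray cluster

# The remote task claims 1 GPU for itself...
@ray.remote(num_gpus=1)
def run_inference(prompt: str) -> str:
    # ...but with tensor_parallel_size > 1 vLLM asks Ray for its *own* GPU
    # workers. Those GPUs are already claimed by this task, so the placement
    # can never be satisfied and startup eventually times out.
    llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
    return llm.generate([prompt])[0].outputs[0].text

print(ray.get(run_inference.remote("Hello")))
```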
It’s the same as my finding in #1058 (comment). I used custom resources to work around it. Ideally vLLM should have a way to pass in already assigned logical resources.
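For context, here is a rough sketch of what such a custom-resource workaround could look like; the resource name, counts, and model are assumptions rather than the actual code from #1058. The driver task reserves a custom resource instead of a GPU, leaving all physical GPUs free for the workers vLLM spawns through Ray.

```python
import ray
from vllm import LLM

# The node is started with a matching custom resource, e.g.:
#   ray start --head --num-gpus=4 --resources='{"vllm_driver": 1}'
ray.init(address="auto")

# Reserve the custom resource rather than a GPU, so all GPUs remain
# schedulable for the tensor-parallel workers vLLM creates internally.
@ray.remote(resources={"vllm_driver": 1})
def run_inference(prompt: str) -> str:
    llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
    return llm.generate([prompt])[0].outputs[0].text

print(ray.get(run_inference.remote("Hello")))
```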
I am also running into the same issue on Red Hat OpenShift.
llm = VLLM(model="meta-llama/Llama-2-13b-chat-hf", ...
Startup hangs here:
INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
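The call above is truncated in the thread. Below is a minimal sketch of what a LangChain VLLM call of this shape typically looks like; every argument beyond the model name is an assumption, not the poster's actual code.

```python
from langchain.llms import VLLM

llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=2,   # assumed: this is what triggers the Ray startup
    trust_remote_code=True,
    max_new_tokens=256,
)
print(llm("What is tensor parallelism?"))
```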
I found issue 31897 over on the Ray Serve repo, which looks similar.
I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed two things done.
Look for line 83: ray.init(address=ray_address, ignore_reinit_error=True)
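The full details of the two changes are not preserved in this thread beyond the pointer to the ray.init call above. Purely as an illustration of the kind of adjustment being described (keeping Ray's auto-detected resources in line with what the pod is actually allowed to use on very large nodes), an edited local initialization might look like the following; the value is an assumption.

```python
import ray

# Illustration only -- not the actual fix from the comment above.
# When Ray starts a *local* instance it auto-detects every CPU on the host,
# which on a large OpenShift node can far exceed the pod's cgroup limits.
# Capping it explicitly keeps the local instance within the pod's allocation
# (num_cpus cannot be passed when connecting to an existing cluster address).
ray.init(ignore_reinit_error=True, num_cpus=16)
```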
Package list (truncated in this thread): adal 1.2.7, …
I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM worker to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.
How? Can you post a minimal example, please?
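Not the poster's setup, but a rough sketch of the KubeRay flow being suggested: create a RayCluster with the KubeRay operator, then submit the vLLM driver as a Ray job against the cluster's dashboard service. The service name, script name, and arguments are assumptions.

```python
from ray.job_submission import JobSubmissionClient

# Dashboard service exposed by a KubeRay-managed RayCluster (name assumed).
client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    # Hypothetical driver script that builds vllm.LLM(..., tensor_parallel_size=2).
    entrypoint="python serve_vllm.py --tensor-parallel-size 2",
    runtime_env={"pip": ["vllm"]},
)
print("Submitted:", job_id)
print(client.get_job_status(job_id))
```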
It was a bug in NVIDIA CUDA driver 545.
This actually solved my problem: running vLLM with TP via Ray within a container provisioned via OpenShift. I can share more details if needed. Thanks @ernestol0817!
@nelsonspbr can you please post your example of vLLM with TP via Ray within a container provisioned via OpenShift? I'm really interested!
Would appreciate a working example. I'm having difficulties running more than one tensor parallel Ray Serve application. I suspect it has something to do with vLLM initializing Ray / altering placement groups within each application. |
I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
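For completeness, here is a minimal sketch of the setup being attempted, assuming the script runs on a node that has already joined the Ray cluster (the prompt and sampling values are illustrative):

```python
import ray
from vllm import LLM, SamplingParams

# Attach to the existing multi-node Ray cluster instead of starting a new one.
ray.init(address="auto")

# With tensor_parallel_size=2, vLLM requests two GPU workers from Ray and
# shards the model across them; this is the call that hangs for the posters above.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```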