Tensor parallelism on ray cluster #1566
Solved: my particular issue (not necessarily the OP's) was that the Ray cluster running locally on the single node (because we're not doing distributed inference) didn't have enough memory.
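For anyone hitting the same symptom, here is a minimal sketch of starting the local Ray instance yourself with an explicit memory budget before vLLM auto-initializes it; the sizes and model name are illustrative assumptions, not taken from the comment above.

```python
import ray
from vllm import LLM

# Start the local Ray instance ourselves with an explicit object-store budget
# before vLLM calls ray.init() internally; vLLM will reuse this instance.
ray.init(object_store_memory=8 * 1024**3)  # 8 GiB; adjust to the node

# Tensor parallelism across 2 local GPUs; the model here is just a placeholder.
llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
print(llm.generate(["Hello"])[0].outputs[0].text)
```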
I don't know Ray well enough to understand what this does, lol. I am unaffiliated with the OP, but I believe we are having the same issue. We're using Kubernetes to deploy a model on a single g4.12xlarge instance (4 GPUs). We cannot use a newer model class for various reasons. To troubleshoot, I've chosen a small model that runs easily on a single GPU.
This is overkill, but as you can see we're making 4 GPUs available to the container despite only running on one of them. I've also confirmed, by shelling into the container and running PyTorch commands, that it does have 4 GPUs accessible. When the flag is set to 2 or more, we get a long stack trace; the relevant portion is shown below. Do I have to manually start the Ray cluster, or set other environment variables, so that it is up and healthy when the Docker container starts? Or does…
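For reference, this is the kind of quick PyTorch check described above for confirming GPU visibility inside the container (a generic snippet, not the poster's exact commands):

```python
import torch

# Confirm the container actually sees all four GPUs before involving vLLM/Ray.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```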
Same, any solution please?
Hit the same issue.
#1058 (comment) could be related.
Here is the finding for my case: when I submit a remote job, it claims GPUs. For the following code, it takes 1 GPU.
When vLLM runs tensor parallelism, it creates its own GPU workers through Ray. However, the GPUs are unavailable (already claimed by the remote job), so the job eventually times out. Is there a way to use the GPUs assigned to the remote job?
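The snippet referenced above isn't preserved in the thread; here is a minimal sketch of the pattern being described, assuming a task decorated with num_gpus=1 that then constructs a tensor-parallel vLLM engine (the model name is illustrative):

```python
import ray
from vllm import LLM

ray.init(address="auto")  # attach to the existing Ray cluster

# The remote task claims 1 GPU for itself...
@ray.remote(num_gpus=1)
def run_inference(prompt: str) -> str:
    # ...but with tensor_parallel_size > 1 vLLM asks Ray for its *own* GPU
    # workers. Those GPUs are already claimed by this task, so the placement
    # can never be satisfied and startup eventually times out.
    llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
    return llm.generate([prompt])[0].outputs[0].text

print(ray.get(run_inference.remote("Hello")))
```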
It’s the same as my finding in #1058 (comment). I used custom resources to work around it. Ideally vLLM should have a way to pass in already assigned logical resources.
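For context, here is a rough sketch of what such a custom-resource workaround could look like; the resource name, counts, and model are assumptions rather than the actual code from #1058. The driver task reserves a custom resource instead of a GPU, leaving all physical GPUs free for the workers vLLM spawns through Ray.

```python
import ray
from vllm import LLM

# The node is started with a matching custom resource, e.g.:
#   ray start --head --num-gpus=4 --resources='{"vllm_driver": 1}'
ray.init(address="auto")

# Reserve the custom resource rather than a GPU, so all GPUs remain
# schedulable for the tensor-parallel workers vLLM creates internally.
@ray.remote(resources={"vllm_driver": 1})
def run_inference(prompt: str) -> str:
    llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)
    return llm.generate([prompt])[0].outputs[0].text

print(ray.get(run_inference.remote("Hello")))
```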
I am also running into the same issue on Red Hat OpenShift.
llm = VLLM(model="meta-llama/Llama-2-13b-chat-hf", ...
Startup hangs here:
INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
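The call above is truncated in the thread. Below is a minimal sketch of what a LangChain VLLM call of this shape typically looks like; every argument beyond the model name is an assumption, not the poster's actual code.

```python
from langchain.llms import VLLM

llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=2,   # assumed: this is what triggers the Ray startup
    trust_remote_code=True,
    max_new_tokens=256,
)
print(llm("What is tensor parallelism?"))
```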
I found issue 31897 over on the Ray Serve repo, which looks similar.
I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed two things done.
Look for line 83: ray.init(address=ray_address, ignore_reinit_error=True)
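The full details of the two changes are not preserved in this thread beyond the pointer to the ray.init call above. Purely as an illustration of the kind of adjustment being described (keeping Ray's auto-detected resources in line with what the pod is actually allowed to use on very large nodes), an edited local initialization might look like the following; the value is an assumption.

```python
import ray

# Illustration only -- not the actual fix from the comment above.
# When Ray starts a *local* instance it auto-detects every CPU on the host,
# which on a large OpenShift node can far exceed the pod's cgroup limits.
# Capping it explicitly keeps the local instance within the pod's allocation
# (num_cpus cannot be passed when connecting to an existing cluster address).
ray.init(ignore_reinit_error=True, num_cpus=16)
```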
Package list (truncated in this thread): adal 1.2.7, …
I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM worker to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.
How? Can you post a minimal example, please?
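Not the poster's setup, but a rough sketch of the KubeRay flow being suggested: create a RayCluster with the KubeRay operator, then submit the vLLM driver as a Ray job against the cluster's dashboard service. The service name, script name, and arguments are assumptions.

```python
from ray.job_submission import JobSubmissionClient

# Dashboard service exposed by a KubeRay-managed RayCluster (name assumed).
client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    # Hypothetical driver script that builds vllm.LLM(..., tensor_parallel_size=2).
    entrypoint="python serve_vllm.py --tensor-parallel-size 2",
    runtime_env={"pip": ["vllm"]},
)
print("Submitted:", job_id)
print(client.get_job_status(job_id))
```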
It was a bug in NVIDIA CUDA driver 545.
This actually solved my problem: running vLLM with TP via Ray within a container provisioned via OpenShift. I can share more details if needed. Thanks @ernestol0817!
@nelsonspbr can you please post your example of vLLM with TP via Ray within a container provisioned via OpenShift? I'm really interested!
Would appreciate a working example. I'm having difficulties running more than one tensor parallel Ray Serve application. I suspect it has something to do with vLLM initializing Ray / altering placement groups within each application. |
I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
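For completeness, here is a minimal sketch of the setup being attempted, assuming the script runs on a node that has already joined the Ray cluster (the prompt and sampling values are illustrative):

```python
import ray
from vllm import LLM, SamplingParams

# Attach to the existing multi-node Ray cluster instead of starting a new one.
ray.init(address="auto")

# With tensor_parallel_size=2, vLLM requests two GPU workers from Ray and
# shards the model across them; this is the call that hangs for the posters above.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```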