
Tensor parallelism on ray cluster #1566

Closed

baojunliu opened this issue Nov 5, 2023 · 18 comments

@baojunliu commented Nov 5, 2023

I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
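
For reference, a minimal sketch of the kind of call being described; the model name and prompt are placeholders, not the OP's exact setup:

from vllm import LLM, SamplingParams

# With tensor_parallel_size > 1, vLLM shards the model across GPUs and, in
# this version, relies on Ray to manage the distributed workers.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)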

@hughesadam87 commented Nov 6, 2023

Solved: my particular issue (not necessarily the OP's) was that the Ray cluster running locally on the single node (since we're not doing distributed inference) didn't have enough shared memory. Mounting a larger /dev/shm into the pod fixed it:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: myimage
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: "10.24Gi"

I don't know Ray well enough to understand exactly what this does, but presumably Ray's workers use /dev/shm for shared memory and the container default is too small.


I am unaffiliated with the OP, but I believe we are having the same issue. We're using Kubernetes to deploy a model on a single g4.12xlarge instance (4 GPUs). We cannot use a newer model class for various reasons. To troubleshoot, I've chosen a small model that runs easily on a single GPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        workload-type: gpu
      containers:
        - name: vllm-container
          image: lmisam/vllm-openai:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
              name: api
          resources:
            limits:
              nvidia.com/gpu: "4"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3" # ensures the container app can see all 4 GPUs

          args: [ "--model", "TheBloke/OpenHermes-2-Mistral-7B-AWQ",
                  "--quantization", "awq",
                  "--dtype", "float16",
                  "--max-model-len", "4096",
                  "--tensor-parallel-size", "1"]`   #<---- Important

This is overkill, but as you can see we're making 4 GPUs available to the container despite only running on one of them. I've also confirmed, by shelling into the container and running PyTorch commands, that it does have all 4 GPUs accessible.

When --tensor-parallel-size=1 or the flag is not included, the model works just fine.

When the flag is set to 2 or more, we get a long stack trace; the relevant portion is shown below.

Do I have to manually start the Ray cluster, or set any other environment variables, so that it is up and healthy when the Docker container starts? Or does vLLM abstract this entirely and there's nothing else to do on our end?

    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 181, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 704, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: 1f4e0ddffc9942a1e34140a601000000
    pid: 1914
    namespace: 0c0e14ce-0a35-4190-ad47-f0a1959e7fe4
    ip: 172.31.15.88
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,238    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffedad1d4643afe66a344639bf01000000 Worker ID: 9cd611fa721b93a5d424da4383573ab345580a0a19f9301512aa384f Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 38939 Worker PID: 1916 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,272    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff45041ffe54799af4388ab20401000000 Worker ID: 708f3407d01c3576315da0569ada9fcd984f754e43417718e04ca93b Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 46579 Worker PID: 1915 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,316    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff1252d098d6462eb8cd537bd601000000 Worker ID: acea4bee761db0a4bff1adb3cc43d1e3ba1793d3017c3fc22e18e6d7 Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 45813 Worker PID: 1913 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1916) *** SIGBUS received at time=1699285494 on cpu 2 *** [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorker pid=1916) PC: @     0x7fcfb18c6c3a  (unknown)  (unknown) [repeated 3x across cluster]
(RayWorker pid=1916)     @     0x7fcfb1759520       3392  (unknown) [repeated 3x across cluster]
(RayWorker pid=1916)     @ 0x32702d6c63636e2f  (unknown)  (unknown) [repeated 3x across cluster]
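
For anyone debugging a similar setup, a quick generic check (not taken from this report) of whether the Ray instance inside the container actually sees the GPUs:

import ray

# vLLM calls ray.init() itself when tensor_parallel_size > 1; connecting to
# (or starting) a local instance lets you inspect what resources Ray sees.
ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # should report GPU: 4.0 on this node
print(ray.available_resources())  # shows what is still unclaimed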

@JenniePing commented

I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?

Same here, any solution please?

@qizzzh commented Dec 8, 2023

Hit the same issue

@qizzzh commented Dec 8, 2023

#1058 (comment) could be related

@baojunliu (Author) commented

Here is the finding for my case:

When I submit the remote job, it claims GPUs. With the following code, it takes 1 GPU:

@ray.remote(num_gpus=1)
class MyClass:
    ...

When vLLM runs tensor parallelism, it tries to create its own GPU workers. However, the GPU is already claimed and unavailable, so the job eventually times out.

Is there a way to use the GPUs already assigned to the remote job?

@qizzzh commented Jan 5, 2024

It’s the same as my finding in #1058 (comment). I used custom resources to work around it. Ideally vLLM should have a way to pass in already assigned logical resources.
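
A rough sketch of the kind of workaround described above: schedule the wrapper actor on a custom (non-GPU) logical resource so it doesn't claim the GPUs that vLLM's own Ray workers need. The resource name, model, and counts here are illustrative, not taken from the linked comment:

import ray
from vllm import LLM

# Assumes the node was started with a custom logical resource, e.g.:
#   ray start --head --resources='{"vllm_driver": 1}'
ray.init(address="auto")

@ray.remote(num_gpus=0, resources={"vllm_driver": 1})
class VLLMActor:
    def __init__(self):
        # The actor itself reserves no GPU, so vLLM's tensor-parallel
        # workers are free to claim the node's GPUs.
        self.llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

    def generate(self, prompt):
        return self.llm.generate([prompt])

actor = VLLMActor.remote()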

@ernestol0817 commented

I am also running into the same issue on Red Hat OpenShift.

import torch
torch.cuda.is_available()   # True
torch.cuda.device_count()   # 2

from langchain_community.llms import VLLM  # import assumed; LangChain's vLLM wrapper

llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    trust_remote_code=True,
    max_new_tokens=50,
    temperature=0.1,
    tensor_parallel_size=2,
)

Startup hangs here:

INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

@ernestol0817 commented

I found issue 31897 over on the Ray repo; it looks similar:
ray-project/ray#31897

@ernestol0817 commented Jan 26, 2024

I wanted to update this thread as I've found a resolution to this issue, and it might be good to include this in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed 2 things done.

  1. I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:

     --> ray.init(address=ray_address, ignore_reinit_error=True)

     Modify this to:

     --> ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)

  2. Pay very close attention to your environment, including the Python version and installed libraries. Here is what I'm running now, and it's working:

Package Version


adal 1.2.7
aiohttp 3.8.6
aiohttp-cors 0.7.0
aioprometheus 23.12.0
aiorwlock 1.3.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 3.7.1
applicationinsights 0.11.10
archspec 0.2.1
argcomplete 1.12.3
async-timeout 4.0.3
attrs 21.4.0
azure-cli-core 2.40.0
azure-cli-telemetry 1.0.8
azure-common 1.1.28
azure-core 1.29.6
azure-identity 1.10.0
azure-mgmt-compute 23.1.0
azure-mgmt-core 1.4.0
azure-mgmt-network 19.0.0
azure-mgmt-resource 20.0.0
backoff 1.10.0
bcrypt 4.1.2
blessed 1.20.0
boltons 23.0.0
boto3 1.26.76
botocore 1.29.165
Brotli 1.0.9
cachetools 5.3.2
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 2.2.0
colorful 0.5.5
commonmark 0.9.1
conda 23.11.0
conda-content-trust 0.2.0
conda-libmamba-solver 23.12.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
cryptography 38.0.1
Cython 0.29.32
dataclasses-json 0.6.3
distlib 0.3.7
distro 1.8.0
dm-tree 0.1.8
exceptiongroup 1.2.0
Farama-Notifications 0.0.4
fastapi 0.104.0
filelock 3.13.1
flatbuffers 23.5.26
frozenlist 1.4.0
fsspec 2023.5.0
google-api-core 2.14.0
google-api-python-client 1.7.8
google-auth 2.23.4
google-auth-httplib2 0.2.0
google-oauth 1.0.1
googleapis-common-protos 1.61.0
gptcache 0.1.43
gpustat 1.1.1
greenlet 3.0.3
grpcio 1.59.3
gymnasium 0.28.1
h11 0.14.0
httpcore 1.0.2
httplib2 0.22.0
httptools 0.6.1
httpx 0.26.0
huggingface-hub 0.20.3
humanfriendly 10.0
idna 3.6
imageio 2.31.1
isodate 0.6.1
jax-jumpy 1.0.0
Jinja2 3.1.3
jmespath 1.0.1
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.17.3
knack 0.10.1
langchain 0.1.4
langchain-community 0.0.16
langchain-core 0.1.16
langsmith 0.0.83
lazy_loader 0.3
libmambapy 1.5.3
lz4 4.3.2
MarkupSafe 2.1.4
marshmallow 3.20.2
menuinst 2.0.1
mpmath 1.3.0
msal 1.18.0b1
msal-extensions 1.0.0
msgpack 1.0.7
msrest 0.7.1
msrestazure 0.6.4
multidict 6.0.4
mypy-extensions 1.0.0
networkx 3.1
ninja 1.11.1.1
numpy 1.26.3
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.535.133
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
openai 1.10.0
opencensus 0.11.3
opencensus-context 0.1.3
opentelemetry-api 1.1.0
opentelemetry-exporter-otlp 1.1.0
opentelemetry-exporter-otlp-proto-grpc 1.1.0
opentelemetry-proto 1.1.0
opentelemetry-sdk 1.1.0
opentelemetry-semantic-conventions 0.20b0
orjson 3.9.12
packaging 23.2
pandas 1.5.3
paramiko 2.12.0
Pillow 9.2.0
pip 23.3.1
pkginfo 1.9.6
platformdirs 3.11.0
pluggy 1.0.0
portalocker 2.8.2
prometheus-client 0.19.0
protobuf 3.19.6
psutil 5.9.6
py-spy 0.3.14
pyarrow 12.0.1
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycosat 0.6.6
pycparser 2.21
pydantic 1.10.13
pydantic_core 2.14.1
Pygments 2.13.0
PyJWT 2.8.0
PyNaCl 1.5.0
pyOpenSSL 22.1.0
pyparsing 3.1.1
pyrsistent 0.20.0
PySocks 1.7.1
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2022.7.1
PyWavelets 1.4.1
PyYAML 6.0.1
quantile-python 1.1
ray 2.9.1
ray-cpp 2.9.1
redis 3.5.3
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
rich 12.6.0
rsa 4.7.2
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
s3transfer 0.6.2
safetensors 0.4.2
scikit-image 0.21.0
scipy 1.10.1
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
smart-open 6.2.0
sniffio 1.3.0
SQLAlchemy 2.0.25
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.3
tensorboardX 2.6
tifffile 2023.7.10
tokenizers 0.15.1
torch 2.1.2
tqdm 4.65.0
transformers 4.37.1
triton 2.1.0
truststore 0.8.0
typer 0.9.0
typing_extensions 4.8.0
typing-inspect 0.9.0
uritemplate 3.0.1
urllib3 1.26.18
uvicorn 0.22.0
uvloop 0.19.0
virtualenv 20.21.0
vllm 0.2.7
watchfiles 0.19.0
wcwidth 0.2.12
websockets 11.0.3
wheel 0.41.2
xformers 0.0.23.post1
yarl 1.9.3
zstandard 0.19.0

@DN-Dev00 commented

Same issue.

@Jeffwan (Contributor) commented Apr 5, 2024

I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM worker to it. That's the easiest approach I've found, and KubeRay will reduce your chances of running into cluster issues.

@RomanKoshkin commented

I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM worker to it. That's the easiest approach I've found, and KubeRay will reduce your chances of running into cluster issues.

How? Can you post a minimal example, please?
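
Not from the thread, but a rough sketch of the pattern being suggested: once KubeRay has a RayCluster running, submit the vLLM server as a Ray job against the head node's dashboard address. The service name, model, and GPU count below are placeholders:

from ray.job_submission import JobSubmissionClient

# Dashboard address of the KubeRay head service (placeholder name/port).
client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    # Run the OpenAI-compatible vLLM server sharded across 2 GPUs via Ray.
    entrypoint=(
        "python -m vllm.entrypoints.openai.api_server "
        "--model meta-llama/Llama-2-13b-chat-hf --tensor-parallel-size 2"
    ),
    runtime_env={"pip": ["vllm"]},
)
print("submitted:", job_id)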

@refill-dn commented May 6, 2024

It turned out to be a bug in the NVIDIA CUDA 545 driver. I upgraded to the 550 beta driver (now the stable 550 release) and that solved it.
(I am DN-Dev00.)

@hmellor closed this as completed May 31, 2024
@nelsonspbr commented

I wanted to update this thread as I've found a resolution to this issue, and it might be good to include this in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed 2 things done.

  1. I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:

     --> ray.init(address=ray_address, ignore_reinit_error=True)

     Modify this to:

     --> ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)

This actually solved my problem: running vLLM with TP via Ray within a container provisioned via OpenShift. I can share more details if needed. Thanks @ernestol0817!

@youkaichao (Member) commented

Can you please reply to #4462 and #5360? Some OpenShift users are suffering there.


@rcarrata commented

@nelsonspbr can you please post your example of vLLM with TP via Ray within a container provisioned via OpenShift? I'm really interested!
Also, does this involve creating a RayCluster with the vLLM image in one of the Ray workers in the workerGroupSpecs?

@jbohnslav commented

I would appreciate a working example. I'm having difficulty running more than one tensor-parallel Ray Serve application; I suspect it has something to do with vLLM initializing Ray and altering placement groups within each application.
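
As a starting point for that kind of debugging, a small sketch (assuming access to the Ray Python API on the cluster) that lists every placement group, which can help spot two applications contending for the same GPUs:

import ray
from ray.util import placement_group_table

ray.init(address="auto")

# Dumps every placement group in the cluster (bundles, strategy, and state);
# a group stuck in PENDING is a sign of resource contention.
print(placement_group_table())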
