
GPU has fallen off the bus running in docker with more than one GPU #1195

Closed
nylocx opened this issue Dec 12, 2023 · 6 comments


nylocx commented Dec 12, 2023

Every time I try to run the Docker version of h2ogpt on multiple GPUs using the official Docker image, my system crashes with the error that the GPU has fallen off the bus (it's always the primary GPU that falls off the bus).

At first I thought this was a hardware issue, but I now have two systems that show exactly the same behavior. The system specs:

GPUs:
lspci | grep VGA
17:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)

CPU:
Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz

Driver:
==============NVSMI LOG==============

Timestamp                                 : Tue Dec 12 09:21:46 2023
Driver Version                            : 545.29.06
CUDA Version                              : 12.3

Attached GPUs                             : 2
GPU 00000000:17:00.0
    Product Name                          : NVIDIA GeForce RTX 4090

Docker:
Docker version 24.0.7, build afdd53b4e3
...

The second system:

GPU (4x 4090 RTX):
lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
2c:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
61:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)

CPU:
AMD Ryzen Threadripper PRO 5995WX 64-Cores

Driver:
==============NVSMI LOG==============

Timestamp                                 : Tue Dec 12 08:24:11 2023
Driver Version                            : 545.23.08
CUDA Version                              : 12.3

Attached GPUs                             : 4
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 4090

Docker:
Docker version 24.0.7, build afdd53b

I run with the following simple compose file:

services:
  h2ogpt:
    image: ${H2OGPT_RUNTIME}
    restart: always
    shm_size: '2gb'
    ports:
      - '${H2OGPT_PORT}:7860'
    volumes:
      - cache:/workspace/.cache
      - save:/workspace/save
    command: ${H2OGPT_ARGS}
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0', '1']
            capabilities: [gpu]

and vars:

H2OGPT_PORT=7860
H2OGPT_RUNTIME=gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0
H2OGPT_BASE_MODEL=HuggingFaceH4/zephyr-7b-beta
H2OGPT_ARGS="/workspace/generate.py --base_model=${H2OGPT_BASE_MODEL} --use_safetensors=True --prompt_type=zephyr --save_dir=/workspace/save/ --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024 --verbose --debug"
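Expanded by hand, the compose service plus these vars should boil down to roughly the following plain `docker run` invocation (a sketch, not verified against the image; the `run` wrapper defaults to just printing the command, set `DRY_RUN=0` to actually execute it):

```shell
# Rough docker run equivalent of the compose service above (a sketch).
# DRY_RUN defaults to 1, which prints the command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run docker run --rm --shm-size 2gb -p 7860:7860 \
  -v cache:/workspace/.cache -v save:/workspace/save \
  --gpus '"device=0,1"' \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 \
  /workspace/generate.py --base_model=HuggingFaceH4/zephyr-7b-beta \
  --use_safetensors=True --prompt_type=zephyr --save_dir=/workspace/save/ \
  --use_gpu_id=False --score_model=None
```

The `--gpus '"device=0,1"'` part is the CLI counterpart of the `device_ids: ['0', '1']` reservation in the compose file.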

The crash happens while the models are loading. I benchmarked the cards with high load, high throughput, and everything else I could think of, and they are rock solid, so I am running out of ideas.
I tried running with vLLM so I don't need two cards for the h2ogpt image, but that failed too, I guess due to the 128 cores:
vllm-project/vllm#1058

If anyone has good ideas what to do, please let me know. The next thing I will try is to take Docker out of the equation and run h2ogpt on bare metal on the 4-GPU system.

Kind regards,
Alex


nylocx commented Dec 12, 2023

I just checked on bare metal and I get the same system crash after following the Linux install instructions.

@pseudotensor
Collaborator

Hi @nylocx, what is your bare metal command? And what is the error? I don't know what "fallen off the bus" means. Thanks.


nylocx commented Dec 13, 2023

Hi @pseudotensor I used:
generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --prompt_type=zephyr --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024 --verbose --debug

And the error comes from the system, because the PCIe bus lost the connection to the GPU. More information here:
https://askubuntu.com/questions/868321/gpu-has-fallen-off-the-bus-nvidia

Most of the time this error hints at a hardware defect or ASPM-related issues (I have completely disabled everything related to power saving and also used NVIDIA's persistence daemon). But since it has now happened on two different systems, every time I start h2ogpt on more than one GPU with --use_gpu_id=False, on cards that otherwise work flawlessly, I am pretty sure this has to be a driver incompatibility or something triggered by that option. However, since I can't get any debug output from the Python code, and the system crashes completely once the PCIe bus fails after the GPU drops out, I don't know how to debug this further.
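For anyone trying to reproduce this, the checks described above can be sketched roughly as follows (assumes the standard NVIDIA driver tools are installed; some steps need root):

```shell
# Sketch of the checks mentioned above (assumes NVIDIA driver tools installed).
# After a crash, look for the Xid / bus-drop event in the kernel log:
(dmesg 2>/dev/null || true) | grep -iE 'xid|fallen off the bus' \
  || echo "no GPU bus errors in accessible kernel log"

# Keep the driver initialized between jobs; this is the same effect the
# persistence daemon provides, via persistence mode:
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -pm 1
else
  echo "nvidia-smi not found"
fi

# ASPM can be disabled at boot by adding pcie_aspm=off to the kernel cmdline.
```

None of this fixed the crash for me, but it rules out the usual power-management suspects before blaming the driver.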

@pseudotensor
Collaborator

I unfortunately have no good ideas.


nylocx commented Dec 19, 2023

Too bad. If anyone gets h2ogpt working with multiple RTX 4090s, please comment here and let me know what the trick was ;).

@pseudotensor
Collaborator

FYI, in my case I have 4x A6000 and no issues. I think @arnocandel has multiple 4090s; he can comment if he has time.
