
GPU has fallen off the bus running in docker with more than one GPU #1195

Closed
nylocx opened this issue Dec 12, 2023 · 6 comments


nylocx commented Dec 12, 2023

Every time I try to run the Docker version of h2ogpt on multiple GPUs using the official Docker image, my system crashes with the error that the GPU has fallen off the bus (it's always the primary GPU that falls off the bus).

At first I thought this was a hardware issue, but I now have two systems that show exactly the same behavior. The system specs:

GPUs:
lspci | grep VGA
17:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)

CPU:
Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz

Driver:
==============NVSMI LOG==============

Timestamp                                 : Tue Dec 12 09:21:46 2023
Driver Version                            : 545.29.06
CUDA Version                              : 12.3

Attached GPUs                             : 2
GPU 00000000:17:00.0
    Product Name                          : NVIDIA GeForce RTX 4090

Docker:
Docker version 24.0.7, build afdd53b4e3
...

The second system:

GPU (4x 4090 RTX):
lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
2c:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
61:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)

CPU:
AMD Ryzen Threadripper PRO 5995WX 64-Cores

Driver:
==============NVSMI LOG==============

Timestamp                                 : Tue Dec 12 08:24:11 2023
Driver Version                            : 545.23.08
CUDA Version                              : 12.3

Attached GPUs                             : 4
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 4090

Docker:
Docker version 24.0.7, build afdd53b

I run with the following simple compose file:

services:
  h2ogpt:
    image: ${H2OGPT_RUNTIME}
    restart: always
    shm_size: '2gb'
    ports:
      - '${H2OGPT_PORT}:7860'
    volumes:
      - cache:/workspace/.cache
      - save:/workspace/save
    command: ${H2OGPT_ARGS}
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0', '1']
            capabilities: [gpu]

and vars:

H2OGPT_PORT=7860
H2OGPT_RUNTIME=gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0
H2OGPT_BASE_MODEL=HuggingFaceH4/zephyr-7b-beta
H2OGPT_ARGS="/workspace/generate.py --base_model=${H2OGPT_BASE_MODEL} --use_safetensors=True --prompt_type=zephyr --save_dir=/workspace/save/ --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024 --verbose --debug"
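Expanded by hand, the compose service plus these vars should boil down to roughly the following plain `docker run` invocation (a sketch, not verified against the image; the `run` wrapper defaults to just printing the command, set `DRY_RUN=0` to actually execute it):

```shell
# Rough docker run equivalent of the compose service above (a sketch).
# DRY_RUN defaults to 1, which prints the command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run docker run --rm --shm-size 2gb -p 7860:7860 \
  -v cache:/workspace/.cache -v save:/workspace/save \
  --gpus '"device=0,1"' \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 \
  /workspace/generate.py --base_model=HuggingFaceH4/zephyr-7b-beta \
  --use_safetensors=True --prompt_type=zephyr --save_dir=/workspace/save/ \
  --use_gpu_id=False --score_model=None
```

The `--gpus '"device=0,1"'` part is the CLI counterpart of the `device_ids: ['0', '1']` reservation in the compose file.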

The crash happens while the models are loading. I benchmarked the cards with high load, high throughput, and everything else I could think of, and they are rock solid, so I am running out of ideas.
I tried running with vLLM so I don't need two cards for the h2ogpt image, but that failed too, I guess due to the 128 cores:
vllm-project/vllm#1058

If anyone has good ideas what to do, please let me know. The next thing I will try is to take Docker out of the equation and run h2ogpt on bare metal on the 4-GPU system.

Kind regards,
Alex


nylocx commented Dec 12, 2023

I just checked on bare metal and I get the same system crash after following the Linux install instructions.

@pseudotensor
Collaborator

Hi @nylocx, what is your bare metal command? And what is the error? I don't know what "fallen off the bus" means. Thanks.


nylocx commented Dec 13, 2023

Hi @pseudotensor I used:
generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --prompt_type=zephyr --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024 --verbose --debug

And the error comes from the system, because the PCIe bus lost the connection to the GPU. More information here:
https://askubuntu.com/questions/868321/gpu-has-fallen-off-the-bus-nvidia

Most of the time this error hints at a hardware defect or ASPM-related issues (I have completely disabled everything related to power saving and also used NVIDIA's persistence daemon). But since it has now happened on two different systems, every time I start h2ogpt on more than one GPU with --use_gpu_id=False, on cards that otherwise work flawlessly, I am pretty sure this has to be a driver incompatibility or something triggered by that option. However, since I can't get any debug output from the Python code, and the system crashes completely once the PCIe bus fails after the GPU drops out, I don't know how to debug this further.
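For anyone trying to reproduce this, the checks described above can be sketched roughly as follows (assumes the standard NVIDIA driver tools are installed; some steps need root):

```shell
# Sketch of the checks mentioned above (assumes NVIDIA driver tools installed).
# After a crash, look for the Xid / bus-drop event in the kernel log:
(dmesg 2>/dev/null || true) | grep -iE 'xid|fallen off the bus' \
  || echo "no GPU bus errors in accessible kernel log"

# Keep the driver initialized between jobs; this is the same effect the
# persistence daemon provides, via persistence mode:
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -pm 1
else
  echo "nvidia-smi not found"
fi

# ASPM can be disabled at boot by adding pcie_aspm=off to the kernel cmdline.
```

None of this fixed the crash for me, but it rules out the usual power-management suspects before blaming the driver.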

@pseudotensor
Collaborator

I unfortunately have no good ideas.


nylocx commented Dec 19, 2023

Too bad. If anyone gets h2ogpt working with multiple RTX 4090s, please comment here and let me know what the trick was ;).

@pseudotensor
Collaborator

FYI, in my case I have 4x A6000 and no issues. I think @arnocandel has multiple 4090s; he can comment if he has time.
