GPU has fallen off the bus running in docker with more than one GPU #1195
Comments
I just checked on bare metal and I get the same system crash after following the Linux install instructions.
Hi @nylocx, what is your bare-metal command? And what is the error? I don't know what "fallen off the bus" means. Thanks.
Hi @pseudotensor, I used: And the error is reported by the system because the PCIe bus lost the connection to the GPU. Some more info is here: Most of the time this hints at a hardware defect or ASPM-related issues (I have completely disabled everything related to power saving and also used NVIDIA's persistence daemon), but this has now happened on two different systems, every time I start h2ogpt on more than one of the GPUs with
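For context, "fallen off the bus" is the NVIDIA driver's message when a GPU stops responding on the PCIe bus, and it lands in the kernel log. A short sketch of the checks mentioned above (standard NVIDIA/Linux commands; the ASPM kernel parameter is one common mitigation, not a confirmed fix):

```sh
# After a crash/reboot, the PCIe error shows up in the kernel log:
sudo dmesg | grep -i "fallen off the bus"

# Keep the driver loaded and the GPUs out of deep power states:
sudo nvidia-smi -pm 1                      # enable persistence mode
sudo systemctl enable --now nvidia-persistenced

# Disabling PCIe ASPM entirely requires a kernel parameter,
# e.g. adding pcie_aspm=off to the kernel command line.
```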
I unfortunately have no good ideas.
Too bad. If anyone gets h2ogpt working with multiple RTX 4090s, please comment here and let me know what the trick was ;).
FYI, in my case I have 4×A6000 and no issues. I think @arnocandel has multiple 4090s; he can comment if he has time.
Every time I try to run the docker version of h2ogpt on multiple GPUs using the official docker image, my system crashes with the error that the GPU has fallen off the bus (it's always the primary GPU that falls off the bus).
At first I thought this was a hardware issue, but I now have two systems that show exactly the same behavior. The system specs:
The second system:
I run with the following simple compose file:
and vars:
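For illustration, a minimal multi-GPU compose setup along these lines might look like the sketch below; the image tag, port, and variable values are assumptions, not the actual configuration from this report.

```yaml
services:
  h2ogpt:
    image: gcr.io/vorvan/h2oai/h2ogpt-runtime:latest   # hypothetical tag
    restart: unless-stopped
    ports:
      - "7860:7860"                 # assumed default web UI port
    environment:
      - CUDA_VISIBLE_DEVICES=0,1    # hypothetical: expose two GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```

The `deploy.resources.reservations.devices` block is the standard Compose way to hand NVIDIA GPUs to a container, equivalent to `docker run --gpus`.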
The crash happens while the models are loading. I benchmarked the cards with high load, high throughput, and everything else I could think of, and they are rock solid. So I am running out of ideas.
I tried running with vLLM so I don't need two cards for the h2ogpt image, but this failed, I guess due to the 128 cores:
vllm-project/vllm#1058
If someone has any good ideas about what to do, please let me know. The next thing I will try is to take Docker out of the equation and run h2ogpt on bare metal on the 4-GPU system.
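A sketch of such a bare-metal A/B test, assuming h2ogpt's documented `generate.py` entry point (the model name is illustrative):

```sh
# Single GPU first: if this is stable, the multi-GPU path is the trigger.
CUDA_VISIBLE_DEVICES=0 python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat

# Then two GPUs, watching `dmesg -w` in another terminal for the bus error.
CUDA_VISIBLE_DEVICES=0,1 python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat
```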
Kind regards,
Alex