Failed to initialize NVML: Unknown Error after calling systemctl daemon-reload
#251
Comments
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I found these logs during the systemd reload:
From the major and minor numbers of these devices, I found that they are the /dev/nvidia* devices. If I manually create these symlinks with the following steps, the problem disappears (see the sketch after this comment):
Furthermore, I find that runc resolves device paths via /dev/char/<major>:<minor>. So I wonder whether the NVIDIA toolkit should provide something like udev rules that trigger the kernel or systemd to create the /dev/char/* -> /dev/nvidia* symlinks?
Otherwise, is there a configuration file where we can explicitly set this?
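For reference, a minimal sketch of the symlink workaround described in the comment above, assuming the standard /dev/nvidia* character devices (the exact device list, and whether /dev/nvidia-caps exists, depends on the driver setup):

#!/bin/bash
# Sketch: recreate /dev/char/<major>:<minor> -> /dev/nvidia* symlinks so that the
# device nodes referenced by the device cgroup rules resolve again after a reload.
mkdir -p /dev/char
for dev in /dev/nvidia* /dev/nvidia-caps/*; do
    [ -c "$dev" ] || continue
    # stat prints the device major/minor in hex; convert to decimal for the link name
    major=$((16#$(stat -c '%t' "$dev")))
    minor=$((16#$(stat -c '%T' "$dev")))
    ln -sf "$dev" "/dev/char/${major}:${minor}"
done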
Hey, I have been experiencing this issue for a long time. I solved it by adding privileged mode to the container.
Thanks for the response. But I'm not able to set privileged mode, because I'm using this in Kubernetes and it would let users see all the GPUs.
I fixed this issue in our env (CentOS 8, systemd 239) perfectly with cgroup v2, for both docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I'm using cgroups v2 myself so I would be interested in hearing what you did @gengwg
Sure, here I wrote up the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your env. https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
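For reference, a rough sketch of the usual cgroup v1 to v2 switch on a RHEL/CentOS 8-style host (the linked gist may differ in its exact steps; on Debian/Ubuntu you would edit GRUB_CMDLINE_LINUX in /etc/default/grub instead):

# Sketch: enable the unified cgroup v2 hierarchy via the kernel command line, then reboot.
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot
# After the reboot, verify that the unified hierarchy is mounted:
stat -fc %T /sys/fs/cgroup    # should print "cgroup2fs"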
In that case, whatever trigger you're seeing apparently isn't the same as mine, since all your instructions do is switch from cgroups v1 to v2. I'm already on cgroups v2 here on Debian 11 (bullseye), and I know that just having cgroups v2 enabled doesn't fix anything for me.
Yeah, I do see some people still reporting it on v2, for example this one. Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, NVIDIA runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused this issue. In our case, going from v1 to v2 seems to have fixed it so far, for a week or so. I will keep monitoring in case it comes back.
It has been over a week. Did you see the error again?
How do I get these logs to find the device numbers for my use case?
@matifali You can simply use
Here,
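The command in the reply above is elided; one hedged alternative for finding the device major/minor numbers directly (rather than from the systemd logs) is:

# List the NVIDIA character devices with their major, minor numbers.
# The major number for /dev/nvidia* is typically 195; the output below is illustrative.
ls -l /dev/nvidia*
# crw-rw-rw- 1 root root 195,   0 Jan  1 00:00 /dev/nvidia0
# crw-rw-rw- 1 root root 195, 255 Jan  1 00:00 /dev/nvidiactl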
I've just fixed the same issue on Ubuntu 22.04 by changing my docker-compose file; your final docker-compose file would look like this:
version: '3'
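The compose file above is truncated. As a hedged sketch of the same idea using the plain docker CLI (not necessarily what the compose file contains), the commonly suggested change is to pass explicit device cgroup rules alongside the devices so the allowlist survives a daemon-reload:

# Sketch: explicit device cgroup rule for the NVIDIA devices (major 195 for /dev/nvidia*;
# the nvidia-uvm major is allocated dynamically, check it with `ls -l /dev/nvidia-uvm`).
docker run -d --name test \
  --gpus all \
  --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
  --device-cgroup-rule 'c 195:* rmw' \
  nvidia/cuda:11.2.1-devel-ubuntu20.04 sleep infinity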
And what if we are not using docker-compose, @RezaImany? I am using terraform to provision with the
The root cause of this error is that the device cgroup controller does not allow the container to reconnect to NVML until a restart, so you have to adjust the cgroup settings to bypass that limitation. The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.
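In docker terms, the blunt version of that workaround is simply the following (not suitable for shared hosts, as the next comment points out):

# Sketch: --privileged lifts the device cgroup restrictions entirely.
docker run -d --privileged --gpus all nvidia/cuda:11.2.1-devel-ubuntu20.04 sleep infinity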
For my use case, multiple people are using the same machine, and setting --privileged is not an option.
Hello, what is the status of this problem? I still have the same problem on cgroup v2:
# systemctl --version
systemd 249 (249.11-0ubuntu3.11)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.14.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.3-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev
go: go1.20.8
libseccomp: 2.5.3
# containerd --version
containerd containerd.io 1.6.24 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
# uname -a
Linux toor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# docker info
...
Cgroup Driver: systemd
Cgroup Version: 2
...
@slapshin Have you followed this approach? https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
I can't set
NVIDIA/nvidia-docker#1671 (comment) - it is working for me
1. Issue or feature description
Failed to initialize NVML: Unknown Error

The error does not occur when the NVIDIA docker container is initially created, but it happens after calling systemctl daemon-reload.

It works fine with kernel 4.19.91 and systemd 219, but it does not work with kernel 5.10.23 and systemd 239.

I tried to monitor it with bpftrace.

During container startup, I can see the event:

And I can see the devices.list in the container as below:

But after running systemctl daemon-reload, I find the event:

And the devices.list in the container is as below:

cat /sys/fs/cgroup/devices/devices.list
...
c 195:* m

The GPU device is no longer rw. Currently I'm not able to use cgroup v2. Any suggestions about it? Thanks very much.

2. Steps to reproduce the issue
docker run --env NVIDIA_VISIBLE_DEVICES=all --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --name test -itd nvidia/cuda:11.2.1-devel-ubuntu20.04 bash
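A hedged sketch of observing the symptom once the container above is running, assuming cgroup v1 with the devices controller (the inline comments describe the expected behaviour, not captured output):

# Before the reload, the container's device allowlist still grants rw access to the GPU devices.
docker exec test cat /sys/fs/cgroup/devices/devices.list
sudo systemctl daemon-reload
# After the reload, the rw permissions on the GPU entries are dropped (e.g. only "c 195:* m" remains)
# and NVML can no longer be initialized inside the container.
docker exec test cat /sys/fs/cgroup/devices/devices.list
docker exec test nvidia-smi    # Failed to initialize NVML: Unknown Error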
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V