Can't install nvidia driver via nvidia.service #1283

Closed
tearfulDalvik opened this issue Dec 10, 2023 · 7 comments
Labels
kind/bug Something isn't working

Comments

@tearfulDalvik

Description

Can't install nvidia driver via nvidia.service

Environment and steps to reproduce

  1. Set-up:
    Flatcar stable installed on VMware ESXi 8.0.2 via OVA import, then manually upgraded to Flatcar beta
DISTRIB_ID="Flatcar Container Linux by Kinvolk"
DISTRIB_RELEASE=3760.1.0
DISTRIB_CODENAME="Oklo"
DISTRIB_DESCRIPTION="Flatcar Container Linux by Kinvolk 3760.1.0 (Oklo)"
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3760.1.0
VERSION_ID=3760.1.0
BUILD_ID=2023-11-20-1827
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3760.1.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3760.1.0:*:*:*:*:*:*:*"
  2. Task:
    journalctl -u nvidia -f
  3. Action(s):
    a. Assigned a P40-6Q GPU to the Flatcar VM
    b. journalctl -u nvidia -f
  4. Error:
Dec 10 09:09:15 flatcar-rke2-worker-2 systemd[1]: Starting nvidia.service - NVIDIA Configure Service...
Dec 10 09:09:15 flatcar-rke2-worker-2 setup-nvidia[18335]: Downloading Flatcar Container Linux Developer Container for version: 3760.1.0
Dec 10 09:09:16 flatcar-rke2-worker-2 setup-nvidia[18398]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 10 09:09:16 flatcar-rke2-worker-2 setup-nvidia[18398]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 10 09:09:30 flatcar-rke2-worker-2 setup-nvidia[18398]: [1.2K blob data]
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[18335]: Downloading NVIDIA 535.104.05 Driver
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[19275]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[19275]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[19275]: [2.3K blob data]
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[18335]: Extract the NVIDIA Driver Installer 535.104.05
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia/workdir/nvidia-workdir /
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[20411]: Creating directory NVIDIA-Linux-x86_64-535.104.05
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[20411]: Verifying archive integrity... OK
Dec 10 09:10:10 flatcar-rke2-worker-2 setup-nvidia[20411]: Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.05
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[20448]: ..................................................................................................................................>
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[18335]: /
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[18335]: Spawn system-nspawn container to install the NVIDIA drivers
Dec 10 09:10:12 flatcar-rke2-worker-2 sudo[20540]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/systemd-nspawn --read-only --volatile=overlay --image=/opt/nvidia/workdir/flatcar_develope>
Dec 10 09:10:12 flatcar-rke2-worker-2 sudo[20540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[30179]: ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia/current/usr/lib/modules/6.1.62-flatcar/video /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[30183]: insmod: ERROR: could not insert module nvidia.ko: No such device
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: nvidia.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: nvidia.service: Failed with result 'exit-code'.
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: Failed to start nvidia.service - NVIDIA Configure Service.
@tearfulDalvik tearfulDalvik added the kind/bug Something isn't working label Dec 10, 2023
@jepio
Member

jepio commented Dec 11, 2023

Hi @tearfulDalvik,

Could you paste the last few lines of dmesg after running sudo /usr/lib/nvidia/bin/setup-nvidia? "No such device" suggests that your GPU might not be supported by the default driver version, so you might want to explicitly select a different one.
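
Before digging into dmesg, it can help to confirm that the VM actually sees an NVIDIA PCI device. The following is a minimal sketch, not part of setup-nvidia: it reads the standard sysfs PCI tree and prints every device whose vendor ID is NVIDIA's (0x10de). The function name and the SYSFS_PCI override are hypothetical and exist only for illustration and testing.

```shell
# List PCI devices with NVIDIA's vendor ID (0x10de) as "address vendor:device".
# SYSFS_PCI is a hypothetical override so the function can be pointed at a test tree.
list_nvidia_pci() {
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file (also covers an unmatched glob).
        [ -r "$dev/vendor" ] || continue
        vendor=$(cat "$dev/vendor")
        [ "$vendor" = "0x10de" ] || continue
        device=$(cat "$dev/device")
        # Strip the leading "0x" so the output matches the "10de:xxxx" form used in dmesg.
        printf '%s %s:%s\n' "${dev##*/}" "${vendor#0x}" "${device#0x}"
    done
}
```

Calling list_nvidia_pci with no arguments on the affected machine should show the PCI address and ID of each NVIDIA device. An empty result would point at the passthrough configuration rather than at the driver.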

@jepio
Member

jepio commented Dec 11, 2023

Checking NVIDIA's driver download page, you might want to try selecting NVIDIA driver version 440.95.01. See the instructions here: https://www.flatcar.org/docs/latest/setup/customization/using-nvidia/#customization
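
As a rough sketch of what pinning a version could look like: this assumes, based on my reading of the linked docs rather than anything stated in this thread, that setup-nvidia picks up NVIDIA_DRIVER_VERSION from /etc/flatcar/nvidia-metadata. Please verify both the path and the key against the documentation for your release. The function name is hypothetical.

```shell
# Write a driver-version pin to a metadata file.
# Assumption: setup-nvidia reads NVIDIA_DRIVER_VERSION from /etc/flatcar/nvidia-metadata;
# this path and key are not confirmed in this thread, so check the linked docs first.
pin_nvidia_driver() {
    version="$1"
    target="${2:-/etc/flatcar/nvidia-metadata}"
    if [ -z "$version" ]; then
        echo "usage: pin_nvidia_driver VERSION [FILE]" >&2
        return 1
    fi
    mkdir -p "$(dirname "$target")"
    printf 'NVIDIA_DRIVER_VERSION=%s\n' "$version" > "$target"
}
```

Run as root, pin_nvidia_driver 440.95.01 would write the pin, after which restarting nvidia.service should make the installer try that version. Whether an older driver builds against the running kernel is a separate question.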

@tearfulDalvik
Author

tearfulDalvik commented Dec 11, 2023

hello @jepio,

Thank you. It seems vGPUs aren't supported: they need NVIDIA GRID drivers instead of the normal Linux drivers.
Also, may I know how to undo the nvidia.service and setup-nvidia installation?

sudo /usr/lib/nvidia/bin/setup-nvidia
ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.

/opt/nvidia/current/usr/lib/modules/6.1.62-flatcar/video /home/core
insmod: ERROR: could not insert module nvidia.ko: No such device

dmesg:

[130852.530069] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[130852.530546] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1b38)
                NVRM: installed in this system is not supported by the
                NVRM: NVIDIA 535.104.05 driver release.
                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                NVRM: in this release's README, available on the operating system
                NVRM: specific graphics driver download page at www.nvidia.com.
[130852.532087] nvidia: probe of 0000:02:00.0 failed with error -1
[130852.532299] NVRM: The NVIDIA probe routine failed for 1 device(s).
[130852.532499] NVRM: None of the NVIDIA devices were initialized.
[130852.532957] nvidia-nvlink: Unregistered Nvlink Core, major device number 246
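
For anyone hitting the same failure, the relevant facts in the dmesg output above are the PCI ID and the driver release that rejected it. Here is a small sketch that pulls both out of kernel log text on stdin. The function name is hypothetical, and the patterns are written only against the NVRM message format shown above, so other driver versions may word it differently.

```shell
# Print the PCI ID and driver release from an NVRM "not supported" message on stdin.
# Returns 1 if no such message is found.
unsupported_gpu_from_dmesg() {
    log=$(cat)
    # "(PCI ID: 10de:1b38)" -> "10de:1b38"
    pci_id=$(printf '%s\n' "$log" |
        sed -n 's/.*NVRM: The NVIDIA GPU .*(PCI ID: \([0-9a-fA-F:]*\)).*/\1/p' | head -n 1)
    # "NVRM: NVIDIA 535.104.05 driver release." -> "535.104.05"
    driver=$(printf '%s\n' "$log" |
        sed -n 's/.*NVRM: NVIDIA \([0-9.]*\) driver release.*/\1/p' | head -n 1)
    [ -n "$pci_id" ] || return 1
    printf 'pci_id=%s driver=%s\n' "$pci_id" "$driver"
}
```

Typical use would be sudo dmesg | unsupported_gpu_from_dmesg. The PCI ID can then be looked up in the supported-products appendix of the driver README that the NVRM message refers to.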

@jepio
Member

jepio commented Dec 12, 2023

I don't think we ever investigated GRID/vGPU drivers, as those require licensing.

To undo, you can remove /opt/nvidia and run systemctl mask --now nvidia.service.
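
The two undo steps could be scripted roughly as follows. This is only a sketch: the run helper and the DRY_RUN variable are hypothetical additions so that nothing is deleted by accident, and by default the commands are printed rather than executed.

```shell
# Print commands by default; execute them only when DRY_RUN=0 (hypothetical safety switch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Undo sketch: stop and mask the unit first, then delete the downloaded driver tree.
undo_nvidia_setup() {
    run systemctl mask --now nvidia.service
    run rm -rf /opt/nvidia
}
```

To actually apply it, set DRY_RUN=0 and call undo_nvidia_setup from a root shell. Masking before removing /opt/nvidia is a deliberate choice here, so the unit cannot start again and re-download the driver while the directory is being deleted.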

@tearfulDalvik
Author

Totally understandable. Thank you very much.

@sayanchowdhury
Member

One closing query: Did you manually initiate the nvidia.service, or did it trigger automatically?

@tearfulDalvik
Author

One closing query: Did you manually initiate the nvidia.service, or did it trigger automatically?

Hello,
It was triggered automatically.
