
Add nvidia-driver-installer for CoreOS Container Linux #54

Closed

Conversation


@lsjostro lsjostro commented Feb 9, 2018

This PR adds nvidia driver installer for CoreOS Container Linux.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot. The email used to register you as an authorized contributor must be the email used for the Git commit.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@lsjostro
Author

lsjostro commented Feb 9, 2018

I signed it!

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.


@lsjostro lsjostro force-pushed the containerlinux-support branch from a1e6b20 to 889ffb0 Compare February 9, 2018 11:59
@googlebot

CLAs look good, thanks!


@googlebot googlebot added cla: yes and removed cla: no labels Feb 9, 2018
@discordianfish

@lsjostro I'm trying this out right now. This only addresses the driver parts though, right? It doesn't seem to install libnvidia-container or nvidia-container-runtime, correct?

@lsjostro
Author

@discordianfish correct. We are using Google's own NVIDIA k8s device plugin instead of the official NVIDIA plugin, which doesn't require a custom Docker runtime. It works really well for us.

Currently we have the CoreOS NVIDIA installer Dockerfile hosted here and the Docker image here

@discordianfish

@lsjostro Ah thanks for that hint! So this installer container and the k8s device plugin should be enough?

I thought it was needed, since my pods still fail to schedule (Insufficient nvidia.com/gpu). I'm using the same k8s device plugin manifest. Guess I've been down the wrong route.

When dropping the nvidia.com/gpu: 1 limit, the test pod gets scheduled but fails with:

libdc1394 error: Failed to initialize libdc1394
..
cudaRuntimeGetVersion() failed with error #35

The k8s plugin is running fine though:

I0213 10:52:04.891238    1412 nvidia_gpu.go:258] device-plugin started
I0213 10:53:35.096858    1412 nvidia_gpu.go:82] Found Nvidia GPU "nvidia0"
I0213 10:53:35.096935    1412 nvidia_gpu.go:209] starting device-plugin server at: /device-plugin/nvidiaGPU-1518519215.sock
I0213 10:53:35.097257    1412 nvidia_gpu.go:231] device-plugin server started serving
I0213 10:53:35.098269    1412 nvidia_gpu.go:240] device-plugin registered with the kubelet
I0213 10:53:35.098918    1412 nvidia_gpu.go:143] device-plugin: ListAndWatch start
I0213 10:53:35.098948    1412 nvidia_gpu.go:159] ListAndWatch: send devices &ListAndWatchResponse{Devices:[&Device{ID:nvidia0,Health:Healthy,}],}

Well, guess I have to dig deeper. Looks like the installer worked fine. Thanks a lot for that!

@discordianfish

I think with the approach of bind-mounting the shared libs from the host, one has to run ldconfig to update the ld cache before starting an application. The official DIGITS container doesn't do that.
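A minimal sketch of that ldconfig step, assuming the driver libraries are bind-mounted at /usr/local/nvidia/lib64 (the path and the helper's name are illustrative, not part of any of the projects discussed here):

```shell
#!/bin/sh
# Hypothetical helper for containers that receive the NVIDIA libraries via a
# hostPath bind mount instead of baking them into the image.
setup_nvidia_ldcache() {
    lib_dir="${1:-/usr/local/nvidia/lib64}"
    if [ -d "$lib_dir" ]; then
        # Register the mounted directory and rebuild the runtime linker
        # cache so the application can resolve libcuda.so and friends.
        echo "$lib_dir" > /etc/ld.so.conf.d/nvidia.conf
        ldconfig
        echo "ld cache refreshed for $lib_dir"
    else
        echo "no driver libs at $lib_dir; skipping ldconfig"
    fi
}

# On a GPU node this refreshes the cache; on nodes without the bind mount it
# just reports that the libraries are absent.
setup_nvidia_ldcache /usr/local/nvidia/lib64
```

Running something like this as (or from) the container entrypoint, before the application starts, would avoid the cudaRuntimeGetVersion() failure seen above.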

@lsjostro
Author

lsjostro commented Feb 13, 2018

Make sure you add -host-path=/opt/nvidia to the k8s device plugin command line.

example here
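For reference, a hypothetical excerpt of what that flag looks like in the device-plugin DaemonSet container spec (the image reference and names are placeholders, not the exact manifest the link points to):

```yaml
# Illustrative DaemonSet container snippet; the key line is -host-path,
# which must match where the CoreOS installer drops the driver artifacts.
containers:
  - name: nvidia-gpu-device-plugin
    image: k8s.gcr.io/nvidia-gpu-device-plugin   # placeholder image reference
    command:
      - /usr/bin/nvidia-gpu-device-plugin
      - -logtostderr
      - -host-path=/opt/nvidia
    volumeMounts:
      - name: device-plugin
        mountPath: /device-plugin
```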

@discordianfish

Yep, figured that out. Maybe the default should be changed; /home/kubernetes is an odd choice.

@vishh
Collaborator

vishh commented Feb 22, 2018

I'd recommend having a single daemonset perform driver installation and run the device plugin; that way, sharing of driver artifacts is controlled via a single config file.

On the other hand, we at Google do not have the bandwidth to set up CI and maintain installations for additional OSes. I would prefer not to merge this until we identify a maintenance plan.
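The single-DaemonSet layout suggested above could be sketched roughly like this (all names, images, and paths are hypothetical; the point is that the installer initContainer and the device plugin share one manifest and one driver path):

```yaml
# Sketch: one DaemonSet installs the driver in an initContainer, then runs the
# device plugin, so the shared artifact path is declared in a single place.
spec:
  initContainers:
    - name: nvidia-driver-installer
      image: example.com/coreos-gpu-installer   # hypothetical installer image
      securityContext:
        privileged: true
      volumeMounts:
        - name: nvidia-install-dir
          mountPath: /opt/nvidia
  containers:
    - name: nvidia-gpu-device-plugin
      image: k8s.gcr.io/nvidia-gpu-device-plugin   # placeholder
      command: ["/usr/bin/nvidia-gpu-device-plugin", "-host-path=/opt/nvidia"]
      volumeMounts:
        - name: device-plugin
          mountPath: /device-plugin
  volumes:
    - name: nvidia-install-dir
      hostPath:
        path: /opt/nvidia
    - name: device-plugin
      hostPath:
        path: /var/lib/kubelet/device-plugins
```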

@lsjostro
Author

@vishh thanks for the feedback! I totally understand that! We'll host it at https://github.com/shelmangroup/coreos-gpu-installer in the meanwhile.

@amoghkashyap

@lsjostro Thanks for the patch. NVIDIA drivers are installed on the V100. However, I am having difficulty getting the runtime set to NVIDIA. Because of this, I am not able to run nvidia-docker or use nvidia-smi inside the Kubernetes pod that needs GPU acceleration. Could you help me out?

CoreOS version : 2079.3.0

Also, I guess the Docker tag "master" has to be used in the daemonset.yml, since "latest" is not being pulled.

@lsjostro
Author

lsjostro commented May 9, 2019

@amoghkashyap sure! Mind copying the issue over to https://github.com/shelmangroup/coreos-gpu-installer ?

@lsjostro lsjostro closed this May 9, 2019