xid error mitigation #262

Open
wants to merge 1 commit into
base: master

crystalzhaizhai (Contributor)

This makes the Xid error mitigation public.

cluster.

```
apiVersion: v1
```
Contributor

Can we add these as actual files in a directory, with a README.md file explaining how to deploy them? Then users could just clone the repo and deploy from their command line.

grac3gao (Member) left a comment

This is a short-term mitigation that can't be relied on in the long run. The device plugin change introduced in this doc will be reverted by the add-on manager during an upgrade. If the GPU nodes are used together with autoscaling, new nodes may not contain this mitigation.

Ideally, we would like to have a more stable and simpler solution for this problem (e.g. users would only need to edit a ConfigMap to choose the Xids they care about).

Please add more explanation of this situation to the doc, stating clearly that it is a short-term mitigation.

@thomas-riccardi

> Ideally, we would like to have a more stable and simpler solution for this problem (e.g. users would only need to edit a ConfigMap to choose the Xids they care about).

We would like that, yes: just edit/create a configmap.

For now, we can modify the /etc/nvidia/gpu_config.json file on the host, but we have no guarantee that this happens before gpu-device-plugin loads it, and no easy way to make the plugin reload it if it happens afterwards. (Alternatively, gpu-device-plugin could watch the file?)
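
To make the "watch it" idea concrete, here is a minimal sketch of polling a JSON file and reloading it when its modification time changes. It is written in Python purely for illustration and is not the plugin's code; `ConfigWatcher`, `poll`, and `on_change` are hypothetical names, and nothing is assumed about the contents of gpu_config.json beyond it being JSON.

```python
import json
import os
from typing import Any, Callable, Optional


class ConfigWatcher:
    """Reload a JSON file whenever its modification time changes (hypothetical helper)."""

    def __init__(self, path: str, on_change: Callable[[Any], None]) -> None:
        self.path = path
        self.on_change = on_change
        self._last_mtime_ns: Optional[int] = None

    def poll(self) -> bool:
        """Check the file once; return True if a new config was delivered."""
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False  # file not written yet; try again on the next poll
        if mtime_ns == self._last_mtime_ns:
            return False  # unchanged since the last successful load
        try:
            with open(self.path, encoding="utf-8") as f:
                config = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return False  # removed or caught mid-write; retry on the next poll
        self._last_mtime_ns = mtime_ns
        self.on_change(config)
        return True
```

A caller would invoke `poll()` periodically (for example every few seconds) with the path from this thread. This would cover both orderings: a file written before startup is picked up on the first poll, and a file written later is picked up on a subsequent one.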

Also, xid 79 should IMO be there by default: it's not a user error: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Context: we get `NVRM: Xid (PCI:0000:00:04): 79, pid=0, GPU has fallen off the bus.` on GKE, and the node stays stuck with an unusable GPU. Our workload using the GPU at best reaches a CRITICAL/FATAL state and crash-loops, or at worst gets silently stuck.
The end goal is that this would somehow mark the node as broken, so that the GKE auto-repair feature kicks in.
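
As a rough illustration of detecting this condition from kernel logs, the sketch below parses a line in the format quoted above and checks the Xid against a set of codes treated as node-level failures. The regex is derived only from that one example line, so other Xid messages may need a looser pattern; `parse_xid`, `is_critical`, and `CRITICAL_XIDS` are hypothetical names, and marking the node as broken is left out.

```python
import re
from typing import FrozenSet, Optional, Tuple

# Matches kernel log lines shaped like the one quoted in this thread:
#   NVRM: Xid (PCI:0000:00:04): 79, pid=0, GPU has fallen off the bus.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)

# Assumption: the set of Xids treated as node-level failures.
# 79 is the only code discussed in this thread.
CRITICAL_XIDS: FrozenSet[int] = frozenset({79})


def parse_xid(line: str) -> Optional[Tuple[str, int]]:
    """Return (pci_device, xid) for an NVRM Xid log line, or None otherwise."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("xid"))


def is_critical(line: str, critical: FrozenSet[int] = CRITICAL_XIDS) -> bool:
    """True if the line reports an Xid that is in the critical set."""
    parsed = parse_xid(line)
    return parsed is not None and parsed[1] in critical
```

Keeping the critical set as a parameter mirrors the request above: the list of Xids could then come from user configuration rather than being hard-coded.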


4 participants