xid error mitigation #262

crystalzhaizhai · 2022-12-05T18:04:29Z

This PR makes the xid error mitigation public.

richardsliu · 2022-12-05T18:08:02Z

demo/xid-error-mitigation.yaml

+cluster.
+
+```
+apiVersion: v1


Can we add these as actual files in a directory, with a README.md file explaining how to deploy them? Then the user could just clone the repo and deploy from their command line.

grac3gao

This is a short-term mitigation that can't be relied on in the long run. The device plugin change introduced in this doc will be reverted by the add-on manager during an upgrade. If the GPU nodes are used together with autoscaling, new nodes may not contain this mitigation.

Ideally, we would like to have a more stable and simpler solution for this problem (e.g. the user only needs to configure the configmap to customize the XIDs they want).

It would be better to explain this situation in the doc and state that it is a short-term mitigation.

thomas-riccardi · 2023-09-12T15:17:57Z

> Ideally, we would like to have a more stable and simpler solution for this problem (e.g. the user only needs to configure the configmap to customize the XIDs they want).

We would like that, yes: just edit/create a configmap.

For now, we can modify the /etc/nvidia/gpu_config.json file on the host, but we have no guarantee it's done before the gpu-device-plugin loads it, and have no easy way to reload it if it's done after. (alternatively, gpu-device-plugin could watch it?)
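To make the host-file edit concrete, here is a minimal sketch of merging XID codes into a JSON config and replacing the file atomically, so a reader never sees a half-written file. The key name `HealthCriticalXid` is an assumption (the schema of gpu_config.json is not shown in this thread), and `add_critical_xids` is a hypothetical helper. It does not solve the ordering problem described above: the plugin may still have loaded the file before the edit.

```python
import json
import os
import tempfile

# Assumed key name; the real gpu_config.json schema is not shown in this thread.
XID_KEY = "HealthCriticalXid"


def add_critical_xids(config_path, xids):
    """Merge XID codes into the JSON config and replace the file atomically."""
    try:
        with open(config_path) as f:
            config = json.load(f)
    except FileNotFoundError:
        config = {}

    # Keep existing codes, add the new ones, and store them sorted.
    config[XID_KEY] = sorted(set(config.get(XID_KEY, [])) | set(xids))

    # Write to a temp file in the same directory, then rename it over the
    # original. Note that mkstemp creates the file with mode 0600.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
        os.replace(tmp_path, config_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return config
```

Calling `add_critical_xids("/etc/nvidia/gpu_config.json", [79])` on the host would add XID 79 under the assumed key while leaving other keys untouched.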

Also, xid 79 should IMO be there by default: it's not a user error: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Context: we get NVRM: Xid (PCI:0000:00:04): 79, pid=0, GPU has fallen off the bus. on GKE, and the node stays stuck with an unusable GPU. Our workload using the GPU at best reaches a CRITICAL/FATAL state and crash-loops, or at worst gets silently stuck.
The end goal is that this would somehow mark the node as broken, so that the GKE auto-repair feature kicks in.
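As a rough illustration of the detection half of that goal, the sketch below extracts the device and XID code from a kernel log line in the format quoted above and checks it against a set of critical codes. The names `parse_xid`, `is_critical`, and `CRITICAL_XIDS` are hypothetical, and how the node would then be marked as broken is left open, as it is in this thread.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:00:04): 79, pid=0, GPU has fallen off the bus.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)

# Illustrative set; XID 79 is the code discussed in this thread.
CRITICAL_XIDS = frozenset({79})


def parse_xid(line):
    """Return (device, xid) for an NVRM Xid log line, or None if no match."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("xid"))


def is_critical(line, critical=CRITICAL_XIDS):
    """Return True if the line reports an XID listed in `critical`."""
    parsed = parse_xid(line)
    return parsed is not None and parsed[1] in critical
```

A caller could feed it lines from the kernel log and react to the first critical match.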

xid error mitigation

53d7122

crystalzhaizhai requested review from grac3gao and richardsliu December 5, 2022 18:04

richardsliu reviewed Dec 5, 2022


grac3gao reviewed Dec 5, 2022
