xid error mitigation #262
cluster.

```
apiVersion: v1
```
Can we add these as actual files in a directory, with a README.md explaining how to deploy them? Then the user could just clone the repo and deploy from their command line.
This is a short-term mitigation that cannot be used in the long run. The device plugin change introduced in this doc will be reverted by the add-on manager during an upgrade. If the GPU nodes are used together with autoscaling, new nodes may not contain this mitigation.
Ideally, we would like a more stable and simpler solution to this problem (e.g. users only need to edit a ConfigMap to customize the XIDs they want handled).
It would be better to explain this situation in the doc and state clearly that this is a short-term mitigation.
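To make the ConfigMap idea concrete, here is a rough sketch of what such a configuration could look like. The ConfigMap name, namespace, and key are made up for illustration; they are not an existing device plugin interface as far as this thread establishes, and the device plugin would need a change to read them.

```yaml
# Hypothetical sketch only: the name, namespace, and key below are
# illustrative assumptions, not an existing device plugin interface.
apiVersion: v1
kind: ConfigMap
metadata:
  name: xid-config          # hypothetical name
  namespace: kube-system    # assumed namespace
data:
  # Comma-separated XID codes that should mark a GPU as unhealthy.
  # 79 is the code discussed in this thread.
  critical-xids: "79"
```

With something along these lines, changing the handled XIDs would be a single ConfigMap edit, and it would not depend on patching the device plugin manifest that the add-on manager reverts on upgrade.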
We would like that, yes: just edit or create a ConfigMap. For now, we can modify the
Also, XID 79 should IMO be there by default, since it is not a user error: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
Context: we get
This PR makes the XID error mitigation public.