nvidia-gpu-device-plugin gets OOM killed #202
Comments
@omesser, Thanks for reporting this issue. On GKE 1.19, you can increase the GPU device plugin memory limit with |
Hi @pradvenkat, |
Encountered the same issue on 1.19.x. This is also what I see in the source. Here is the graph of memory utilization over time after I set the limit to 200 Mb. If the timing is right, running out of memory on the driver can cause running pods to fail with outofnvidia.com/gpu status. I am altering the memory limit manually for now, but this should be tested and fixed. |
I noticed that when I edit the daemonset manually, the container's limits.memory setting is automatically overwritten after a few seconds, back to the original value (20Mi). If I delete the daemonset, it is also automatically recreated. What's the right way of changing this value so that it doesn't get overwritten? |
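Editor's sketch of the manual change described above, not something posted in the thread: a small Python helper that builds a strategic-merge patch body setting one container's memory limit. The container name and the 200Mi value are assumptions for illustration; check the real container name in your daemonset before using it. On clusters where the addon manager runs in Reconcile mode, such a patch would likely be reverted.

```python
import json


def memory_limit_patch(container_name: str, memory_limit: str) -> dict:
    """Build a strategic-merge patch that sets one container's memory limit.

    Strategic merge matches entries in `containers` by `name`, so only the
    named container's limit is touched.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "resources": {"limits": {"memory": memory_limit}},
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # "nvidia-gpu-device-plugin" as the container name is an assumption.
    print(json.dumps(memory_limit_patch("nvidia-gpu-device-plugin", "200Mi")))
```

The printed JSON could then be passed to something like `kubectl -n kube-system patch daemonset nvidia-gpu-device-plugin --patch '<json>'`, assuming the daemonset lives in `kube-system` under that name.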
Hi @assapin, we're installing the GPU device plugin via the installer, which might be different from how you install it. And when running
What I'm seeing on our gke clusters is
So, in my clusters, changes to the daemonset persist, and addon manager isn't ruining my day and overriding the limit values. From what I gather there's nothing you can do except change the way you install the gpu plugin so you'll have |
@assapin Sorry for the inconvenience this causes. The memory limit fix (we changed the memory limit from 20 MiB to 50 MiB) was pushed to 1.21 first (all 1.21 versions currently in the Rapid channel contain this fix) and then backported to 1.20 about half a month ago. However, after checking our recent releases, the backported fix is not currently available in 1.20 (the closest 1.20 release that will include the fix should be 1.20.9-2100, which will be available next week). Starting from 1.20, the addon manager mode changed from EnsureExists to Reconcile (this change is necessary for us to update the device plugin automatically so it can use new features together with the cluster upgrade); the side effect is that manual edits to the device plugin YAML are reverted. To change the memory limit quickly, I would suggest upgrading your cluster again to a 1.21 version; all 1.21 versions should contain this fix. Again, sorry for the inconvenience. Please let us know if 50 MiB is not enough for your case (from the memory utilization graph you included, 50 MiB may not be sufficient either...) |
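Editor's sketch, not from the thread: one way to tell which mode applies is to read the `addonmanager.kubernetes.io/mode` label from the daemonset manifest (for example, the output of `kubectl get daemonset ... -o json` loaded into a dict). The helper names below are hypothetical, and the code assumes this label is the only signal the addon manager uses.

```python
from typing import Optional

MODE_LABEL = "addonmanager.kubernetes.io/mode"


def addon_mode(manifest: dict) -> Optional[str]:
    """Return the addon manager mode label of a manifest, or None if absent."""
    metadata = manifest.get("metadata") or {}
    labels = metadata.get("labels") or {}
    return labels.get(MODE_LABEL)


def edits_likely_reverted(manifest: dict) -> bool:
    """True when the object is labeled Reconcile, i.e. manual edits are
    expected to be overwritten by the addon manager."""
    return addon_mode(manifest) == "Reconcile"


if __name__ == "__main__":
    example = {"metadata": {"labels": {MODE_LABEL: "EnsureExists"}}}
    print(addon_mode(example), edits_likely_reverted(example))
```

With `EnsureExists`, the addon manager only recreates the object if it is missing, which matches the reports in this thread that edits persist on some versions.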
@assapin - I can indeed verify the addon manager mode here is version dependent (so, imo my initial assumption that it's somehow related to how I installed it was wrong). v1.17.17-gke.9100 - Reconcile (mem limits 10M) @grac3gao
Not everyone can/wants to upgrade k8s version to 1.21 (or generally speaking) willy-nilly, and things like this are true blockers for stable/enterprise environments (even in this specific scenario/issue as you said, 50MiB might not be enough, but there are always other problematic parameters people would need to tweak). |
Yes, I saw this discussion on Istio and tracked down the issue to the "Reconcile" label, that causes the addon manager to re-apply the configuration it has stored locally on the master(?) @omesser great that you've managed to find a specific version that comes with EnsureExists. @grac3gao |
@grac3gao - ping. Any followup here? |
Sorry for the late response. For the nvidia-device-plugin OOM issue, we have the following mitigations and plans:
Extending the workarounds in point 3, we are planning to pre-configure several nvidia-gpu-device-plugin daemonsets with different memory limit levels. In the future, users will be able to change node labels to switch to an nvidia-gpu-device-plugin daemonset whose memory limit accommodates their workloads. |
@grac3gao-zz
Obviously that is quite a hacky way around this. I would not want to go with such a solution, having to not consider such nodes as ones with
I'll be waiting to see what this is all about then, I suppose, though I can't say I understand why you would go with this solution instead of either upping the limit to something that is safely above real-life usage under high loads, or providing a way to interact with the addon manager in a meaningful way to configure things like that. Anyways, |
I want to echo Oded's input
What surprises me is - how come so few people encountered this issue?
I would have thought It would reproduce for every single ML team working on
GKE as a training env?
…On Fri, Oct 22, 2021, 22:44 Oded Messer ***@***.***> wrote:
@grac3gao-zz <https://github.com/grac3gao-zz>
Thanks for the response, unfortunately, not good news I would say
"Modify any workloads you have, which required the '
cloud.google.com/gke-accelerator' label to work with the new '
cloud.google.com/gke-accelerator-modified' label"
Obviously that is quite a hacky way around this. I would not want to go
with such a solution, having to not consider such nodes as ones with
cloud.google.com/gke-accelerator just because of arbitrarily low settings
of memory limit.
"we are planning to pre-configure several nvidia-gpu-device-plugin
daemonsets with different memory limit level"
I'll be waiting to see what this is all about then, I suppose, Though I
can't say I understand why you would go with the solution instead of either
upping the limit to something which is safely upwards of real life usage
under high loads, or providing a way to interact with the addon manager in
a meaningful way to configure things like that.
Anyways,
Cheers and thanks for answering
|
Hey folks,
In our experiments with running GPU loads on GKE over at iguazio, we've hit an OOM kill of the NVIDIA GPU device plugin pod during a GPU load test.
This happened on Ubuntu nodes with GPUs (n1-standard-16, though I don't think it matters 😄), running GKE 1.19.9-gke.1900.
I suspect the resources allocated for it (memory limits) might not be enough. Maybe raise it to 40Mi?
We're using the documented way of installing it, as described in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
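Editor's sketch, not part of the original report: the thread mixes binary and decimal memory units (20Mi, 10M, 50 MiB), so a small converter can help when comparing observed usage against a limit. It handles only plain integer quantities with the common Kubernetes suffixes; the function name is hypothetical and this is not a full implementation of the Kubernetes quantity grammar.

```python
import re

# Decimal (k, M, G) and binary (Ki, Mi, Gi) multipliers used by Kubernetes
# resource quantities. Only a subset of the full grammar is covered here.
_MULTIPLIERS = {
    "": 1,
    "k": 10**3,
    "M": 10**6,
    "G": 10**9,
    "Ki": 2**10,
    "Mi": 2**20,
    "Gi": 2**30,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '20Mi' or '10M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|k|M|G)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.group(1), match.group(2) or ""
    return int(number) * _MULTIPLIERS[suffix]


if __name__ == "__main__":
    for q in ("10M", "20Mi", "40Mi"):
        print(q, quantity_to_bytes(q))
```

Note that a 10M limit is slightly less than half of a 20Mi limit, since M is decimal and Mi is binary.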