New approach, based on @aledbf's fix:
Currently ws-manager creates the pod once and assumes the creation will succeed.
Since Kubernetes 1.22 this is no longer a valid assumption; ws-manager should instead attempt to create the pod and, if the attempt fails due to out-of-resource errors (such as CPU or memory), retry. A sketch of this retry loop follows.
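A minimal sketch of what that retry could look like, assuming client-go is available. The helper name `createWorkspacePodWithRetry`, the retry counts, the timeouts, and the check on the `OutOf*` rejection reason are all illustrative assumptions, not the actual ws-manager implementation:

```go
// Sketch only: create the workspace pod and retry if the kubelet rejects it
// with an out-of-resource reason (OutOfmemory / OutOfcpu). Helper name,
// retry counts, and timeouts are illustrative, not the real ws-manager code.
package manager

import (
	"context"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func createWorkspacePodWithRetry(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		if _, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			lastErr = err
		} else {
			rejected := false
			// Wait until the pod is admitted by a kubelet or rejected with an OutOf* reason.
			lastErr = wait.PollImmediate(2*time.Second, 60*time.Second, func() (bool, error) {
				p, err := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
				if err != nil {
					return false, err
				}
				if p.Status.Phase == corev1.PodFailed && strings.HasPrefix(p.Status.Reason, "OutOf") {
					rejected = true
					return true, nil
				}
				// Treat the pod as admitted once it is bound to a node and not failed.
				return p.Spec.NodeName != "" && p.Status.Phase != corev1.PodFailed, nil
			})
			if lastErr == nil && !rejected {
				return nil // pod was admitted
			}
			// Delete the rejected pod so its name can be reused on the next attempt.
			_ = client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
		}
		// Back off before retrying; pressure may clear once terminating workspaces are gone.
		time.Sleep(time.Duration(attempt+1) * 2 * time.Second)
	}
	return lastErr
}
```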
Bug description
Sometimes workspaces fail to start due to an OOM error from the kubelet.
We suspect this happens when a node is at capacity but still has workspaces that are terminating.
The Kubernetes scheduler appears to ignore terminating pods, while the kubelet does not.
As a result, the scheduler places a pod on a node it believes has room for it, but the kubelet then rejects the pod with an OOM error.
This seems to be related to:
kubernetes/kubernetes#106884
kubernetes/kubernetes#104560
As a temporary workaround for this issue, I will create a controller that cordons a node once it reaches its maximum workspace capacity (sketched below).
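A rough sketch of such a workaround controller, assuming client-go access; the `maxWorkspacesPerNode` threshold and the `component=workspace` label selector are assumptions for illustration, not the real configuration:

```go
// Sketch of a temporary workaround: cordon a node once it carries the maximum
// number of workspace pods (terminating pods included, since the kubelet still
// accounts for them). Threshold and label selector are illustrative assumptions.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const maxWorkspacesPerNode = 10 // assumed threshold

func reconcileNodes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		// Count workspace pods on this node; the list includes terminating pods.
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node.Name,
			LabelSelector: "component=workspace", // assumed workspace pod label
		})
		if err != nil {
			return err
		}

		shouldCordon := len(pods.Items) >= maxWorkspacesPerNode
		if node.Spec.Unschedulable == shouldCordon {
			continue // nothing to change
		}

		// Cordon when full, uncordon when capacity frees up again.
		node.Spec.Unschedulable = shouldCordon
		if _, err := client.CoreV1().Nodes().Update(ctx, &node, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func runController(ctx context.Context, client kubernetes.Interface) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = reconcileNodes(ctx, client)
		}
	}
}
```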
Steps to reproduce
Workspace affected
No response
Expected behavior
No response
Example repository
No response
Anything else?
No response