Starting new workspace causes it to fail with OOM error from kubelet #8253

Closed
sagor999 opened this issue Feb 16, 2022 · 3 comments · Fixed by #8289
Assignees
Labels
team: workspace Issue belongs to the Workspace team

Comments

@sagor999
Contributor

Bug description

Workspaces sometimes fail to start because kubelet rejects the pod with an OOM error.
We suspect this happens when a node is at capacity but still has workspaces that are terminating.
It seems the k8s scheduler ignores terminating pods, but kubelet does not.
So the scheduler places a pod on a node it thinks has room for it, and kubelet then rejects the pod with an OOM error.
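
The suspected mismatch can be illustrated with a small sketch. It is a toy model of the hypothesis above, not Kubernetes code: the `Pod` class, both accounting functions, and the memory figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    memory_mib: int
    terminating: bool = False


def scheduler_free_mib(capacity_mib, pods):
    # Hypothesized scheduler view: terminating pods no longer count.
    return capacity_mib - sum(p.memory_mib for p in pods if not p.terminating)


def kubelet_free_mib(capacity_mib, pods):
    # Hypothesized kubelet view: a terminating pod still holds its memory.
    return capacity_mib - sum(p.memory_mib for p in pods)


node_capacity_mib = 8192  # illustrative node size
pods = [Pod("ws-a", 4096), Pod("ws-b", 4096, terminating=True)]
new_pod = Pod("ws-c", 4096)

fits_for_scheduler = new_pod.memory_mib <= scheduler_free_mib(node_capacity_mib, pods)
fits_for_kubelet = new_pod.memory_mib <= kubelet_free_mib(node_capacity_mib, pods)
print(fits_for_scheduler, fits_for_kubelet)  # True False
```

In this model the scheduler sees room for `ws-c` while kubelet does not, which matches the rejection described above.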

This seems to be related to:
kubernetes/kubernetes#106884
kubernetes/kubernetes#104560

As a temporary workaround, I will create a controller that cordons a node once it reaches its maximum capacity of workspaces.
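
A minimal sketch of the decision such a controller would make, assuming it counts every workspace pod on a node, including terminating ones. The function name, its parameters, and the limit of 3 are hypothetical. A real controller would also have to watch pods and mark the node unschedulable through the Kubernetes API, which is omitted here.

```python
from collections import Counter


def desired_cordon_state(workspace_pod_nodes, all_nodes, max_workspaces_per_node):
    """Return {node name: should the node be cordoned?}.

    workspace_pod_nodes holds one node name per workspace pod, and is
    assumed to include pods that are still terminating.
    """
    counts = Counter(workspace_pod_nodes)
    return {node: counts[node] >= max_workspaces_per_node for node in all_nodes}


state = desired_cordon_state(
    workspace_pod_nodes=["node-1", "node-1", "node-1", "node-2"],
    all_nodes=["node-1", "node-2", "node-3"],
    max_workspaces_per_node=3,  # hypothetical limit
)
print(state)  # {'node-1': True, 'node-2': False, 'node-3': False}
```

Returning the desired state for every node, rather than only the nodes to cordon, also tells the controller when a node has dropped below the limit and can be uncordoned.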

Steps to reproduce

Workspace affected

No response

Expected behavior

No response

Example repository

No response

Anything else?

No response

sagor999 added the team: workspace label Feb 16, 2022
sagor999 self-assigned this Feb 16, 2022
@sagor999
Contributor Author

Related: #8238

@sagor999
Contributor Author

Related: #7969

@sagor999
Contributor Author

New approach, based on @aledbf's fix:
Currently ws-manager just creates the pod and expects that it will get created.
Since k8s 1.22 that is no longer a valid approach. Instead, it should try to create the pod and, if that fails, try again, because creation can fail due to out-of-resource errors (such as CPU or memory).
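
The create-and-retry idea can be sketched roughly as follows. This is not the ws-manager code or the fix in #8289: `OutOfResourceError`, `create_with_retry`, and the `create_pod` callable are hypothetical stand-ins for "create the pod and detect that it was rejected for lack of resources".

```python
import time


class OutOfResourceError(Exception):
    """Hypothetical error: the pod was rejected for lack of CPU or memory."""


def create_with_retry(create_pod, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call create_pod() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return create_pod()
        except OutOfResourceError as err:
            last_error = err
            if attempt < max_attempts:
                # Give terminating pods time to release their resources.
                sleep(delay_seconds)
    raise last_error


# Usage with a fake create_pod that is rejected twice before succeeding.
attempts = []


def fake_create_pod():
    attempts.append(1)
    if len(attempts) < 3:
        raise OutOfResourceError("node is out of memory")
    return "workspace-pod"


result = create_with_retry(fake_create_pod, sleep=lambda seconds: None)
print(result, len(attempts))  # workspace-pod 3
```

Only the out-of-resource error is retried here, so any other failure surfaces immediately instead of being retried.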

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Feb 18, 2022