
[ws-daemon] avoid hitting workspace memory limits #13254

Closed
kylos101 opened this issue Sep 23, 2022 · 23 comments · Fixed by #14507
@kylos101
Contributor

kylos101 commented Sep 23, 2022

Is your feature request related to a problem? Please describe

Sometimes the workspace container gets OOM-killed because high memory utilization on the node triggers the kernel's OOM killer.

Thanks, @aledbf , for the idea!

@csweichel @Furisto @utam0k wdyt? 🤔 @atduarte for awareness.

Describe the behaviour you'd like

Wrap the processes inside the workspace in a nested cgroup with a slightly lower memory limit (for example, 5% lower), so that a process inside the pod is killed instead of the whole workspace container.

Describe alternatives you've considered

n/a

Additional context

Refer to this internal incident for background.


@utam0k
Contributor

utam0k commented Sep 26, 2022

@kylos101 This problem happens with cgroup v2 (SaaS), right? I'd like to know whether we have to consider cgroup v1 as well.

@kylos101
Contributor Author

@utam0k it happens with cgroup v1 on self-hosted; not sure about SaaS and cgroup v2.

@utam0k
Contributor

utam0k commented Sep 26, 2022

@Furisto How about setting memory.high = memory.max * 0.95? WDYT? This setting will probably affect #11156, though. Last time it was supposed to be 80% (the kubelet default), so 95% should be better than that.
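
A minimal sketch of what that could look like for cgroup v2, assuming ws-daemon already knows the workspace's cgroup path; the helper name and the ratio are illustrative, not a decided design:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// setMemoryHigh sets memory.high to ratio * memory.max for a cgroup v2 group.
// cgroupPath is the absolute path of the workspace cgroup under /sys/fs/cgroup.
func setMemoryHigh(cgroupPath string, ratio float64) error {
	raw, err := os.ReadFile(filepath.Join(cgroupPath, "memory.max"))
	if err != nil {
		return err
	}
	limit := strings.TrimSpace(string(raw))
	if limit == "max" {
		// No hard limit configured, nothing to derive memory.high from.
		return nil
	}
	max, err := strconv.ParseInt(limit, 10, 64)
	if err != nil {
		return err
	}
	high := int64(float64(max) * ratio)
	return os.WriteFile(filepath.Join(cgroupPath, "memory.high"),
		[]byte(fmt.Sprintf("%d", high)), 0644)
}

With memory.high set below memory.max, the kernel throttles and reclaims the cgroup aggressively before the hard limit is reached, which should make container-level OOM kills less likely.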

@utam0k
Contributor

utam0k commented Sep 26, 2022

it happens with cgroup v1 on self-hosted; not sure about SaaS and cgroup v2.

Oops, that was unexpected. Thanks.

@utam0k
Contributor

utam0k commented Sep 26, 2022

cgroup v1 can't support memory QoS. Currently we avoid OOM by clearing the cache periodically, but this will not be enough for self-hosted customers. The easiest workaround is to upgrade to cgroup v2, but I'll give it some thought.
https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cgroup/plugin_cachereclaim.go

@utam0k
Contributor

utam0k commented Sep 26, 2022

I'll have to experiment a bit more to fully understand the behavior, but what about having a two-level cgroup?
1st level: the container hard limit set by kubelet
2nd level: a limit slightly below the hard limit (e.g. 95% of the memory limit) for the workspace processes
Then, if all workspace processes belong to this second level and the supervisor is not selected for OOM (e.g., via its nice value), it may be possible to prevent the entire workspace from going down. A rough sketch of the second level is below.
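
A minimal sketch of that second level for cgroup v1 (the memory controller used on self-hosted), assuming ws-daemon already knows the container's cgroup path; the "workspace" subgroup name and the 95% factor are illustrative, not decided values:

package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// createWorkspaceSubgroup creates a nested memory cgroup (v1) under the
// container's cgroup and gives it 95% of the parent's hard limit.
func createWorkspaceSubgroup(parent string) (string, error) {
	raw, err := os.ReadFile(filepath.Join(parent, "memory.limit_in_bytes"))
	if err != nil {
		return "", err
	}
	parentLimit, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return "", err
	}

	child := filepath.Join(parent, "workspace")
	if err := os.MkdirAll(child, 0755); err != nil {
		return "", err
	}
	// Divide first to avoid overflow when the parent limit is effectively unlimited.
	limit := parentLimit / 100 * 95
	err = os.WriteFile(filepath.Join(child, "memory.limit_in_bytes"),
		[]byte(strconv.FormatInt(limit, 10)), 0644)
	return child, err
}

User processes would then be attached to the child group by writing their PIDs to <child>/cgroup.procs, while workspacekit and supervisor stay in the parent.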

@utam0k
Contributor

utam0k commented Sep 26, 2022

In cgroup v2 this is a bit different, but there are already two levels of cgroups there.

@utam0k
Contributor

utam0k commented Sep 26, 2022

It might be possible to make this second-level value user-configurable by giving the user the cgroup namespace; as far as I know there is no reason not to use the cgroup namespace with cgroup v1, so it should be possible. Well, this is an optional feature.

@Furisto
Member

Furisto commented Oct 10, 2022

Wrap the processes inside the workspace in a nested cgroup with a slightly lower memory limit (for example, 5% lower), so that a process inside the pod is killed instead of the whole workspace container.

@kylos101 Let me restate this to ensure that we are on the same page: We currently have (slightly simplified)

workspacekit ring 0
  workspacekit ring 1
    supervisor
      ide
        user process 1
        user process 2

What you would like is for the user processes to run inside a different cgroup than workspacekit, supervisor and ide. Is that correct?

@jimmybrancaccio

As a side note, I am asking users who report this in support tickets to run the following in a workspace and then send me the created log file:

for i in $(seq 1 60); do date | tee -a memory.log; free --mega | tee -a memory.log; echo '' >> memory.log; echo '' >> memory.log; sleep 60; done

@kylos101
Contributor Author

kylos101 commented Oct 11, 2022

What you would like is for the user processes to run inside a different cgroup than workspacekit, supervisor and ide. Is that correct?

@Furisto Potentially? What would the user experience be like if we did that? For example, would user processes be OOM-killed within the workspace, making it less likely for the workspace itself to be OOM-killed? Assuming the IDE and extension memory usage remains low.

@utam0k
Contributor

utam0k commented Oct 12, 2022

@Furisto Potentially? What would the user experience be like if we did that? For example, would user processes be OOM-killed within the workspace, making it less likely for the workspace itself to be OOM-killed? Assuming the IDE and extension memory usage remains low.

I expect the user's process to be the one OOM-killed because the memory hard limit of the user's cgroup is smaller. The supervisor should survive and the workspace should stay up.

@utam0k
Contributor

utam0k commented Oct 12, 2022

Another way is to use eventfd. This gives us the ability to detect OOM events in a cgroup and handle them in userland, but we can't rely on the kernel OOM killer in that case. I haven't tried it, but it should be possible with memory.oom_control and cgroup.event_control; I can't be sure it really works without spending a bit of time on it. In other words, ws-daemon would pick a process and kill it before the workspace itself OOMs. Maybe this is more flexible. A rough sketch is below.
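
A rough sketch of that cgroup v1 mechanism, assuming golang.org/x/sys/unix and keeping error handling minimal; the helper name is made up for illustration:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// watchOOM registers for OOM notifications on a cgroup v1 memory cgroup and
// invokes onOOM every time the kernel reports an OOM event in that group.
func watchOOM(cgroupPath string, onOOM func()) error {
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}

	oomControl, err := os.Open(filepath.Join(cgroupPath, "memory.oom_control"))
	if err != nil {
		return err
	}

	// Ask the cgroup to signal the eventfd on OOM events:
	// "<eventfd> <fd of memory.oom_control>" written to cgroup.event_control.
	cfg := fmt.Sprintf("%d %d", efd, oomControl.Fd())
	if err := os.WriteFile(filepath.Join(cgroupPath, "cgroup.event_control"), []byte(cfg), 0600); err != nil {
		return err
	}

	go func() {
		// Keep memory.oom_control open while listening; closing it deregisters the event.
		defer oomControl.Close()
		buf := make([]byte, 8)
		for {
			if _, err := unix.Read(efd, buf); err != nil {
				return
			}
			onOOM()
		}
	}()
	return nil
}

If ws-daemon should do the killing itself, the kernel OOM killer for the cgroup could additionally be disabled by writing 1 to memory.oom_control, but as noted, we then fully own the recovery path.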

@Furisto
Member

Furisto commented Oct 12, 2022

Another way is to use eventfd...

I have been thinking in the same direction; specifically, I was inspired by what Facebook has done with oomd. Oomd would not work for self-hosted as it only supports cgroup v2, but the principle is interesting. We could use oomd for SaaS and write our own implementation for cgroup v1, or just use our own implementation for both.

Another option that could be simpler to implement and would work for both cgroup v1 and v2 is to modify the oom_score_adj value. If we adjust the OOM score of the IDE, we would have to keep doing this in a goroutine, I think, since child processes inherit the value of their parent, but that is not too bad. I will spend some time prototyping this approach (rough sketch below).
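
Not the actual prototype, just a minimal sketch of the idea; the helper names, the -900 value and the 10s refresh interval are assumptions (the refresh loop exists because newly spawned children inherit their parent's value):

package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"time"
)

// adjustOOMScore writes an oom_score_adj value for a process. Negative values
// make the OOM killer less likely to pick it; +1000 makes it the preferred victim.
func adjustOOMScore(pid int, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(adj)), 0644)
}

// protectProcesses periodically re-applies a low score to the given PIDs
// (e.g. supervisor and IDE) so user workloads are chosen first on OOM.
func protectProcesses(ctx context.Context, pids func() []int) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, pid := range pids() {
				_ = adjustOOMScore(pid, -900)
			}
		}
	}
}

The refresh loop is the goroutine mentioned above; a more robust variant would react to process-start events instead of polling.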

@Furisto
Member

Furisto commented Oct 12, 2022

Some more info on the second approach:

The badness heuristic is used to determine which process should be killed in out-of-memory situations by the OOM killer. The result of this calculation is surfaced in /proc/[pid]/oom_score for each process. A higher score means that a process is more likely to be killed.

The score can be adjusted (provided the writing process has CAP_SYS_RESOURCE) by writing to either /proc/[pid]/oom_adj (deprecated) or the newer /proc/[pid]/oom_score_adj. Values between +1000 and -1000 can be written to /proc/[pid]/oom_score_adj. A value of -1000 means that the process will never be killed by the OOM killer.

Whenever an out-of-memory situation occurs, out_of_memory is called. Depending on how the condition was caused (memory is scarce either system-wide or just in the cgroup), either all processes or the processes in the cgroup and its child cgroups are scanned, and the badness score is calculated for each process. Some tasks, like kernel tasks and the init process, are exempt and cannot be killed.

The badness heuristic finds the task with the highest memory consumption (as measured by the task's resident set size (RSS), page table and swap space usage) and then modifies that score with oom_score_adj. The resident set size is the memory consumed by file memory, anonymous memory and shmem (tmpfs memory, SysV shared memory, POSIX shared memory).
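
As a rough worked example (the exact kernel formula differs between versions, so treat the numbers as an approximation): the base score is roughly (RSS + page tables + swap) / total memory * 1000, and oom_score_adj is added on top. A user process using 50% of the workspace's memory scores about 500 and stays there with the default adjustment of 0, while a supervisor or IDE process using 20% with an adjustment of -900 ends up clamped near the minimum, so the OOM killer picks the user process first.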

@utam0k
Contributor

utam0k commented Oct 12, 2022

@Furisto PSI is a very interesting technology to me, but I'm concerned that an oomd-style (PSI-based) approach might lead to complications for our product. WDYT? I think simple is better. I think all approaches are effective in their own way. Excellent. As for cgroup v2, can't memory QoS deal with this to some extent?

But this issue hasn't been put into the scheduled column. Why not list the methods in order of cost-effectiveness first, and then try them one at a time to examine their effectiveness? How about trying oom_score_adj first? The oom_score_adj approach can use the same mechanism as adjusting a nice value. That is why I recommend oom_score_adj.

@utam0k
Contributor

utam0k commented Oct 13, 2022

I think we should socialize this in an RFC before attempting it. WDYT, @Furisto?

@Furisto
Member

Furisto commented Oct 13, 2022

@utam0k @kylos101 See https://www.notion.so/gitpod/Avoid-hitting-workspace-memory-limits-1ab8cd8b2d4d4f248c4dd8b0dc6d34f2

@utam0k
Contributor

utam0k commented Oct 14, 2022

@Furisto Thanks ❤️

@kylos101
Contributor Author

Here's an example of a workspace getting OOM-killed on SaaS with cgroup v2. 🤔

@Furisto @utam0k may I ask you to update the RFC accordingly?

@kylos101
Contributor Author

kylos101 commented Nov 2, 2022

@Furisto @utam0k let's plan to pick up the related RFC for this late this week / early next week. The first thing to do is assess the frequency of workspace OOM kills on cgroup v2.

@utam0k
Contributor

utam0k commented Nov 7, 2022

@kylos101 @Furisto FYI: it happened four times in the last 7 days on gitpod.io
https://cloudlogging.app.goo.gl/YrSpgMoxEVY1eem26

@Furisto Furisto moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Nov 21, 2022
@Furisto
Member

Furisto commented Nov 22, 2022

This has not been deployed with gen77 (probably due to the way it was deployed). We need to wait for gen78.

@kylos101 kylos101 removed the blocked label Dec 19, 2022
@Furisto Furisto moved this from In Validation to Done in 🌌 Workspace Team Jan 4, 2023