
[ws-daemon] avoid hitting workspace memory limits #13254

Closed
kylos101 opened this issue Sep 23, 2022 · 23 comments · Fixed by #14507
@kylos101
Contributor

kylos101 commented Sep 23, 2022

Is your feature request related to a problem? Please describe

Sometimes the workspace container gets OOM-killed because high memory utilization on the node triggers the kernel's OOM killer.

Thanks, @aledbf , for the idea!

@csweichel @Furisto @utam0k wdyt? 🤔 @atduarte for awareness.

Describe the behaviour you'd like

Wrap the processes inside the workspace in a nested cgroup with a slightly lower memory limit (for example, 5% lower), so that a process inside the pod is killed instead of the whole workspace container.

Describe alternatives you've considered

n/a

Additional context

Refer to this internal incident for background.


@utam0k
Contributor

utam0k commented Sep 26, 2022

@kylos101 This problem happens with cgroup v2 (SaaS), right? I'd like to know whether we have to consider cgroup v1 as well.

@kylos101
Contributor Author

@utam0k it happens with cgroup v1 on self-hosted; not sure about SaaS and cgroup v2.

@utam0k
Contributor

utam0k commented Sep 26, 2022

@Furisto How about setting memory.high = memory.max * 0.95? WDYT? This setting will probably affect #11156, though. Last time it was supposed to be 80% (the kubelet default), so 95% should be better than that.
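
A minimal sketch of what that could look like for cgroup v2, assuming ws-daemon already knows the workspace's cgroup path; the helper name and the ratio are illustrative, not a decided design:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// setMemoryHigh sets memory.high to ratio * memory.max for a cgroup v2 group.
// cgroupPath is the absolute path of the workspace cgroup under /sys/fs/cgroup.
func setMemoryHigh(cgroupPath string, ratio float64) error {
	raw, err := os.ReadFile(filepath.Join(cgroupPath, "memory.max"))
	if err != nil {
		return err
	}
	limit := strings.TrimSpace(string(raw))
	if limit == "max" {
		// No hard limit configured, nothing to derive memory.high from.
		return nil
	}
	max, err := strconv.ParseInt(limit, 10, 64)
	if err != nil {
		return err
	}
	high := int64(float64(max) * ratio)
	return os.WriteFile(filepath.Join(cgroupPath, "memory.high"),
		[]byte(fmt.Sprintf("%d", high)), 0644)
}

With memory.high set below memory.max, the kernel throttles and reclaims the cgroup aggressively before the hard limit is reached, which should make container-level OOM kills less likely.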

@utam0k
Contributor

utam0k commented Sep 26, 2022

it happens with cgroup v1 on self-hosted; not sure about SaaS and cgroup v2.

Oops, that was unexpected. Thanks.

@utam0k
Contributor

utam0k commented Sep 26, 2022

cgroup v1 can't support memory QoS. Currently we avoid OOM by clearing the cache periodically, but this will not be enough for self-hosted customers. The easiest workaround is to upgrade to cgroup v2, but I'll give it some thought.
https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cgroup/plugin_cachereclaim.go

@utam0k
Contributor

utam0k commented Sep 26, 2022

I'll have to experiment a bit more to fully understand the behavior, but what about having a two-level cgroup?
1st level: the container hard limit set by kubelet
2nd level: a limit slightly below the hard limit (e.g. 95% of the memory limit) for the workspace processes
Then, if all workspace processes belong to this second level and the supervisor is not selected for OOM (e.g., via its nice value), it may be possible to prevent the entire workspace from going down. A rough sketch of the second level is below.
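
A minimal sketch of that second level for cgroup v1 (the memory controller used on self-hosted), assuming ws-daemon already knows the container's cgroup path; the "workspace" subgroup name and the 95% factor are illustrative, not decided values:

package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// createWorkspaceSubgroup creates a nested memory cgroup (v1) under the
// container's cgroup and gives it 95% of the parent's hard limit.
func createWorkspaceSubgroup(parent string) (string, error) {
	raw, err := os.ReadFile(filepath.Join(parent, "memory.limit_in_bytes"))
	if err != nil {
		return "", err
	}
	parentLimit, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return "", err
	}

	child := filepath.Join(parent, "workspace")
	if err := os.MkdirAll(child, 0755); err != nil {
		return "", err
	}
	// Divide first to avoid overflow when the parent limit is effectively unlimited.
	limit := parentLimit / 100 * 95
	err = os.WriteFile(filepath.Join(child, "memory.limit_in_bytes"),
		[]byte(strconv.FormatInt(limit, 10)), 0644)
	return child, err
}

User processes would then be attached to the child group by writing their PIDs to <child>/cgroup.procs, while workspacekit and supervisor stay in the parent.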

@utam0k
Contributor

utam0k commented Sep 26, 2022

In cgroup v2 this is a bit different, but there are already two levels of cgroups there.

@utam0k
Contributor

utam0k commented Sep 26, 2022

It might be possible to make this second-level value user-configurable by giving the user the cgroup namespace; as far as I know there is no reason not to use the cgroup namespace with cgroup v1, so it should be possible. Well, this is an optional feature.

@Furisto
Member

Furisto commented Oct 10, 2022

Wrap the processes inside the workspace in a nested cgroup with a slightly lower memory limit (for example, 5% lower), so that a process inside the pod is killed instead of the whole workspace container.

@kylos101 Let me restate this to ensure that we are on the same page: We currently have (slightly simplified)

workspacekit ring 0
  workspacekit ring 1
    supervisor
      ide
        user process 1
        user process 2

What you would like is for the user processes to run inside a different cgroup than workspacekit, supervisor and ide. Is that correct?

@jimmybrancaccio

As a side note, I am asking users who report this in support tickets to run the following in a workspace and then send me the created log file:

for i in $(seq 1 60); do date | tee -a memory.log; free --mega | tee -a memory.log; echo '' >> memory.log; echo '' >> memory.log; sleep 60; done

@kylos101
Contributor Author

kylos101 commented Oct 11, 2022

What you would like is for the user processes to run inside a different cgroup than workspacekit, supervisor and ide. Is that correct?

@Furisto Potentially? What would the user experience be like if we did that? For example, would user processes be OOM-killed within the workspace, making it less likely for the workspace itself to be OOM-killed? Assuming the IDE and extension memory usage remains low.

@utam0k
Contributor

utam0k commented Oct 12, 2022

@Furisto Potentially? What would the user experience be like if we did that? For example, would user processes be OOM-killed within the workspace, making it less likely for the workspace itself to be OOM-killed? Assuming the IDE and extension memory usage remains low.

I expect the user's process to be the one OOM-killed because the memory hard limit of the user's cgroup is smaller. The supervisor should survive and the workspace should stay up.

@utam0k
Contributor

utam0k commented Oct 12, 2022

Another way is to use eventfd. This gives us the ability to detect OOM events in a cgroup and handle them in userland, but we can't rely on the kernel OOM killer in that case. I haven't tried it, but it should be possible with memory.oom_control and cgroup.event_control; I can't be sure it really works without spending a bit of time on it. In other words, ws-daemon would pick a process and kill it before the workspace itself OOMs. Maybe this is more flexible. A rough sketch is below.
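
A rough sketch of that cgroup v1 mechanism, assuming golang.org/x/sys/unix and keeping error handling minimal; the helper name is made up for illustration:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// watchOOM registers for OOM notifications on a cgroup v1 memory cgroup and
// invokes onOOM every time the kernel reports an OOM event in that group.
func watchOOM(cgroupPath string, onOOM func()) error {
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}

	oomControl, err := os.Open(filepath.Join(cgroupPath, "memory.oom_control"))
	if err != nil {
		return err
	}

	// Ask the cgroup to signal the eventfd on OOM events:
	// "<eventfd> <fd of memory.oom_control>" written to cgroup.event_control.
	cfg := fmt.Sprintf("%d %d", efd, oomControl.Fd())
	if err := os.WriteFile(filepath.Join(cgroupPath, "cgroup.event_control"), []byte(cfg), 0600); err != nil {
		return err
	}

	go func() {
		// Keep memory.oom_control open while listening; closing it deregisters the event.
		defer oomControl.Close()
		buf := make([]byte, 8)
		for {
			if _, err := unix.Read(efd, buf); err != nil {
				return
			}
			onOOM()
		}
	}()
	return nil
}

If ws-daemon should do the killing itself, the kernel OOM killer for the cgroup could additionally be disabled by writing 1 to memory.oom_control, but as noted, we then fully own the recovery path.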

@Furisto
Member

Furisto commented Oct 12, 2022

Another way is to use eventfd...

I have been thinking in the same direction; specifically, I was inspired by what Facebook has done with oomd. Oomd would not work for self-hosted as it only supports cgroup v2, but the principle is interesting. We could use oomd for SaaS and write our own implementation for cgroup v1, or just use our own implementation for both.

Another option that could be simpler to implement and would work for both cgroup v1 and v2 is to modify the oom_score_adj value. If we adjust the OOM score of the IDE, we would have to keep doing this in a goroutine, I think, since child processes inherit the value of their parent, but that is not too bad. I will spend some time prototyping this approach (rough sketch below).
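
Not the actual prototype, just a minimal sketch of the idea; the helper names, the -900 value and the 10s refresh interval are assumptions (the refresh loop exists because newly spawned children inherit their parent's value):

package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"time"
)

// adjustOOMScore writes an oom_score_adj value for a process. Negative values
// make the OOM killer less likely to pick it; +1000 makes it the preferred victim.
func adjustOOMScore(pid int, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(adj)), 0644)
}

// protectProcesses periodically re-applies a low score to the given PIDs
// (e.g. supervisor and IDE) so user workloads are chosen first on OOM.
func protectProcesses(ctx context.Context, pids func() []int) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, pid := range pids() {
				_ = adjustOOMScore(pid, -900)
			}
		}
	}
}

The refresh loop is the goroutine mentioned above; a more robust variant would react to process-start events instead of polling.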

@Furisto
Member

Furisto commented Oct 12, 2022

Some more info on the second approach:

The badness heuristic is used to determine which process should be killed in out-of-memory situations by the OOM killer. The result of this calculation is surfaced in /proc/[pid]/oom_score for each process. A higher score means that a process is more likely to be killed.

The score can be adjusted (provided the writing process has CAP_SYS_RESOURCE) by writing to either /proc/[pid]/oom_adj (deprecated) or the newer /proc/[pid]/oom_score_adj. Values between +1000 and -1000 can be written to /proc/[pid]/oom_score_adj. A value of -1000 means that the process will never be killed by the OOM killer.

Whenever an out-of-memory situation occurs, out_of_memory is called. Depending on how the condition was caused (memory is scarce either system-wide or just in the cgroup), either all processes or the processes in the cgroup and its child cgroups are scanned, and the badness score is calculated for each process. Some tasks, like kernel tasks and the init process, are exempt and cannot be killed.

The badness heuristic finds the task with the highest memory consumption (as measured by the task's resident set size (RSS), page table and swap space usage) and then modifies that score with oom_score_adj. The resident set size is the memory consumed by file memory, anonymous memory and shmem (tmpfs memory, SysV shared memory, POSIX shared memory).
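
As a rough worked example (the exact kernel formula differs between versions, so treat the numbers as an approximation): the base score is roughly (RSS + page tables + swap) / total memory * 1000, and oom_score_adj is added on top. A user process using 50% of the workspace's memory scores about 500 and stays there with the default adjustment of 0, while a supervisor or IDE process using 20% with an adjustment of -900 ends up clamped near the minimum, so the OOM killer picks the user process first.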

@utam0k
Contributor

utam0k commented Oct 12, 2022

@Furisto PSI is a very interesting technology to me, but I'm concerned that an oomd-style (PSI-based) approach might lead to complications for our product. WDYT? I think simple is better. I think all approaches are effective in their own way. Excellent. As for cgroup v2, can't memory QoS deal with this to some extent?

But this issue hasn't been put into the scheduled column. Why not list the methods in order of cost-effectiveness first, and then try them one at a time to examine their effectiveness? How about trying oom_score_adj first? The oom_score_adj approach can use the same mechanism as adjusting a nice value. That is why I recommend oom_score_adj.

@utam0k
Contributor

utam0k commented Oct 13, 2022

I think we should socialize this in an RFC before attempting it. WDYT, @Furisto?

@Furisto
Member

Furisto commented Oct 13, 2022

@utam0k @kylos101 See https://www.notion.so/gitpod/Avoid-hitting-workspace-memory-limits-1ab8cd8b2d4d4f248c4dd8b0dc6d34f2

@utam0k
Contributor

utam0k commented Oct 14, 2022

@Furisto Thanks ❤️

@kylos101
Contributor Author

Here's an example of a workspace getting OOM-killed on SaaS with cgroup v2. 🤔

@Furisto @utam0k may I ask you to update the RFC accordingly?

@kylos101
Contributor Author

kylos101 commented Nov 2, 2022

@Furisto @utam0k let's plan to pick up the related RFC for this late this week / early next week. The first thing to do is assess the frequency of workspace OOM kills on cgroup v2.

@utam0k
Contributor

utam0k commented Nov 7, 2022

@kylos101 @Furisto FYI: it happened four times in the last 7 days on gitpod.io
https://cloudlogging.app.goo.gl/YrSpgMoxEVY1eem26

@Furisto Furisto moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Nov 21, 2022
@Furisto
Member

Furisto commented Nov 22, 2022

This has not been deployed with gen77 (probably due to the way it was deployed). We need to wait for gen78.

@kylos101 kylos101 removed the blocked label Dec 19, 2022
@Furisto Furisto moved this from In Validation to Done in 🌌 Workspace Team Jan 4, 2023