[RFE] Prevent the OOM killer to hit critical services #1427

Open
pothos opened this issue Apr 16, 2024 · 3 comments
Labels
kind/feature A feature request

Comments

@pothos
Member

pothos commented Apr 16, 2024

Current situation

When running low on memory, Flatcar currently relies on the kernel's OOM killer to kill processes. Flatcar does not make use of systemd-oomd yet. When the kernel kills processes, it can hit critical system services.

Impact

Hitting critical system services can render the system unresponsive as observed by @jepio.

Ideal future situation

Instead of killing processes as a last resort, we can use systemd-oomd to evaluate cgroup memory usage and terminate whole cgroups rather than single processes, and do so earlier than the kernel would, to ensure that the system stays responsive. Terminating whole cgroups means that the action is more coordinated and impactful than killing random child or parent processes. Using the cgroup memory accounting means that the termination is more likely to hit something that is actually responsible for the memory shortage than the kernel OOM killer's choice would be.

To prevent both the kernel OOM killer and systemd-oomd from hitting critical services, one can set OOMScoreAdjust= and MemoryMin=.
To steer systemd-oomd towards killing a certain unit, one can set ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
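
As a rough illustration of the protection settings above, a drop-in for a single critical unit could look like the following. The unit name (sshd.service), the file path, and both values are assumptions picked for illustration, not settings that Flatcar ships.

```ini
# /etc/systemd/system/sshd.service.d/10-oom-protect.conf  (hypothetical unit and path)
[Service]
# Valid range is -1000..1000; lower values make the kernel OOM killer
# less likely to select processes of this service.
OOMScoreAdjust=-900
# Memory usage below this amount is protected from reclaim (cgroup v2 only).
# 32M is a placeholder, not a measured value.
MemoryMin=32M
```

The drop-in would take effect after `systemctl daemon-reload` and a restart of the unit. According to systemd.resource-control, memory protection is generally only effective when the ancestor cgroups carry a corresponding allocation as well, so the parent slice may need attention too.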

Implementation options

Enable systemd-oomd by default on Flatcar.
Set OOMScoreAdjust= and MemoryMin= for critical service units.
Set a drop-in for docker .scope units to have ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
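
The third option could be sketched as a prefix drop-in. systemd also searches drop-in directories named after dash-truncated prefixes of a unit name, so a directory called docker-.scope.d should match every docker-….scope unit. The file name is an assumption, and whether the transient scopes created for containers pick the drop-in up on a given Flatcar release would need to be verified.

```ini
# /etc/systemd/system/docker-.scope.d/10-managed-oom.conf  (hypothetical file name)
[Scope]
# Opt these cgroups in to systemd-oomd's swap-based kill policy.
ManagedOOMSwap=kill
# Opt these cgroups in to systemd-oomd's memory-pressure-based kill policy.
ManagedOOMMemoryPressure=kill
```

Both settings only have an effect while systemd-oomd is running, which ties this option to the first one in the list.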

Additional information

Docker containers run under docker-….scope, which is part of system.slice. The same is true for other user-defined workloads that don't spawn new cgroups directly under the root slice. Therefore, setting protections for the whole system slice is probably too broad, and we would really have to identify which units we need to keep running and maintain this "allow list" for as long as the upstream units don't already set OOMScoreAdjust= and MemoryMin=.

@till

till commented Apr 16, 2024

We move workloads into a slice to avoid them breaking the system.

Been doing it for a couple years atp, never got to having crashes of Flatcar/OS components.

@jepio
Member

jepio commented Apr 17, 2024

@till can you share the details of your config? we might draw inspiration from that

@till

till commented Apr 17, 2024

@jepio We do this for docker currently, so we configure cgroup-parent in /etc/docker/daemon.json.

The slice itself looks similar to this:

# https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
[Slice]
CPUAccounting=yes
CPUQuota={{ cpu_quota_percent }}%
MemoryAccounting=yes
# Systemd > 231 (ignored for older versions)
MemoryHigh={{ memory_high_percent }}%
MemoryMax={{ memory_max_percent }}%
MemorySwapMax=0
# Systemd 219, as on CoreOS7
MemoryLimit={{ memory_limit_mb }}M

[Unit]
# Ordering dependencies belong in [Unit]; Before= is not an [Install] setting
Before=docker.service
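
For readers who want to try this, here is a minimal sketch of how the template might look once rendered. The slice name (workload.slice), the description, and every value are hypothetical placeholders rather than the configuration actually in use; the legacy MemoryLimit= line is left out on the assumption that the host uses cgroup v2.

```ini
# /etc/systemd/system/workload.slice  (hypothetical name; all values are placeholders)
[Unit]
Description=Slice for container workloads
Before=docker.service

[Slice]
CPUAccounting=yes
# 400% corresponds to four full CPUs.
CPUQuota=400%
MemoryAccounting=yes
# Throttle and reclaim above 80% of RAM, hard limit at 90%.
MemoryHigh=80%
MemoryMax=90%
# Keep the workloads out of swap entirely.
MemorySwapMax=0
```

The cgroup-parent entry in /etc/docker/daemon.json would then name this slice, assuming Docker runs with the systemd cgroup driver.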


3 participants