add KEP for cgroups v2 support #1370

Merged (1 commit) on Feb 26, 2020
241 changes: 241 additions & 0 deletions keps/sig-node/20191118-cgroups-v2.md
@@ -0,0 +1,241 @@
---
title: cgroups v2
authors:
- "@giuseppe"
owning-sig: sig-node
participating-sigs:
- sig-architecture
reviewers:
- "@yujuhong"
- "@dchen1107"
- "@derekwaynecarr"
approvers:
- "@yujuhong"
- "@dchen1107"
- "@derekwaynecarr"
editor: Giuseppe Scrivano
creation-date: 2019-11-18
last-updated: 2019-11-18
status: implementable
see-also:
replaces:
superseded-by:
---

# Cgroups v2

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [User Stories](#user-stories)
- [Implementation Details](#implementation-details)
- [Proposal](#proposal)
- [Dependencies on OCI and container runtimes](#dependencies-on-oci-and-container-runtimes)
- [Current status of dependencies](#current-status-of-dependencies)
- [Current cgroups usage and the equivalent in cgroups v2](#current-cgroups-usage-and-the-equivalent-in-cgroups-v2)
- [cgroup namespace](#cgroup-namespace)
- [Phase 1: Convert from cgroups v1 settings to v2](#phase-1-convert-from-cgroups-v1-settings-to-v2)
- [Phase 2: Use cgroups v2 throughout the stack](#phase-2-use-cgroups-v2-throughout-the-stack)
- [Risk and Mitigations](#risk-and-mitigations)
- [Graduation Criteria](#graduation-criteria)
<!-- /toc -->

## Summary

A proposal to add support for cgroups v2 to Kubernetes.

## Motivation

The new kernel cgroups v2 API was declared stable more than two years
ago. Newer kernel features such as PSI depend on cgroups v2, and
cgroups v1 will eventually become obsolete in favor of cgroups v2.
Some distros already use cgroups v2 by default, which prevents
Kubernetes from running on them because it currently requires
cgroups v1.

## Goals

This proposal aims to:

* Add support for cgroups v2 to the Kubelet

## Non-Goals

* Expose new cgroup2-only features
* Dockershim
* Plugins support

## User Stories

* The Kubelet can run on a host using either cgroups v1 or v2.
* Have feature parity between cgroups v2 and v1.

## Implementation Details

## Proposal

The proposal is to implement cgroups v2 in two different phases.

The first phase ensures that any configuration file designed for
cgroups v1 will continue to work on cgroups v2.

The second phase requires changes through the entire stack, including
the OCI runtime specifications.

At startup the Kubelet detects what hierarchy the system is using. It
checks the file system type for `/sys/fs/cgroup` (the equivalent of
`stat -f --format '%T' /sys/fs/cgroup`). If the type is `cgroup2fs`
then the Kubelet will use only cgroups v2 during all its execution.
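
A minimal sketch of this check, written in Python rather than the Kubelet's Go. Instead of calling `statfs(2)` as `stat -f` does, it parses a mount table in the `/proc/self/mounts` format, where a cgroups v2 mount is reported with the type `cgroup2` (the name `stat -f` prints is `cgroup2fs`). The function names are hypothetical and this is not the Kubelet's implementation.

```python
from typing import Optional


def mount_fs_type(mounts_text: str, mount_point: str = "/sys/fs/cgroup") -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None if absent.

    mounts_text uses the /proc/self/mounts format:
    <device> <mount point> <fs type> <options> <dump> <pass>
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            # A later entry shadows an earlier one mounted at the same path.
            fs_type = fields[2]
    return fs_type


def uses_cgroup_v2(mounts_text: str) -> bool:
    """True when /sys/fs/cgroup itself is a cgroup2 mount (unified hierarchy)."""
    return mount_fs_type(mounts_text) == "cgroup2"
```

On a Linux host the function can be fed the contents of `/proc/self/mounts`. On a cgroups v1 host `/sys/fs/cgroup` is typically a `tmpfs` with one mount per controller below it, so the check returns `False`.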

This proposal does not aim to deprecate cgroups v1, which must
still be supported throughout the stack.

Device plugins that require v2 enablement are out of scope for
this proposal.

### Dependencies on OCI and container runtimes

In order to support features only available in cgroups v2 but not in
cgroups v1, the OCI runtime specs must be changed.

New features that are not present in cgroups v1 are out of scope
for this proposal.

The dockershim implementation embedded in the Kubelet won't be
supported on cgroup v2.

### Current status of dependencies

- CRI-O+crun: support cgroups v2

- runc support for cgroups v2 is work in progress [current status](#current-cgroups-usage-and-the-equivalent-in-cgroups-v2)

- containerd: [https://github.com/containerd/containerd/issues/3726](https://github.com/containerd/containerd/issues/3726)

- Moby: [https://github.com/moby/moby/pull/40174](https://github.com/moby/moby/pull/40174)

- OCI runtime spec: TODO

- cAdvisor already supports cgroups v2 ([https://github.com/google/cadvisor/pull/2309](https://github.com/google/cadvisor/pull/2309))

## Current cgroups usage and the equivalent in cgroups v2

|Kubernetes cgroups v1|Kubernetes cgroups v2 behavior|
|---|---|
|CPU stats for Horizontal Pod Autoscaling|No .percpu cpuacct stats.|
|CPU pinning based on integral cores|Cpuset controller available|
|Memory limits|Not changed, different naming|
|PIDs limits|Not changed, same naming|
|hugetlb|Added to linux-next, targeting Linux 5.6|

### cgroup namespace

A cgroup namespace restricts the view on the cgroups. When
`unshare(CLONE_NEWCGROUP)` is done, the current cgroup the process
resides in becomes the root. Other cgroups won't be visible from the
new namespace. It was not enabled by default on cgroup v1 systems as
older kernels lacked support for it.
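
One way to observe the effect is through `/proc/self/cgroup`, which on a cgroups v2 host contains a single `0::<path>` entry expressed relative to the root of the reader's cgroup namespace. The sketch below only parses that file format; the function name and the sample paths are made up for illustration.

```python
from typing import Optional


def unified_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroups v2 path from /proc/<pid>/cgroup content, or None.

    Each line has the form <hierarchy-id>:<controllers>:<path>. The cgroups v2
    entry uses hierarchy id 0 and an empty controller list.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


# View from the host cgroup namespace (hypothetical path).
host_view = "0::/kubepods/pod1234/container5678\n"
# View of the same process from its own new cgroup namespace:
# the cgroup it resided in at unshare time is now the root.
namespaced_view = "0::/\n"

print(unified_cgroup_path(host_view))
print(unified_cgroup_path(namespaced_view))
```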

Privileged pods will still use the host cgroup namespace so as to have
visibility on all the other cgroups.

Review comment (Member):

This seems inconsistent with other namespaces. On cgroup v2 systems, the host cgroup namespace should be enabled only if PodSpec.HostCgroup is set to true.

### Phase 1: Convert from cgroups v1 settings to v2

We can convert the values passed by Kubernetes from cgroups v1 to
cgroups v2 so Kubernetes users don’t have to change what they specify
in their manifests.

crun has implemented the conversion as follows:

**Memory controller**

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| limit | memory.max | y = x ||
| swap | memory.swap.max | y = x ||
| reservation | memory.low | y = x ||

**PIDs controller**

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| limit | pids.max | y = x ||

**CPU controller**

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| shares | cpu.weight | y = (1 + ((x - 2) * 9999) / 262142) | convert from [2-262144] to [1-10000]|
| period | cpu.max | y = x| period and quota are written together|
| quota | cpu.max | y = x| period and quota are written together|
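
The CPU conversions in the table above can be sketched as follows. This is an illustration in Python, not crun's code: it assumes the division in the formula is integer division (as it would be in C), rejects out-of-range shares instead of clamping them, and treats a non-positive quota as "no limit". The function names are hypothetical.

```python
def cpu_shares_to_weight(shares: int) -> int:
    """Map cgroups v1 cpu.shares in [2, 262144] to cgroups v2 cpu.weight in [1, 10000]."""
    if not 2 <= shares <= 262144:
        # Assumption of this sketch: reject rather than clamp invalid input.
        raise ValueError(f"cpu shares out of range: {shares}")
    return 1 + ((shares - 2) * 9999) // 262142


def cpu_max_line(quota: int, period: int) -> str:
    """Build the single line written to cpu.max, "<quota> <period>".

    Period and quota share one file in cgroups v2, so they are written together.
    A non-positive quota is treated as unlimited (an assumption of this sketch),
    which cgroups v2 spells as the literal string "max".
    """
    quota_field = str(quota) if quota > 0 else "max"
    return f"{quota_field} {period}"
```

With these definitions the endpoints map as the table states (2 to 1, 262144 to 10000), and 1024 shares map to a weight of 39.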

**blkio controller**

Review comment:

Any Kubernetes versions and plans to support the blkio controller?

Reply:

kubernetes/enhancements#1907 the KEP is WIP.

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| weight | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000]|
| weight_device | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000]|
|rbps|io.max|y=x||
|wbps|io.max|y=x||
|riops|io.max|y=x||
|wiops|io.max|y=x||
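
A sketch of the blkio conversions in the table above, again assuming integer division and using hypothetical function names. The `io.max` line follows the kernel's `MAJ:MIN key=value ...` format; throttle values are copied unchanged and unset keys are simply omitted.

```python
from typing import Optional


def blkio_weight_to_bfq_weight(weight: int) -> int:
    """Map a cgroups v1 blkio weight in [10, 1000] linearly to [1, 10000]."""
    if not 10 <= weight <= 1000:
        # Assumption of this sketch: reject rather than clamp invalid input.
        raise ValueError(f"blkio weight out of range: {weight}")
    return 1 + (weight - 10) * 9999 // 990


def io_max_line(major: int, minor: int,
                rbps: Optional[int] = None, wbps: Optional[int] = None,
                riops: Optional[int] = None, wiops: Optional[int] = None) -> str:
    """Build one io.max line for a block device; values are copied as-is (y = x)."""
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    fields = [f"{key}={value}" for key, value in limits.items() if value is not None]
    return " ".join([f"{major}:{minor}"] + fields)
```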

**cpuset controller**

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| cpus | cpuset.cpus | y = x ||
| mems | cpuset.mems | y = x ||

**hugetlb controller**

| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| <PAGE_SIZE>.limit_in_bytes | hugetlb.<PAGE_SIZE>.max | y = x ||
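
For the settings whose conversion is `y = x` (memory, PIDs, cpuset, hugetlb), the translation reduces to choosing the cgroups v2 file name. The sketch below assumes a simplified input dictionary grouped by the controller names used in the tables; that layout and the function name are illustrative and do not mirror the OCI `config.json` schema.

```python
from typing import Dict, Mapping

# (controller, OCI field) -> cgroups v2 file, for values copied unchanged.
IDENTITY_FILES = {
    ("memory", "limit"): "memory.max",
    ("memory", "swap"): "memory.swap.max",
    ("memory", "reservation"): "memory.low",
    ("pids", "limit"): "pids.max",
    ("cpuset", "cpus"): "cpuset.cpus",
    ("cpuset", "mems"): "cpuset.mems",
}


def v2_file_contents(resources: Mapping[str, Mapping[str, object]]) -> Dict[str, str]:
    """Return {cgroups v2 file name: content} for the identity-mapped settings."""
    files: Dict[str, str] = {}
    for (controller, field), filename in IDENTITY_FILES.items():
        value = resources.get(controller, {}).get(field)
        if value is not None:
            files[filename] = str(value)
    # hugetlb limits are keyed by page size, e.g. "2MB" -> hugetlb.2MB.max.
    for page_size, limit in resources.get("hugetlb", {}).items():
        files[f"hugetlb.{page_size}.max"] = str(limit)
    return files
```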

With this approach cAdvisor would have to read back values from
cgroups v2 files (already done).

Review comment (Contributor):

So, with this option, CRI implementations will need to map the cgroup v1 fields to cgroup v2 values, right?

Could you list what each layer needs to do for each option to make it clear?

  • user (pod spec)
  • kubelet
  • CRI implementation
  • OCI runtime

Reply (Member Author):

the changes in the CRI implementation and OCI runtime might be implemented in a different way, deciding where to draw the line.
For CRI-O+crun, most of the logic is in the OCI runtime itself, but that is not the only way to achieve it.
For the Kubelet, the changes in this patch should be enough (at least until we have not hugetlb available): kubernetes/kubernetes#85218

In the second phase though, when cgroup v2 is fully supported through the stack there is need to change both pod specs+CRI.

Reply (Member):

can you clarify why you feel the pod spec needs to change? i see no major reason to change the pod spec or resource representation.

Reply (Member Author):

I was assuming we'd like to expose all/most of the cgroup v2 features.

e.g. the memory controller on cgroup v1 allows to configure:

  • memory.soft_limit_in_bytes
  • memory.limit_in_bytes

while on cgroup v2 we have:

  • memory.high
  • memory.low
  • memory.max
  • memory.min

but that is probably out of the scope as each new feature (if needed) must go through its own KEP?

Reply (Member Author):

@derekwaynecarr are future improvements based on what cgroup 2 offers out of scope for the current KEP?

Reply (Member):

@giuseppe future improvements for what cgroup v2 offers should be out of scope of this kep. i would keep this kep focused on kubelet is tolerant of cgroup v2 host. adding new cgroup v2 specific features to resource model would be a separate enhancement.

To address @yujuhong question, I think we are saying the following:

  • no change required to pod spec
  • kubelet cgroup manager uses v1 or v2 mode based on its introspection of sys/fs on startup
  • kubelet qos and pod level cgroup creation on a v2 host uses the mapping table specified
  • cri implementers that support cgroup v2 hosts must use a similar mapping to ensure that pod bounding cgroup values are consistent with the container cgroup values.
  • oci runtime spec is not required to change as cri implementer provides mapping in the transitional period.

@giuseppe agree with above?

Kubelet PR: [https://github.com/kubernetes/kubernetes/pull/85218](https://github.com/kubernetes/kubernetes/pull/85218)

### Phase 2: Use cgroups v2 throughout the stack

This option means that the values are written directly to cgroups v2
by the runtime. The Kubelet doesn’t do any conversion when setting
these values over the CRI. We will need to add a cgroups v2 specific
LinuxContainerResources to the CRI.

This depends on container runtimes such as runc and crun being able
to write cgroups v2 values directly.

OCI will need support for cgroups v2 and CRI implementations will
write to the cgroups v2 section of the new OCI runtime config.json.

## Risk and Mitigations

Some cgroups v1 features are not available with cgroups v2:

- _cpuacct.usage_percpu_
- network stats from cgroup

Some cgroups v1 controllers, such as _device_, _net_cls_, and
_net_prio_, are not available with the new version. The alternative to
these controllers is to use eBPF.

## Graduation Criteria

- Alpha: Phase 1 completed and basic support for running Kubernetes on
a cgroups v2 host, with e2e test coverage or a plan for the
failing tests.
A good candidate for running cgroups v2 tests is Fedora 31, which has
already switched to cgroups v2 by default.

- Beta: e2e test coverage and performance testing.

- GA: Assuming no negative user feedback based on production
experience, promote after 2 releases in beta.
*TBD* whether phase 2 must be implemented for GA.