KEP-2568: Run control-plane as non-root in kubeadm.

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
    • The PRR was N/A as there are no in-tree changes proposed in this KEP. Please see these Slack discussion threads: Thread 1, Thread 2
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes that the control-plane in kubeadm be run as non-root. If containers run as root, an escape from a container may result in escalation to root on the host. CVE-2019-5736 is an example of a container escape vulnerability that can be mitigated by running containers/pods as non-root.

Motivation

CVE-2019-5736 is an example of a container escape vulnerability that can be mitigated by running containers/pods as non-root.

Running containers as non-root has long been a recommended best practice in Kubernetes, and blogs have been published recommending it. kubeadm, a tool built to provide a "best-practice" path to creating clusters, is an ideal candidate for applying this practice.

Goals

  • Run control-plane components as non-root in kubeadm. More specifically:
    • Run kube-apiserver as non-root in kubeadm.
    • Run kube-controller-manager as non-root in kubeadm.
    • Run kube-scheduler as non-root in kubeadm.
    • Run etcd as non-root in kubeadm.

Non-Goals

  • Run node components as non-root in kubeadm. More specifically:
    • Run kube-proxy as non-root, since it is not a control-plane component.
    • Run kubelet as non-root.
  • Compatibility with user-namespace-enabled environments is not in scope.
  • Setting defaults for SELinux and AppArmor is not in scope. (We may reconsider this in beta.)

Proposal

Here is what we propose to do at a high level:

  1. Run the control-plane components as non-root by assigning a unique UID/GID to the containers in their Pods, using the runAsUser and runAsGroup fields in the securityContext.

  2. Drop all capabilities from the kube-controller-manager, kube-scheduler and etcd Pods. Drop all but CAP_NET_BIND_SERVICE from the kube-apiserver Pod; this allows it to run as non-root while still being able to bind to ports below 1024. This can be achieved using the capabilities field in the securityContext of the containers in the Pods.

  3. Set the seccomp profile to the runtime default; this can be done by setting the seccompProfile field in the securityContext. A sketch of how these settings could be expressed against the Pod API is shown below.
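Below is a minimal sketch, in Go, of how these securityContext settings could be assembled when generating a static Pod manifest. It uses the upstream k8s.io/api/core/v1 types; the helper name and the idea of passing the UID/GID in as parameters are illustrative assumptions, not kubeadm's actual implementation.

package manifests

import (
	corev1 "k8s.io/api/core/v1"
)

func int64Ptr(i int64) *int64 { return &i }
func boolPtr(b bool) *bool    { return &b }

// controlPlaneSecurityContext returns a securityContext that runs the
// container as the given non-root UID/GID, drops all capabilities (optionally
// keeping NET_BIND_SERVICE for kube-apiserver), forbids privilege escalation
// and enables the runtime default seccomp profile.
func controlPlaneSecurityContext(uid, gid int64, needsNetBind bool) *corev1.SecurityContext {
	caps := &corev1.Capabilities{Drop: []corev1.Capability{"ALL"}}
	if needsNetBind {
		// kube-apiserver still needs to bind to ports below 1024.
		caps.Add = []corev1.Capability{"NET_BIND_SERVICE"}
	}
	return &corev1.SecurityContext{
		RunAsUser:                int64Ptr(uid),
		RunAsGroup:               int64Ptr(gid),
		AllowPrivilegeEscalation: boolPtr(false),
		Capabilities:             caps,
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
}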

User Stories (Optional)

Story 1

As a security conscious user I would like to run the kubernetes control-plane as non-root to reduce the risk associated with container escape vulnerabilities in the control-plane.

Story 2

As a kubeadm user, I'd expect the bootstrapper to follow the best security practices when generating static or non-static pod manifests to run control plane components.

Notes/Constraints/Caveats (Optional)

There are some caveats with running the components as non-root around how to assign UID/GIDs to them. This is covered in the Assigning UID and GID section below.

Risks and Mitigations

If we hard-coded the UID and GID, we could end up in a scenario where those IDs are already in use by another process on the machine, which would expose some of the credentials accessible to those IDs to that process. So instead of hard-coding the UID and GID, we plan to use adduser --system or to pick IDs from the appropriate ranges in /etc/login.defs. A sketch of the adduser-based approach follows.
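A minimal sketch of the adduser-based mitigation is below. It assumes the Debian-style adduser and shadow-utils groupadd are available on the host and that the account names follow the convention proposed later in this KEP; the exact mechanism kubeadm ends up using may differ.

package users

import (
	"fmt"
	"os/exec"
)

// ensureSystemUserAndGroup creates a system group and a matching system user
// (e.g. "kubeadm-kas"), letting the OS pick unused IDs from its system ranges.
func ensureSystemUserAndGroup(name string) error {
	if out, err := exec.Command("groupadd", "--system", name).CombinedOutput(); err != nil {
		return fmt.Errorf("groupadd %s failed: %v: %s", name, err, out)
	}
	if out, err := exec.Command("adduser", "--system", "--no-create-home", "--ingroup", name, name).CombinedOutput(); err != nil {
		return fmt.Errorf("adduser %s failed: %v: %s", name, err, out)
	}
	return nil
}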

Design Details

In kubeadm the control-plane components are run as static-pods, i.e. pods directly managed by kubelet. We can use the runAsUser and runAsGroup fields in SecurityContext to run the containers in these Pods with a unique UID and GID and the capabilities field to drop all capabilities other than the ones required.

Assigning UID and GID

There are three options for setting the UID/GID of the control-plane components:

  1. Update the kubeadm API to make the UID/GID configurable by the user. This can be implemented in two ways:

    1. Add fields UID and GID of type int64 to the ControlPlaneComponent struct in https://github.com/kubernetes/kubernetes/blob/854c2cc79f11cfb46499454e7717a86d3214e6b0/cmd/kubeadm/app/apis/kubeadm/types.go#L132, as demonstrated in this PR.
    2. Have the user provide kubeadm with a range of UIDs/GIDs that are safe to use, i.e. not already in use by another user.
  2. Use constant values for the UID/GID: the UID and GID for each of the control-plane components would be set to some predetermined value in https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go.

  3. Create system users and let users override the defaults: Use adduser --system or equivalent to create UIDs in the SYS_UID_MIN - SYS_UID_MAX range and groupadd --system or equivalent to create GIDs in the SYS_GID_MIN - SYS_GID_MAX range. Additionally, if users want to specify their own UIDs or GIDs, we will support that through kubeadm patching.

The author(s) believe that starting out with the safe default of option 3 and allowing the user to override the UID and GID through the kubeadm patching mechanism is the most user-friendly approach, both for users who just want to bootstrap quickly and for users who care about exactly which UIDs and GIDs the control-plane runs as. Further, this feature will be opt-in and hidden behind a feature gate until it graduates to GA.

Choosing the UID between SYS_UID_MIN and SYS_UID_MAX and the GID between SYS_GID_MIN and SYS_GID_MAX adheres to distro conventions; these ranges are defined in /etc/login.defs, as sketched below.
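Below is a rough sketch of how the system ID ranges could be read from /etc/login.defs. The function name and the fallback defaults (101-999, the common shadow-utils values) are assumptions for illustration, not kubeadm's actual code.

package users

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

// systemIDRange returns the values of minKey/maxKey (e.g. SYS_UID_MIN and
// SYS_UID_MAX) from /etc/login.defs, falling back to the given defaults when
// the file or the keys are missing.
func systemIDRange(minKey, maxKey string, defMin, defMax int64) (int64, int64) {
	lo, hi := defMin, defMax
	f, err := os.Open("/etc/login.defs")
	if err != nil {
		return lo, hi
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		switch fields[0] {
		case minKey:
			lo = v
		case maxKey:
			hi = v
		}
	}
	return lo, hi
}

// Example usage: uidMin, uidMax := systemIDRange("SYS_UID_MIN", "SYS_UID_MAX", 101, 999)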

Updating the component manifests

An example of how the kube-scheduler manifest would look is below:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.21.0-beta.0.368_9850bf06b571d5-dirty
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsGroup: 2000 # this value is only an example and is not the id we plan to use.
      runAsUser: 2000  # this value is only an example and is not the id we plan to use.
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

Host Volume Permissions

In the manifest above, the kube-scheduler container mounts the kubeconfig file from the host, so we must grant read permission on that file to the user the container runs as, which in this example is 2000. kubeadm is responsible for bootstrap and creates the manifests of the control-plane components, which means it knows all the volume mounts for each container.

There are two ways we can set the permissions:

  1. Use initContainers: We could run an initContainer that runs as root and mounts all the hostPath volumes that the control-plane component container mounts. It then calls chown uid:gid on each of the mounted files/directories. See the example for kube-scheduler below:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  initContainers:
  - name: kube-scheduler-init
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: false  # the mount must be writable for chown to succeed
    image: k8s.gcr.io/debian-base:buster-v1.4.0
    command:
      - /bin/sh
      - -c
      - chown 2000:2000 /etc/kubernetes/scheduler.conf
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.21.0-beta.0.368_9850bf06b571d5-dirty
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsGroup: 2000  # this value is only an example and is not the id we plan to use.
      runAsUser: 2000  # this value is only an example and is not the id we plan to use.
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

The initContainer kube-scheduler-init will run before the kube-scheduler container and will set up the permissions of the files that the kube-scheduler container needs. Since initContainers are short-lived and exit once they are done, the risk from running them as root on the host is low.

  2. kubeadm sets the permissions: In this approach kubeadm would be responsible for setting the file permissions when it creates the files. It will call os.Chown to set the owner of the files. This approach is demonstrated in PR kubernetes/kubernetes#99753.

The author(s) believe that it is better for kubeadm to set the permissions, because adding an initContainer would require pulling the debian-base image (or similar) just to run commands that change file ownership, which is something that can easily be done in Go. It also raises the question of which initContainer would be responsible for files shared between kube-controller-manager and kube-apiserver. Since kubeadm creates these files, it is best placed to apply the file permissions; a sketch of this approach follows.
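A minimal sketch of this option is below. The helper name is hypothetical, and the 0600 mode mirrors the kubeconfig entries in the permissions table later in this document; in the proposal the UID/GID would come from the users kubeadm creates rather than being passed in literally.

package files

import "os"

// chownAndRestrict hands a file kubeadm has just written over to the
// component's UID/GID and makes it readable only by that owner, e.g.
// /etc/kubernetes/scheduler.conf for the kube-scheduler user.
func chownAndRestrict(path string, uid, gid int) error {
	if err := os.Chown(path, uid, gid); err != nil {
		return err
	}
	return os.Chmod(path, 0o600)
}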

Shared files

Certain hostPath volume mounts are shared between kube-apiserver and kube-controller-manager; some examples are:

  • /etc/kubernetes/pki/ca.crt
  • /etc/kubernetes/pki/sa.key

We propose that files shared by kube-controller-manager and kube-apiserver be made readable by a particular GID, and that the supplementalGroups field in the PodSecurityContext of the kube-apiserver and kube-controller-manager Pods be set to this GID.

For instance, let's assume that kubeadm grants group 2100 read permission on /etc/kubernetes/pki/ca.crt.

Then we update the kube-apiserver manifest as follows:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 172.17.0.2:6443
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  securityContext:
    supplementalGroups:
    - 2100  # this value is only for an example and is not the id we plan to use.
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --advertise-address=172.17.0.2
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    ... # omitted to save space 
    securityContext:
      runAsUser: 2000  # this value is only an example and is not the id we plan to use.
      runAsGroup: 2000  # this value is only an example and is not the id we plan to use.
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    image: k8s.gcr.io/kube-apiserver:v1.21.0-beta.0.368_9850bf06b571d5-dirty
    ... # omitted to save space
    volumeMounts:
    ... # omitted to save space
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    ... # omitted to save space

Similarly, the kube-controller-manager manifest would look like:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  securityContext:
    supplementalGroups:
    - 2100
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    ...
    # omitted to save space.
    securityContext:
      runAsUser: 2001  # this value is only an example and is not the id we plan to use.
      runAsGroup: 2001 # this value is only an example and is not the id we plan to use.
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    image: k8s.gcr.io/kube-controller-manager:v1.21.0-beta.0.368_9850bf06b571d5-dirty
    ... # omitted to save space.
    volumeMounts:
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    ... # omitted to save space. 

Each of the components will run with a unique UID and GID, and for each component we will create a unique user. For the shared files/resources we will create groups. The naming convention of these users and groups is tabulated below. Note that kubeadm will take exclusive ownership of these users/groups and will raise errors if users/groups with these names already exist but are not in the expected ID range of SYS_UID_MIN-SYS_UID_MAX for users and SYS_GID_MIN-SYS_GID_MAX for groups.

Many of the components need shared access to certificate files. These are not protected by creating a group with read permissions, because certificates are not secrets; protecting them and creating groups for them would not improve our security posture in any way and would only make the change more complicated by adding unnecessary groups. Hence we only propose creating a group with read access for the /etc/kubernetes/pki/sa.key file, which is the only secret shared between kube-apiserver and kube-controller-manager. kubeadm creates all certificate files with mode 0644, so we do not need to modify their owners as they are already world-readable.

| User/Group name | Explanation |
| --- | --- |
| kubeadm-etcd | The UID/GID that we will assign to etcd |
| kubeadm-kas | The UID/GID that we will assign to kube-apiserver |
| kubeadm-kcm | The UID/GID that we will assign to kube-controller-manager |
| kubeadm-ks | The UID/GID that we will assign to kube-scheduler |
| kubeadm-sa-key-readers | The GID we will assign to a group that is allowed to read /etc/kubernetes/pki/sa.key |

Here is a table of all the things that kube-apiserver, kube-controller-manager, kube-scheduler and etcd mount and the permissions that we will set for them.

Files that we care about for this KEP:

| file/directory | Component(s) | Mode | Owner | Group |
| --- | --- | --- | --- | --- |
| /etc/kubernetes/pki/etcd/server.crt | etcd | 644 | kubeadm-etcd | kubeadm-etcd |
| /etc/kubernetes/pki/etcd/server.key | etcd | 600 | kubeadm-etcd | kubeadm-etcd |
| /etc/kubernetes/pki/etcd/peer.crt | etcd | 644 | kubeadm-etcd | kubeadm-etcd |
| /etc/kubernetes/pki/etcd/peer.key | etcd | 600 | kubeadm-etcd | kubeadm-etcd |
| /etc/kubernetes/pki/etcd/ca.crt | etcd, kas | 644 | root | root |
| /var/lib/etcd/ | etcd | 600 | kubeadm-etcd | kubeadm-etcd |
| /etc/kubernetes/pki/ca.crt | kas, kcm | 644 | root | root |
| /etc/kubernetes/pki/apiserver-etcd-client.crt | kas | 644 | root | root |
| /etc/kubernetes/pki/apiserver-etcd-client.key | kas | 600 | kubeadm-kas | kubeadm-kas |
| /etc/kubernetes/pki/apiserver-kubelet-client.crt | kas | 644 | root | root |
| /etc/kubernetes/pki/apiserver-kubelet-client.key | kas | 600 | kubeadm-kas | kubeadm-kas |
| /etc/kubernetes/pki/front-proxy-client.crt | kas | 644 | root | root |
| /etc/kubernetes/pki/front-proxy-client.key | No-one | 600 | root | root |
| /etc/kubernetes/pki/front-proxy-ca.crt | kas, kcm | 644 | root | root |
| /etc/kubernetes/pki/sa.pub | kas | 600 | kubeadm-kas | kubeadm-kas |
| /etc/kubernetes/pki/sa.key | kas, kcm | 640 | | kubeadm-sa-key-readers |
| /etc/kubernetes/pki/apiserver.crt | kas | 644 | root | root |
| /etc/kubernetes/pki/apiserver.key | kas | 600 | kubeadm-kas | kubeadm-kas |
| /etc/kubernetes/pki/ca.key | kcm | 600 | kubeadm-kcm | kubeadm-kcm |
| /etc/kubernetes/controller-manager.conf | kcm | 600 | kubeadm-kcm | kubeadm-kcm |
| /etc/kubernetes/scheduler.conf | ks | 600 | kubeadm-ks | kubeadm-ks |

In addition to the files/directories in the table above, the control-plane components also mount the directories below. We do not have to worry about these, as they are world-readable.

World-readable paths:

  • /usr/local/share/ca-certificates
  • /usr/share/ca-certificates
  • /etc/ssl/certs
  • /etc/ca-certificates

Reusing users and groups

If any of the users/groups defined above already exist and are in the expected ID range of SYS_UID_MIN-SYS_UID_MAX for users and SYS_GID_MIN-SYS_GID_MAX for groups, then kubeadm will reuse them instead of creating new ones. More specifically, kubeadm will reuse the ones that exist and meet the criteria, and will create only the ones it is missing. A sketch of this check follows.
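A sketch of such a reuse check is below, using the standard library os/user package. The function name and the way the range is passed in are illustrative assumptions.

package users

import (
	"fmt"
	"os/user"
	"strconv"
)

// lookupReusableUID returns the UID of an existing user when it can be
// reused, reports found=false when the user has to be created, and errors out
// when the user exists outside the expected SYS_UID_MIN-SYS_UID_MAX range.
func lookupReusableUID(name string, sysUIDMin, sysUIDMax int64) (uid int64, found bool, err error) {
	u, err := user.Lookup(name)
	if err != nil {
		if _, unknown := err.(user.UnknownUserError); unknown {
			return 0, false, nil // user does not exist yet; kubeadm would create it
		}
		return 0, false, err
	}
	uid, err = strconv.ParseInt(u.Uid, 10, 64)
	if err != nil {
		return 0, false, err
	}
	if uid < sysUIDMin || uid > sysUIDMax {
		return 0, false, fmt.Errorf("user %q exists with UID %d outside the expected system range", name, uid)
	}
	return uid, true, nil
}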

Cleaning up users and groups

kubeadm reset tries to remove everything created by kubeadm on the host, and it should do the same for the users and groups created as part of cluster bootstrap; a sketch is shown below.
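Below is a sketch of what that cleanup could look like, shelling out to userdel/groupdel for the user and group names proposed in this KEP. It is illustrative only; kubeadm reset may implement the removal differently.

package users

import (
	"fmt"
	"os/exec"
)

// removeBootstrapAccounts best-effort deletes the users and group created for
// the control-plane components, collecting errors instead of stopping at the
// first failure so that "kubeadm reset" removes as much as it can.
func removeBootstrapAccounts() []error {
	var errs []error
	for _, u := range []string{"kubeadm-etcd", "kubeadm-kas", "kubeadm-kcm", "kubeadm-ks"} {
		if out, err := exec.Command("userdel", u).CombinedOutput(); err != nil {
			errs = append(errs, fmt.Errorf("userdel %s: %v: %s", u, err, out))
		}
	}
	if out, err := exec.Command("groupdel", "kubeadm-sa-key-readers").CombinedOutput(); err != nil {
		errs = append(errs, fmt.Errorf("groupdel kubeadm-sa-key-readers: %v: %s", err, out))
	}
	return errs
}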

Multi OS support

A Windows control plane is out of scope for this proposal for the time being. OS-specific implementations for Linux would be carefully abstracted behind helper utilities in kubeadm so as not to block support for a Windows control plane in the future.

Test Plan

The following functionality needs to be tested:

  1. Creating a cluster with the feature gate enabled
  2. Upgrading a cluster with the feature gate enabled

These tests will be added using the kinder tooling during the Alpha stage.

Graduation Criteria

Alpha -> Beta Graduation

  • All control plane components are running as non-root.
  • All control-plane components have runtime/default seccomp profile.
  • All control-plane components drop all unnecessary capabilities.
  • The feature is tested by the community and feedback is incorporated through changes.
  • e2e tests are running periodically and are green with respect to testing this functionality.
  • kubeadm documentation is updated.

Upgrade / Downgrade Strategy

The flow below assumes that the feature gate to run the control-plane as non-root is enabled.

kubeadm checks the cluster-config to see if the control-plane is already running as non-root. If so, it re-writes the contents of the files/credentials and makes sure that the previously assigned UIDs and GIDs retain the appropriate read/write permissions. The control-plane static-pod manifests do not need to be updated in this case.

If the control-plane was not running as non-root before, kubeadm creates new UIDs and GIDs based on the approach described in the Assigning UID and GID section and updates the cluster-config. When files/credentials are re-written, their owners are set appropriately. The control-plane static-pod manifests do need to be updated to run as non-root in this case.

Version Skew Strategy

kubeadm version X supports deploying Kubernetes control-plane versions X and X-1. Once the feature gate to run the control-plane as non-root is enabled in kubeadm, both the X and X-1 versions of the control-plane will run as non-root. Nothing in the design of this feature is tied to the version of the control-plane.

Production Readiness Review Questionnaire

⚠️ The PRR was N/A as there are no in-tree changes proposed in this KEP. Please see these Slack discussion threads: Thread 1, Thread 2

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Note: the feature gate here is for kubeadm and not the control-plane components.

  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: kubeadmRootlessControlPlane
    • Components depending on the feature gate: kube-apiserver, kube-controller-manager, kube-scheduler and etcd in kubeadm control-plane
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane? No, since it will only take effect when the control-plane is upgraded or created.
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? No, this only affects the control-plane, no change is required on the node(s).
Does enabling the feature change any default behavior?

Yes, it will change the default behavior of kubeadm from running the control-plane components as root to running them as non-root.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, disabling the feature gate in kubeadm and then upgrading the control-plane to the current version should run the components as root again.

What happens if we reenable the feature if it was previously rolled back?

Nothing happens unless the user upgrades or creates a new cluster; if they do, the control-plane components on the upgraded/created cluster will run as non-root.

Are there any tests for feature enablement/disablement?

Yes, we plan to add e2e tests using kinder to test the kubeadm behavior with the feature gate enabled.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

None

Scalability

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

No

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes. During control-plane bootstrap kubeadm will create users/groups for the various control-plane components, which adds a small delay; a failure to create them would cause the bootstrap to fail.

When kubeadm creates files and directories it will also have to set their permissions and owners, so there will be a small increase in control-plane bootstrap time.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Major milestones:

  • Initial draft of KEP created - 2021-03-13
  • Production readiness review - 2021-04-12
  • Production readiness review approved - 2021-04-29
  • KEP marked implementable - 2021-04-28

Drawbacks

None

Alternatives

None

Infrastructure Needed (Optional)

None