- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes that the control plane in `kubeadm` be run as non-root. If containers run as root, an escape from a container may result in escalation to root on the host. CVE-2019-5736 is an example of a container escape vulnerability that can be mitigated by running containers/pods as non-root.
Running containers as non-root has long been a recommended best practice in Kubernetes, and we have published blog posts recommending it. `kubeadm`, a tool built to provide a "best-practice" path to creating clusters, is an ideal candidate for applying this recommendation.
- Run control-plane components as non-root in `kubeadm`. More specifically:
  - Run `kube-apiserver` as non-root in `kubeadm`.
  - Run `kube-controller-manager` as non-root in `kubeadm`.
  - Run `kube-scheduler` as non-root in `kubeadm`.
  - Run `etcd` as non-root in `kubeadm`.
The following are non-goals:
- Run node components as non-root in `kubeadm`. More specifically:
  - Run `kube-proxy` as non-root, since it is not a control-plane component.
  - Run `kubelet` as non-root.
- Compatibility with user namespace enabled environments is not in scope.
- Setting defaults for SELinux and AppArmor is not in scope. (We may reconsider this in beta.)
Here is what we propose to do at a high level:
- Run the control-plane components as non-root by assigning a unique uid/gid to the containers in the Pods, using the `runAsUser` and `runAsGroup` fields in the `securityContext` of the Pods (see the sketch after this list).
- Drop all capabilities from the `kube-controller-manager`, `kube-scheduler`, and `etcd` Pods. Drop all but `cap_net_bind_service` from the `kube-apiserver` Pod; this allows it to run as non-root while still being able to bind to ports < 1024. This can be achieved by using the `capabilities` field in the `securityContext` of the containers in the Pods.
- Set the seccomp profile to `runtime/default` by setting the `seccompProfile` field in the `securityContext`.
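To make these settings concrete, here is a minimal Go sketch that builds such a container `securityContext` using the `k8s.io/api/core/v1` types. The helper name and the way the API server keeps `NET_BIND_SERVICE` (drop `ALL`, then add it back) are illustrative assumptions, not kubeadm's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func int64Ptr(v int64) *int64 { return &v }
func boolPtr(v bool) *bool    { return &v }

// nonRootSecurityContext builds the securityContext described above: run as a
// dedicated uid/gid, drop all capabilities (optionally keeping NET_BIND_SERVICE
// for kube-apiserver), and use the runtime/default seccomp profile.
func nonRootSecurityContext(uid, gid int64, needsNetBind bool) *corev1.SecurityContext {
	sc := &corev1.SecurityContext{
		RunAsUser:                int64Ptr(uid),
		RunAsGroup:               int64Ptr(gid),
		AllowPrivilegeEscalation: boolPtr(false),
		Capabilities:             &corev1.Capabilities{Drop: []corev1.Capability{"ALL"}},
		SeccompProfile:           &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault},
	}
	if needsNetBind {
		// kube-apiserver may need to bind to ports below 1024, which requires
		// NET_BIND_SERVICE when running as non-root.
		sc.Capabilities.Add = []corev1.Capability{"NET_BIND_SERVICE"}
	}
	return sc
}

func main() {
	// Placeholder IDs, not the values kubeadm would use.
	fmt.Printf("%+v\n", nonRootSecurityContext(2000, 2000, true))
}
```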
As a security-conscious user, I would like to run the Kubernetes control plane as non-root to reduce the risk associated with container escape vulnerabilities in the control plane.
As a `kubeadm` user, I'd expect the bootstrapper to follow best security practices when generating static or non-static pod manifests to run control-plane components.
There are some caveats around how to assign UIDs/GIDs to the components when running them as non-root. This is covered in the Assigning UID and GID section below.
If we hard-coded the UID and GID, we could end up in a scenario where those IDs are already in use by another process on the machine, which would expose some of the credentials accessible to that UID and GID to that process. So, instead of hard-coding the UID and GID, we plan to use `adduser --system` or the appropriate ranges from `/etc/login.defs`.
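For illustration, here is a minimal sketch of reading those ranges from `/etc/login.defs`; the helper name and the fallback defaults are assumptions, not kubeadm's actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sysIDRange reads a min/max pair such as SYS_UID_MIN/SYS_UID_MAX from
// /etc/login.defs, falling back to the given defaults when the keys are
// absent or commented out.
func sysIDRange(minKey, maxKey string, defMin, defMax int64) (int64, int64, error) {
	f, err := os.Open("/etc/login.defs")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	lo, hi := defMin, defMax
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		switch fields[0] {
		case minKey:
			lo = v
		case maxKey:
			hi = v
		}
	}
	return lo, hi, s.Err()
}

func main() {
	// 100-999 is a common default system ID range on many distros.
	lo, hi, err := sysIDRange("SYS_UID_MIN", "SYS_UID_MAX", 100, 999)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("system UID range: %d-%d\n", lo, hi)
}
```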
In `kubeadm` the control-plane components are run as static Pods, i.e. Pods directly managed by the kubelet. We can use the `runAsUser` and `runAsGroup` fields in the `securityContext` to run the containers in these Pods with a unique UID and GID, and the `capabilities` field to drop all capabilities other than the ones required.
There are three options for setting the UID/GID of the control-plane components:
1. Update the kubeadm API to make the uid/gid configurable by the user. This can be implemented in two ways:
   - Add `UID` and `GID` fields of type `int64` to the `ControlPlaneComponent` struct in https://github.com/kubernetes/kubernetes/blob/854c2cc79f11cfb46499454e7717a86d3214e6b0/cmd/kubeadm/app/apis/kubeadm/types.go#L132, as demonstrated in this PR.
   - The user provides `kubeadm` with a range of uid/gids that are safe to use, i.e. that are not being used by another user.
2. Use constant values for uid/gid: the `UID` and `GID` for each of the control-plane components would be set to some predetermined value in https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go.
3. Create system users and let users override the defaults: use `adduser --system` or equivalent to create `UID`s in the SYS_UID_MIN - SYS_UID_MAX range, and `groupadd --system` or equivalent to create `GID`s in the SYS_GID_MIN - SYS_GID_MAX range. Additionally, if users want to specify their own `UID`s or `GID`s, we will support that through `kubeadm` patching. (A sketch of this option follows this list.)
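A minimal sketch of option 3: creating a dedicated system user (with a matching group) and reading back the assigned IDs. It shells out to `useradd` and is illustrative only; the flags shown and the error handling are assumptions, not kubeadm's actual implementation:

```go
package main

import (
	"fmt"
	"os/exec"
	"os/user"
	"strconv"
)

// ensureSystemUser creates a system user (with a matching group) named name if
// it does not already exist and returns its uid and gid.
func ensureSystemUser(name string) (int64, int64, error) {
	if _, err := user.Lookup(name); err != nil {
		// --system picks an ID from the SYS_UID_MIN-SYS_UID_MAX range;
		// --no-create-home and --shell keep the account non-interactive.
		out, err := exec.Command("useradd", "--system", "--no-create-home",
			"--shell", "/bin/false", "--user-group", name).CombinedOutput()
		if err != nil {
			return 0, 0, fmt.Errorf("useradd failed: %v: %s", err, out)
		}
	}
	u, err := user.Lookup(name)
	if err != nil {
		return 0, 0, err
	}
	uid, err := strconv.ParseInt(u.Uid, 10, 64)
	if err != nil {
		return 0, 0, err
	}
	gid, err := strconv.ParseInt(u.Gid, 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return uid, gid, nil
}

func main() {
	// "kubeadm-ks" is the kube-scheduler user from the naming table below.
	uid, gid, err := ensureSystemUser("kubeadm-ks")
	if err != nil {
		panic(err)
	}
	fmt.Printf("kubeadm-ks -> uid=%d gid=%d\n", uid, gid)
}
```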
The author(s) believe that starting with the safe default of option 3 and allowing the user to set the UID and GID through the `kubeadm` patching mechanism is more user-friendly, both for users who just want to bootstrap quickly and for users who care about which UIDs and GIDs the control plane runs as. Further, this feature will be opt-in and hidden behind a feature gate until it graduates to GA.
Choosing the UID between SYS_UID_MIN and SYS_UID_MAX and the GID between SYS_GID_MIN and SYS_GID_MAX is in adherence with distro standards:
- For Debian : https://www.debian.org/doc/debian-policy/ch-opersys.html#uid-and-gid-classes
- For Fedora : https://docs.fedoraproject.org/en-US/packaging-guidelines/UsersAndGroups/
An example of what the `kube-scheduler` manifest would look like is below:

```yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-scheduler
tier: control-plane
name: kube-scheduler
namespace: kube-system
spec:
containers:
- command:
- kube-scheduler
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=true
- --port=0
image: k8s.gcr.io/kube-scheduler:v1.21.0-beta.0.368_9850bf06b571d5-dirty
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: kube-scheduler
resources:
requests:
cpu: 100m
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsGroup: 2000 # this value is only an example and is not the id we plan to use.
runAsUser: 2000 # this value is only an example and is not the id we plan to use.
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
volumes:
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
status: {}
```
In the manifest above, the reader can notice that the `kube-scheduler` container mounts the `kubeconfig` file from the host. So we must grant read permission on that file to the user that the container runs as, which in this case is 2000. `kubeadm` is responsible for bootstrap and creates the manifests of the control-plane components, which means that it knows all the volume mounts for the containers.
There are two ways we can set the permissions:
- Use initContainers: we could run an `initContainer` that runs as root and mounts all the hostVolumes that the control-plane component container mounts. It then calls `chown uid:gid` on each of the files/directories that are mounted. See the example for `kube-scheduler` below:

```yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-scheduler
tier: control-plane
name: kube-scheduler
namespace: kube-system
spec:
initContainers:
- name: kube-scheduler-init
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
      readOnly: false
image: k8s.gcr.io/debian-base:buster-v1.4.0
command:
- /bin/sh
- -c
- chown 2000:2000 /etc/kubernetes/scheduler.conf
containers:
- command:
- kube-scheduler
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=true
- --port=0
image: k8s.gcr.io/kube-scheduler:v1.21.0-beta.0.368_9850bf06b571d5-dirty
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: kube-scheduler
resources:
requests:
cpu: 100m
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsGroup: 2000 # this value is only an example and is not the id we plan to use.
runAsUser: 2000 # this value is only an example and is not the id we plan to use.
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
volumes:
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
status: {}
```
The initContainer `kube-scheduler-init` will run before the `kube-scheduler` container and will set up the permissions of the files that the `kube-scheduler` container needs. Since initContainers are short-lived and exit once they are done, the risk from running them as root on the host is low.
- kubeadm sets the permissions: in this approach `kubeadm` would be responsible for setting the file permissions when it creates the files. It will call `os.Chown` to set the owner of the files. This approach is demonstrated in PR kubernetes/kubernetes#99753.
The author(s) believe that it is better for `kubeadm` to set the permissions, because adding an initContainer would require pulling the debian-base image (or similar) to run the commands to change file ownership, which is something that can easily be done in Go. It also raises the question of which initContainer would be responsible for files shared between `kube-controller-manager` and `kube-apiserver`. Since `kubeadm` creates these files, it is best placed to apply the file permissions (see the sketch below).
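A minimal sketch of this approach, with placeholder IDs and paths; the real implementation would live in the kubeadm phases that write out certificates and kubeconfig files:

```go
package main

import (
	"fmt"
	"os"
)

// chownForComponent sets the owner of the files a control-plane component
// needs, right after kubeadm writes them. uid/gid are the IDs of the dedicated
// system user created for that component.
func chownForComponent(uid, gid int, paths ...string) error {
	for _, p := range paths {
		if err := os.Chown(p, uid, gid); err != nil {
			return fmt.Errorf("failed to chown %s: %w", p, err)
		}
	}
	return nil
}

func main() {
	// Example: give the kube-scheduler user ownership of its kubeconfig.
	// The IDs are placeholders, not the values kubeadm would use.
	if err := chownForComponent(2000, 2000, "/etc/kubernetes/scheduler.conf"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```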
Certain hostVolume mounts are shared between `kube-apiserver` and `kube-controller-manager`; some examples of these are:
- /etc/kubernetes/pki/ca.crt
- /etc/kubernetes/pki/sa.key
We propose that files shared by `kube-controller-manager` and `kube-apiserver` be made readable by a particular GID, and that we set `supplementalGroups` in the PodSecurityContext of the `kube-apiserver` and `kube-controller-manager` Pods to this GID.
For instance, let's consider that `kubeadm` grants group 2100 read permission on `/etc/kubernetes/pki/ca.crt`. Then we update the `kube-apiserver` manifest as follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 172.17.0.2:6443
creationTimestamp: null
labels:
component: kube-apiserver
tier: control-plane
name: kube-apiserver
namespace: kube-system
spec:
securityContext:
supplementalGroups:
- 2100 # this value is only for an example and is not the id we plan to use.
containers:
- name: kube-apiserver
command:
- kube-apiserver
- --advertise-address=172.17.0.2
- --client-ca-file=/etc/kubernetes/pki/ca.crt
... # omitted to save space
securityContext:
runAsUser: 2000 # this value is only an example and is not the id we plan to use.
runAsGroup: 2000 # this value is only an example and is not the id we plan to use.
allowPrivilegeEscalation: false
seccompProfile:
        type: RuntimeDefault
capabilities:
drop:
- ALL
image: k8s.gcr.io/kube-apiserver:v1.21.0-beta.0.368_9850bf06b571d5-dirty
... # omitted to save space
volumeMounts:
... # omitted to save space
- mountPath: /etc/kubernetes/pki
name: k8s-certs
readOnly: true
... # omitted to save space
```
Similarly, `kube-controller-manager`'s manifest would look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-controller-manager
tier: control-plane
name: kube-controller-manager
namespace: kube-system
spec:
securityContext:
supplementalGroups:
- 2100
containers:
- name: kube-controller-manager
command:
- kube-controller-manager
- --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
...
# omitted to save space.
securityContext:
runAsUser: 2001 # this value is only an example and is not the id we plan to use.
runAsGroup: 2001 # this value is only an example and is not the id we plan to use.
allowPrivilegeEscalation: false
seccompProfile:
        type: RuntimeDefault
capabilities:
drop:
- ALL
image: k8s.gcr.io/kube-controller-manager:v1.21.0-beta.0.368_9850bf06b571d5-dirty
... # omitted to save space.
volumeMounts:
- mountPath: /etc/kubernetes/pki
name: k8s-certs
readOnly: true
... # omitted to save space.
```
Each of the components will run with a unique UID and GID. For each of the components we will create a unique user, and for the shared files/resources we will create groups. The naming convention of these users/groups is tabulated below. It should be noted that `kubeadm` will take exclusive ownership of these users/groups and will throw errors if users/groups with these names exist but are not in the expected ID range of SYS_UID_MIN-SYS_UID_MAX for users and SYS_GID_MIN-SYS_GID_MAX for groups.
Many of the components need shared access to certificate files. These are not protected by creating a group with read permissions, because certificates are not secrets; protecting them and creating groups for them does not improve our security posture in any way and only makes the change more complicated by adding unnecessary groups. Hence we only propose to create a group with read access for the `/etc/kubernetes/pki/sa.key` file, which is the only secret shared between `kube-apiserver` and `kube-controller-manager`. `kubeadm` creates all certificate files with mode `0644`, so we do not need to modify their owners as they are already world-readable.
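A sketch of applying that policy to `sa.key`, using the `kubeadm-sa-key-readers` group from the naming table below; the owner (root) and the gid value are placeholders/assumptions:

```go
package main

import "os"

func main() {
	const saKeyPath = "/etc/kubernetes/pki/sa.key"
	// gid of the kubeadm-sa-key-readers group; placeholder value only.
	const saKeyReadersGID = 2100

	// Keep root as the owner and give the shared group read access via 0640.
	if err := os.Chown(saKeyPath, 0, saKeyReadersGID); err != nil {
		panic(err)
	}
	if err := os.Chmod(saKeyPath, 0o640); err != nil {
		panic(err)
	}
}
```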
| User/Group name | Explanation |
|---|---|
| kubeadm-etcd | The UID/GID that we will assign to `etcd` |
| kubeadm-kas | The UID/GID that we will assign to `kube-apiserver` |
| kubeadm-kcm | The UID/GID that we will assign to `kube-controller-manager` |
| kubeadm-ks | The UID/GID that we will assign to `kube-scheduler` |
| kubeadm-sa-key-readers | The GID we will assign to a group that allows reading `/etc/kubernetes/pki/sa.key` |
Here is a table of all the files and directories that `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, and `etcd` mount, and the permissions that we will set for them. Files that we care about for this KEP:
| File/directory | Component(s) | Permissions (mode owner group) |
|---|---|---|
| /etc/kubernetes/pki/etcd/server.crt | etcd | 644 kubeadm-etcd kubeadm-etcd |
| /etc/kubernetes/pki/etcd/server.key | etcd | 600 kubeadm-etcd kubeadm-etcd |
| /etc/kubernetes/pki/etcd/peer.crt | etcd | 644 kubeadm-etcd kubeadm-etcd |
| /etc/kubernetes/pki/etcd/peer.key | etcd | 600 kubeadm-etcd kubeadm-etcd |
| /etc/kubernetes/pki/etcd/ca.crt | etcd, kas | 644 root root |
| /var/lib/etcd/ | etcd | 600 kubeadm-etcd kubeadm-etcd |
| /etc/kubernetes/pki/ca.crt | kas, kcm | 644 root root |
| /etc/kubernetes/pki/apiserver-etcd-client.crt | kas | 644 root root |
| /etc/kubernetes/pki/apiserver-etcd-client.key | kas | 600 kubeadm-kas kubeadm-kas |
| /etc/kubernetes/pki/apiserver-kubelet-client.crt | kas | 644 root root |
| /etc/kubernetes/pki/apiserver-kubelet-client.key | kas | 600 kubeadm-kas kubeadm-kas |
| /etc/kubernetes/pki/front-proxy-client.crt | kas | 644 root root |
| /etc/kubernetes/pki/front-proxy-client.key | No-one | 600 root root |
| /etc/kubernetes/pki/front-proxy-ca.crt | kas, kcm | 644 root root |
| /etc/kubernetes/pki/sa.pub | kas | 600 kubeadm-kas kubeadm-kas |
| /etc/kubernetes/pki/sa.key | kas, kcm | 640 kubeadm-sa-key-readers |
| /etc/kubernetes/pki/apiserver.crt | kas | 644 root root |
| /etc/kubernetes/pki/apiserver.key | kas | 600 kubeadm-kas kubeadm-kas |
| /etc/kubernetes/pki/ca.key | kcm | 600 kubeadm-kcm kubeadm-kcm |
| /etc/kubernetes/controller-manager.conf | kcm | 600 kubeadm-kcm kubeadm-kcm |
| /etc/kubernetes/scheduler.conf | ks | 600 kubeadm-ks kubeadm-ks |
In addition to the files/directories in the table above, the control-plane components also mount the directories below. We don't have to worry about these, as they are world-readable:
- /usr/local/share/ca-certificates
- /usr/share/ca-certificates
- /etc/ssl/certs
- /etc/ca-certificates
If any of the users/groups defined above already exist and are in the expected ID range of SYS_UID_MIN-SYS_UID_MAX for users and SYS_GID_MIN-SYS_GID_MAX for groups, then `kubeadm` will reuse them instead of creating new ones. More specifically, `kubeadm` will reuse the ones that exist and meet the criteria, and will create the ones that it still needs (see the sketch below).
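A minimal sketch of that decision: reuse an existing user only when its ID falls in the system range, error out otherwise, and signal the caller to create the user when it is absent. The helper name and range values are illustrative assumptions:

```go
package main

import (
	"fmt"
	"os/user"
	"strconv"
)

// reuseExistingUser returns the uid to reuse when a user named name already
// exists inside the allowed system range, found=false when the user is absent
// (the caller should then create it), or an error when the user exists outside
// the range.
func reuseExistingUser(name string, sysMin, sysMax int64) (uid int64, found bool, err error) {
	u, err := user.Lookup(name)
	if err != nil {
		if _, ok := err.(user.UnknownUserError); ok {
			return 0, false, nil // not present: create it
		}
		return 0, false, err
	}
	uid, err = strconv.ParseInt(u.Uid, 10, 64)
	if err != nil {
		return 0, false, err
	}
	if uid < sysMin || uid > sysMax {
		return 0, false, fmt.Errorf("user %q exists with uid %d outside the system range %d-%d", name, uid, sysMin, sysMax)
	}
	return uid, true, nil // present and valid: reuse it
}

func main() {
	fmt.Println(reuseExistingUser("kubeadm-etcd", 100, 999))
}
```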
`kubeadm reset` tries to remove everything created by `kubeadm` on the host, and it should do the same for the users and groups that it creates as part of cluster bootstrap (see the sketch below).
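A sketch of what that cleanup might look like during `kubeadm reset`: it shells out to `userdel`/`groupdel` for the names from the table above and skips accounts that are already gone. This is illustrative only, not kubeadm's actual reset code:

```go
package main

import (
	"fmt"
	"os/exec"
	"os/user"
)

func main() {
	users := []string{"kubeadm-etcd", "kubeadm-kas", "kubeadm-kcm", "kubeadm-ks"}
	groups := []string{"kubeadm-sa-key-readers"}

	for _, name := range users {
		if _, err := user.Lookup(name); err != nil {
			continue // already absent, nothing to do
		}
		if out, err := exec.Command("userdel", name).CombinedOutput(); err != nil {
			fmt.Printf("warning: could not remove user %s: %v: %s\n", name, err, out)
		}
	}
	for _, name := range groups {
		if _, err := user.LookupGroup(name); err != nil {
			continue // already absent, nothing to do
		}
		if out, err := exec.Command("groupdel", name).CombinedOutput(); err != nil {
			fmt.Printf("warning: could not remove group %s: %v: %s\n", name, err, out)
		}
	}
}
```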
A Windows control plane is out of scope for this proposal for the time being. OS-specific implementations for Linux would be carefully abstracted behind helper utilities in kubeadm so as not to block support for a Windows control plane in the future.
The following functionality needs to be tested:
- With feature-gate=True create a cluster
- With feature-gate=True upgrade a cluster
These tests will be added using the kinder tooling during the Alpha stage.
- All control-plane components run as non-root.
- All control-plane components have the `runtime/default` seccomp profile.
- All control-plane components drop all unnecessary capabilities.
- The feature is tested by the community and feedback is incorporated with changes.
- e2e tests run periodically and are green with respect to this functionality.
- `kubeadm` documentation is updated.
The flow below assumes that the feature gate to run the control plane as non-root is enabled.
`kubeadm` checks the cluster configuration to see if the control plane is already running as non-root. If so, it re-writes the contents of the files/credentials and makes sure that the previously assigned UIDs and GIDs have the appropriate read/write permissions. The control-plane static Pod manifests do not explicitly need to be updated to run as non-root in this case.
If the control plane was not running as non-root before, then `kubeadm` creates new UIDs and GIDs based on the approach described in the Assigning UID and GID section and updates the cluster configuration. When files/credentials are re-written, the owners of these files are set appropriately. The control-plane static Pod manifests explicitly need to be updated to run as non-root in this case.
`kubeadm` version X supports deploying Kubernetes control-plane versions X and X-1. Once the feature gate to run the control plane as non-root is enabled in kubeadm, both the X and X-1 versions of the control plane will run as non-root. Nothing in the design of this feature is tied to the version of the control plane.
⚠️ The PRR was N/A as there are no in-tree changes proposed in this KEP. Please see these Slack discussion threads: Thread 1, Thread 2.
Note: the feature gate here is for `kubeadm` and not for the control-plane components.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: kubeadmRootlessControlPlane
  - Components depending on the feature gate: `kube-apiserver`, `kube-controller-manager`, `kube-scheduler` and `etcd` in the `kubeadm` control plane
- Will enabling / disabling the feature require downtime of the control plane? No, since it will only take effect when the control-plane is upgraded or created.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No, this only affects the control-plane, no change is required on the node(s).
Yes, it will change the default behavior of `kubeadm` from running the control-plane components as root to running them as non-root.
Yes, disabling the feature gate in `kubeadm` and then upgrading the control plane to the current version should run the components as root again.
Nothing, unless the user upgrades or creates a new cluster; if they do, the control-plane components on the upgraded/created cluster will run as non-root.
Yes, we plan to add e2e tests using kinder to test the `kubeadm` behavior with the feature gate enabled.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
None
No
No
No
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes, during the kubeadm control-plane bootstrap process we will create users/groups for the various control-plane components; this adds a small delay to bootstrap, and failing to create them would cause the bootstrap to fail.
When we create files and directories we also have to change their permissions and owners, so there will be a small increase in control-plane bootstrap time.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
Major milestones:
- Initial draft of KEP created - 2021-03-13
- Production readiness review - 2021-04-12
- Production readiness review approved - 2021-04-29
- KEP marked implementable - 2021-04-28
None
None
None