KEP 1710: Update SELinux mount ReadWriteOnce optimization for 1.26 #3548

Closed
wants to merge 2 commits into from
74 changes: 62 additions & 12 deletions keps/sig-storage/1710-selinux-relabeling/README.md
@@ -23,6 +23,7 @@
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Required kubelet changes](#required-kubelet-changes)
- [Volume Reconstruction](#volume-reconstruction)
- [Implementation phases](#implementation-phases)
- [Phase 1](#phase-1)
- [Phase 2](#phase-2)
@@ -55,18 +56,18 @@
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [x] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary
@@ -332,8 +333,7 @@ Apart from the obvious API change and behavior described above, kubelet + volume
* Kubelet's VolumeManager needs to track which SELinux label should get a volume in global mount (to call `MountDevice()` with the right mount options).
* It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, if that pod has a different SELinux context.
* After kubelet restart, kubelet must reconstruct the original SELinux label it used for `SetUp` and `MountDevice` of each volume.
* Volume reconstruction must be updated to get the SELinux label from mount (in-tree volume plugins) or stored json file (CSI).
This label must be updated in VolumeManager's ActualStateOfWorld after reconstruction.
See Volume Reconstruction below.
* Reconciler must also check the SELinux context used to mount a volume (both mounted devices and volumes) before deciding which operation to perform on the volume (`MountVolume`, `UnmountVolume`/`UnmountDevice`, or nothing).
It must return a clear error message saying that a Pod can't start because its volume is used by another Pod with a different SELinux context.
* This is a good point to capture any metrics proposed below.
@@ -347,6 +347,40 @@ Apart from the obvious API change and behavior described above, kubelet + volume
This error is already part of generic `storage_operation_duration_seconds` metric (with a label for failures).
* Note that kubelet can't check mount options after `NodeStage`, because a CSI driver does not need to mount during NodeStage or it may choose to mount to another directory than the staging one.

#### Volume Reconstruction

Today, volume reconstruction works in this way:

1. When kubelet starts, it immediately begins populating the volume manager's Desired State of World (DSW) (e.g. with static pods),
and it starts running Pods and mounting volumes for them. If a volume is already mounted, kubelet relies on volume plugin / CSI driver
idempotency. At this point, the Actual State of World (ASW) is empty and is populated with volumes
mounted for Pods as they start.
2. When kubelet establishes a connection to the API server and the DSW is fully populated, it reconstructs from disk only those volumes that are not
present in the DSW. This should cover only volumes that don't have a Pod in the API server and need to be unmounted. Kubelet adds these
volumes to the ASW and lets the regular reconciler unmount them.

This approach does not work for SELinux, because in step 1 above the volume manager needs to know *if* a volume is mounted and with
*what SELinux context mount option*. If the required and existing SELinux contexts of a volume match, the volume manager can continue
mounting the volume. If they don't, the volume manager needs to unmount the volume with the wrong SELinux context first and mount it again
with the right one.
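
The decision described above can be sketched in Go. This is only an illustration, not the kubelet implementation: `Action`, `MountedVolume`, and `Decide` are hypothetical names, and the SELinux context is assumed to be carried as a plain string.

```go
package main

// Action is a hypothetical name for what the volume manager should do next.
type Action int

const (
	ActionMount    Action = iota // volume is not mounted: mount with the required context
	ActionContinue               // mounted with a matching context: continue (idempotent) mounting
	ActionRemount                // mounted with another context: unmount first, then mount again
)

// String makes the action readable in logs and error messages.
func (a Action) String() string {
	switch a {
	case ActionMount:
		return "mount"
	case ActionContinue:
		return "continue"
	default:
		return "unmount-then-mount"
	}
}

// MountedVolume is an illustrative record of what was found for one volume.
// SELinuxContext is assumed to hold the context mount option, "" if none.
type MountedVolume struct {
	Mounted        bool
	SELinuxContext string
}

// Decide compares the existing mount with the context the Pod requires.
func Decide(existing MountedVolume, required string) Action {
	switch {
	case !existing.Mounted:
		return ActionMount
	case existing.SELinuxContext == required:
		return ActionContinue
	default:
		return ActionRemount
	}
}
```

For example, `Decide(MountedVolume{Mounted: true, SELinuxContext: "system_u:object_r:container_file_t:s0:c1,c2"}, "system_u:object_r:container_file_t:s0:c3,c4")` yields `ActionRemount`. The sketch makes clear why the ASW must already know the existing context when the first Pod is started.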

We need to populate the ASW as soon as possible after kubelet starts. Suggested changes:

1. When kubelet starts, the volume manager will reconstruct all volumes, including their SELinux contexts, and put them into the ASW as *uncertain*.
At this point, kubelet may not have connection to the API server yet, hence this phase of volume reconstruction must work without it.
Kubelet will store all reconstructed volumes in a separate array, to finish the reconstruction when the API server is available.
* This implies that volume plugins can't expect that the API server is available in `ConstructVolumeSpec`, `ConstructBlockVolumeSpec`,
`NewMounter`, `NewBlockVolumeMapper`, and `NewDeviceMounter` calls. Especially all `CSIDriver` checks in the CSI volume plugin must
be moved to `SetUpAt` or `TearDownAt`, and their block volume counterparts.
2. Only after the initial ASW is populated does kubelet start running pods and mounting volumes for them. Since the existing volumes are marked
as *uncertain*, the volume manager will re-mount them (relying on volume plugin / CSI driver idempotency). Note that only mounting
is allowed at this point; the volume manager can't unmount anything, because the DSW is not yet populated.
3. When kubelet establishes a connection to the API server, it populates the DSW as usual.
4. When the DSW is fully populated, the volume manager will finish reconstruction of volumes, i.e. fill in devicePaths from the
`node.status.volumesInUse` field.
5. Only after the second phase of volume reconstruction is done, i.e. the DSW is fully populated and volumes are fully reconstructed,
the volume manager starts unmounting volumes that are not in the DSW.
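
The ordering constraints in the steps above can be expressed as a small gate. This is a minimal sketch, not kubelet code: `Phase`, `CanMount`, `CanUnmount`, and `VolumesToUnmount` are hypothetical names, and it assumes that unmount candidates are volumes present in the ASW but absent from the DSW.

```go
package main

import "sort"

// Phase is a hypothetical marker of how far kubelet startup has progressed.
type Phase int

const (
	PhaseStarted            Phase = iota // kubelet just started, nothing reconstructed yet
	PhaseASWReconstructed                // step 1 done: reconstructed volumes are marked uncertain
	PhaseDSWPopulated                    // step 3 done: API server reachable, DSW fully populated
	PhaseReconstructionDone              // step 4 done: reconstruction finished
)

// CanMount reports whether pods may be started and volumes mounted (step 2).
func CanMount(p Phase) bool { return p >= PhaseASWReconstructed }

// CanUnmount reports whether unmounting is allowed (step 5).
func CanUnmount(p Phase) bool { return p >= PhaseReconstructionDone }

// VolumesToUnmount returns the sorted names of volumes that are mounted
// (in asw) but no longer wanted (not in dsw). It returns nil while
// unmounting is not yet allowed.
func VolumesToUnmount(p Phase, asw, dsw map[string]bool) []string {
	if !CanUnmount(p) {
		return nil
	}
	var out []string
	for name := range asw {
		if !dsw[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}
```

The point of the gate is that an incomplete DSW must never be mistaken for "no Pod wants this volume": until reconstruction has finished, `VolumesToUnmount` returns nothing regardless of what the two maps contain.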

### Implementation phases

Because this changes Kubernetes behavior, we will first implement the feature only for cases where it can't break anything.
@@ -508,10 +542,24 @@ _This section must be completed when targeting beta graduation to a release._

* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- All `errors_total` metrics below cover real errors when a Pod can't start.
It applies to `ReadWriteOncePod` volumes.
- All `warnings_total` metrics below cover **future** errors that would appear if this feature were extended to all volumes.
This will be evaluated in Phase 2.
- 1. `volume_manager_selinux_container_errors_total` + `volume_manager_selinux_container_warnings_total`: Number of errors when kubelet cannot compute SELinux context for a container.
This indicates an error converting SELinux context into SELinux label by the github.com/opencontainers/selinux/go-selinux library.
Reading its source code, this should never happen, but one never knows.
1. `volume_manager_selinux_pod_context_mismatch_errors_total` + `volume_manager_selinux_pod_context_mismatch_warnings_total`: Number of errors when a Pod defines different SELinux contexts for its containers that use the same volume.
Before this feature, only one container in such a Pod could access the volume.
With this feature, the Pod won't even start.
This metric captures the number of failed Pod starts, including periodic retries.
1. `volume_manager_selinux_volume_context_mismatch_errors_total` + `volume_manager_selinux_volume_context_mismatch_warnings_total`: Number of errors when a Pod uses a volume that is already mounted with a different SELinux context than the Pod needs.
Before this feature, both pods would start, but only one such pod could access the volume.
With this feature, one of the Pods won't even start.
Member


We briefly discussed rejecting such pods via pod admission in future. Are we still planning to do that?

Member Author


Kubelet admission already rejects RWOP pods that use a volume that is already used.

We could add the same for Pods with mismatching SELinux contexts, however, some volume types (e.g. NFS) might support a volume mounted on a node several times with different contexts.

- Components exposing the metric: KCM

- [ ] Other (treat as last resort)
- Details:

@@ -653,7 +701,9 @@ _This section must be completed when targeting beta graduation to a release._

## Implementation History

* 1.25: Alpha
* 1.25: Partial implementation of alpha.
* Volume reconstruction after kubelet start does not reconstruct SELinux contexts.
* 1.26: Alpha with everything implemented.

## Drawbacks [optional]

13 changes: 9 additions & 4 deletions keps/sig-storage/1710-selinux-relabeling/kep.yaml
@@ -19,16 +19,21 @@ approvers:
see-also:
- /keps/sig-storage/695-skip-permission-change/README.md
stage: alpha
latest-milestone: "v1.24"
latest-milestone: "v1.26"
milestone:
alpha: "v1.24"
beta: "v1.25"
stable: "v1.27"
beta: "v1.27"
stable: "v1.29"
feature-gates:
- name: SELinuxMountReadWriteOncePod
components:
- kube-apiserver
- kubelet
disable-supported: true
metrics:
# TODO: fill at beta
- volume_manager_selinux_container_errors_total
- volume_manager_selinux_container_warnings_total
- volume_manager_selinux_pod_context_mismatch_errors_total
- volume_manager_selinux_pod_context_mismatch_warnings_total
- volume_manager_selinux_volume_context_mismatch_errors_total
- volume_manager_selinux_volume_context_mismatch_warnings_total
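
The six metric names above follow one pattern: a problem kind plus an `errors_total` or `warnings_total` suffix, where errors apply to `ReadWriteOncePod` volumes and warnings cover all other volumes. A minimal Go sketch of that naming rule follows; `MismatchKind` and `MetricName` are hypothetical helpers for illustration and say nothing about how kubelet actually registers its counters.

```go
package main

// MismatchKind is a hypothetical enumeration of the three problem kinds
// named in the metrics list.
type MismatchKind string

const (
	KindContainer             MismatchKind = "container"
	KindPodContextMismatch    MismatchKind = "pod_context_mismatch"
	KindVolumeContextMismatch MismatchKind = "volume_context_mismatch"
)

// MetricName picks the counter to increment. Problems on ReadWriteOncePod
// volumes are real errors (the Pod can't start); on any other volume they
// are only warnings about errors that would appear if the feature were
// extended to all volumes.
func MetricName(kind MismatchKind, readWriteOncePod bool) string {
	suffix := "warnings_total"
	if readWriteOncePod {
		suffix = "errors_total"
	}
	return "volume_manager_selinux_" + string(kind) + "_" + suffix
}
```

For instance, `MetricName(KindVolumeContextMismatch, false)` gives `volume_manager_selinux_volume_context_mismatch_warnings_total`.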