From d0ddbb027448a45e166a7ede6c3535fd85f5966a Mon Sep 17 00:00:00 2001
From: Jan Safranek
Date: Wed, 25 Jan 2023 17:46:56 +0100
Subject: [PATCH] Wednesday changes

---
 .../3756-volume-reconstruction/README.md | 285 +++++++++++-------
 1 file changed, 170 insertions(+), 115 deletions(-)

diff --git a/keps/sig-storage/3756-volume-reconstruction/README.md b/keps/sig-storage/3756-volume-reconstruction/README.md
index ed6a71eff306..4d7ce5a2dd31 100644
--- a/keps/sig-storage/3756-volume-reconstruction/README.md
+++ b/keps/sig-storage/3756-volume-reconstruction/README.md
@@ -187,12 +187,38 @@ not needed any longer (all Pods that used them were deleted).
 VolumeManager keeps two caches:
 * *DesiredStateOfWorld* (DSW) contains volumes that should be mounted.
 * *ActualStateOfWorld* (ASW) contains currently mounted volumes.
-
-VolumeManager then compares these two caches and tries to move ASW towards DSW.
-
-Both caches exist only in memory and are lost when kubelet process dies. It's
-relatively easy to populate DSW - just list all Pods from the API server
-(and static pods and whatnot) and collect their volumes.
+  A volume in ASW can be marked as:
+  * Globally mounted - it is mounted in `/var/lib/kubelet/volumes//...`.
+    * This mount is optional and depends on volume plugin / CSI driver
+      capabilities. If it's supported, each volume has only a single global
+      mount.
+  * Mounted into a Pod local directory - it is mounted in
+    `/var/lib/kubelet/pods//volumes/...`. Each pod that uses a volume
+    gets its own local mount, because each pod has a different ``.
+    If the volume plugin / CSI driver supports the global mount mentioned
+    above, each pod local mount is typically a bind-mount of the global mount.
+
+  In addition, both global and local mounts can be marked as *uncertain*, when
+  kubelet is not 100% sure whether the volume is fully mounted there.
+  Typically, this happens when a CSI driver times out NodeStage / NodePublish
+  calls and kubelet can't be sure if the CSI driver has finished mounting the
+  volume *after* the timeout. Kubelet then needs to call NodeStage /
+  NodePublish again if the volume is still needed by some Pods, or call
+  NodeUnstage / NodeUnpublish if all Pods that needed the volume were deleted.
+
+VolumeManager runs two separate goroutines:
+* the *[reconciler](https://github.com/kubernetes/kubernetes/blob/44b72d034852eb6da8916c82ce722af604b196c5/pkg/kubelet/volumemanager/reconciler/reconciler.go#L47-L69)*,
+  which periodically compares ASW and DSW and tries to move ASW towards DSW.
+* the *DesiredStateOfWorldPopulator* (DSWP), which
+  [periodically lists Pods from PodManager and adds them to DSW](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L175-L189).
+  The DSWP is marked as `hasAddedPods=true` ("fully populated") only after
+  it has read all Pods from files (static pods) **and** the API server (i.e.
+  [`sourcesReady.AllReady` returns `true` here](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L150-L159)).
+
+Both the ASW and DSW caches exist only in memory and are lost when the kubelet
+process dies. It's relatively easy to populate DSW - just list all Pods from
+the API server and static pods and collect their volumes. Populating ASW is
+complicated and is actually the source of several problems that we want to
+address in this KEP.
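+
+The following sketch is purely illustrative - the types below are not the real
+kubelet types, which live in `pkg/kubelet/volumemanager` - but it shows how
+the two caches and the *uncertain* state relate to each other:
+
+```go
+// Simplified, illustrative model of the two VolumeManager caches.
+package main
+
+import "fmt"
+
+type mountState int
+
+const (
+	notMounted mountState = iota
+	uncertain  // e.g. NodeStage / NodePublish timed out, result unknown
+	mounted
+)
+
+// desiredVolume is one DSW entry: "this pod needs this volume mounted".
+type desiredVolume struct {
+	volumeName string
+	podUID     string
+}
+
+// actualVolume is one ASW entry: what kubelet believes is mounted on the node.
+type actualVolume struct {
+	globalMount mountState            // the optional global mount, at most one per volume
+	podMounts   map[string]mountState // pod UID -> the pod local mount of this volume
+}
+
+// reconcile moves ASW towards DSW: anything desired but not certainly mounted
+// is (re)mounted; anything mounted but no longer desired would be unmounted
+// (the unmount half is omitted here for brevity).
+func reconcile(dsw []desiredVolume, asw map[string]*actualVolume) {
+	for _, d := range dsw {
+		v, ok := asw[d.volumeName]
+		if !ok || v.podMounts[d.podUID] != mounted {
+			fmt.Printf("SetUp %s for pod %s (and MountDevice first, if supported)\n",
+				d.volumeName, d.podUID)
+		}
+	}
+}
+
+func main() {
+	asw := map[string]*actualVolume{
+		"pvc-1": {globalMount: mounted, podMounts: map[string]mountState{"pod-a": uncertain}},
+	}
+	dsw := []desiredVolume{{volumeName: "pvc-1", podUID: "pod-a"}}
+	reconcile(dsw, asw) // pod-a's mount is uncertain, so SetUp is called again
+}
+```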
 
 *Volume reconstruction* is a process where kubelet tries to create a single
 valid `PersistentVolumeSpec` or `VolumeSpec` for a volume from the OS.
@@ -203,10 +229,11 @@ the volume (i.e. to call `volumePlugin.TearDown` + `UnmountDevice` calls).
 
 Today, kubelet populates VolumeManager's DSW first, from static Pods and
 pods received from the API server. ASW is populated from the OS
-after DSW is complete and **only volumes missing in DSW are added there**.
-In other words, kubelet reconstructs only the volumes for Pods that were running
-before kubelet started, but they were delered when kubelet started. (If the pod
-was Running, its volumes would be in DSW).
+after DSW is fully populated (`hasAddedPods==true`) and **only volumes missing
+in DSW are added there**. In other words, kubelet reconstructs only the volumes
+for Pods that were running, but were deleted from the API server before kubelet
+started. (If the pod is still in the API server and Running, its volumes would
+be in DSW.)
 
 We assumed that this was enough, because if a volume is in DSW, the
 VolumeManager will try to mount the volume, and it will eventually reach ASW.
@@ -230,15 +257,15 @@ nitty-gritty.
 
 We propose to reverse the kubelet startup process.
 
 1. Quickly reconstruct ASW from the OS and add **all** found volumes to ASW
-   when kubelet starts. "Quickly" means the process should look only at the OS
-   and files/directories in `/var/lib/kubelet/` and it should not
-   require the API server or any network calls. Esp. the API server may
+   when kubelet starts as *uncertain*. "Quickly" means the process should look
+   only at the OS and files/directories in `/var/lib/kubelet/pods` and it
+   should not require the API server or any network calls. Notably, the API server may
    not be available at this stage of kubelet startup.
-2. Once all mounted volumes are in ASW, start populating DSW from static Pods
-   and Pods received from the API server.
-3. When connection to the API server becomes available, complete information
-   in ASW with data from the API server (e.g. from `node.status`). This
-   typically happens in parallel to the previous step.
+2. In parallel to 1., start the DSWP and populate DSW from the API server and
+   static pods.
+3. When the connection to the API server becomes available, complete the
+   reconstructed information in ASW with data from the API server (e.g. from
+   `node.status`). This typically happens in parallel to the previous step.
 
 Benefits:
 
@@ -265,9 +292,12 @@ to split the feature, we propose feature gate `NewVolumeManagerReconstruction`.
 
 (This is not a new story, we want to keep this behavior)
 
 As a cluster admin, I want kubelet to resume where it stopped when it was
-restarted or its machine was rebooted. It must be able to recognize what
-happened in the meantime and either unmount any volumes of Pods that were
-deleted in the API server or mount volumes for newly created Pods.
+restarted or its machine was rebooted, so I don't need to clean up / unmount
+any volumes manually.
+
+It must be able to recognize what happened in the meantime and either unmount
+any volumes of Pods that were deleted in the API server or mount volumes for
+newly created Pods.
 
 ### Notes/Constraints/Caveats (Optional)
 
@@ -291,35 +321,29 @@ that is starting.
 
 We don't know what other kubelet configurations people use, so we decided to
 write a KEP and move the new VolumeManager startup behind a feature gate.
 
-
 ## Design Details
 
 ### Proposed VolumeManager startup
 
-When kubelet starts:
-
-1. VolumeManager starts populating DSW from PodManager, who reads Pods from
-   static pods and/or the API server.
-
-2. In parallel to 1., VolumeManager scans `/var/lib/kubelet/pods/*` and
-   reconstructs *all* found volumes and adds them to ASW as *uncertain*. I.e.
-   depending on DSW, the VolumeManager will either call `volumePlugin.SetUp()`
-   to make sure the volume is fully mounted or it will call
-   `volumePlugin.TearDown()` to make sure the volume is unmounted. Only
-   information that is available in the Pod directory on the disk are
-   reconstructed into ASW at this point.
-   * Since the volume reconstruction can be imperfect and can miss `devicePath`,
-     VolumeManager adds all reconstructed volumes to `volumesNeedDevicePath`
-     array, to finish their reconstruction from `node.status.volumesAttached`
-   * All volumes that failed reconstruction are added to
-     `volumesFailedReconstruction` list.
-
-
-After **ASW** is populated, VolumeManager starts its reconciliation loop, i.e.
-starts comparing ASW and DSW and periodically calls:
+When kubelet starts, VolumeManager starts the DSWP and the reconciler
+[in parallel](https://github.com/kubernetes/kubernetes/blob/575616cc72dbfdd070ead81ec29c0d4f00226487/pkg/kubelet/volumemanager/volume_manager.go#L288-L292).
+
+However, before it starts reconciling DSW and ASW, the reconciler scans
+`/var/lib/kubelet/pods/*`, reconstructs **all** found volumes and adds them to
+ASW as *uncertain*. Only information that is available in the Pod directory on
+the disk is reconstructed into ASW at this point.
+* Since the volume reconstruction can be imperfect and can miss `devicePath`,
+  VolumeManager adds all reconstructed volumes to the `volumesNeedDevicePath`
+  array, to finish their reconstruction from `node.status.volumesAttached`
+  later.
+* All volumes that failed reconstruction are added to the
+  `volumesFailedReconstruction` list.
+
+After **ASW** is populated, the reconciler starts its
+[reconciliation loop](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/reconciler/reconciler_new.go#L33-L69):
 
 1. `mountOrAttachVolumes()` - mounts (and attaches, if necessary) volumes that
    are in DSW, but not in ASW. This can happen even before DSW is fully
-   populated, because ASW is fully populated at this point.
+   populated.
 
 2. `updateReconstructedDevicePaths()` - once kubelet gets connection to the API
    server and reads its own `Node.status`, volumes in `volumesNeedDevicePath`
@@ -329,11 +353,11 @@ starts comparing ASW and DSW and periodically calls:
 
 3. (Only once): Add all reconstructed volumes to `node.status.volumesInUse`.
 
-4. Only after DSW was fully populated (i.e. it contains at least Pods that were
-   Running when kubelet started), **and** `devicePaths` were populated from
+4. Only after DSW has been fully populated (i.e. VolumeManager can tell whether
+   a volume is really needed or not), **and** `devicePaths` were populated from
    `node.status`, VolumeManager can start unmounting volumes and calls:
-   1. `unmountVolumes()` - unmounts (`TearDown`) pod local volumes that are in
-      ASW and are not in DSW.
+   1. `unmountVolumes()` - unmounts (`TearDown`) pod local volume mounts that
+      are in ASW and are not in DSW.
    2. `unmountDetachDevices()` - unmounts (`UnmountDevice`) global volume mounts
      of volumes that are in ASW and are not in DSW.
    3. `cleanOrphanVolumes()` - tries to clean up `volumesFailedReconstruction`.
@@ -341,25 +365,23 @@ volume, because kubelet failed to reconstruct the volume spec from
      `/var/lib/kubelet/pods//volumes/xyz`. Kubelet at least tries to unmount
      the directory and clean up any orphan files there.
+      This happens only once; `volumesFailedReconstruction` is cleared
+      afterwards.
 
 Note that e.g. `mountOrAttachVolumes` can call `volumePlugin.MountDevice` /
-`SetUp()` on a reconstructed volume (because it's *uncertain*) and finally
-update ASW, while the VolumeManager is still waiting for the API server to
-update `devicePath` of the same volume in ASW (step 2. above).
-We made sure that `updateReconstructedDevicePaths` will update the
-`devicePath` only for *uncertain* volumes, not to overwrite the *certain* ones.
+`SetUp()` on a reconstructed volume (because it was added to ASW as *uncertain*)
+and finally update ASW, while the VolumeManager is still waiting for the API
+server to update `devicePath` of the same volume in ASW (step 2. above). We made
+sure that `updateReconstructedDevicePaths()` updates the `devicePath` only for
+volumes that are still *uncertain*, so that it does not overwrite the *certain*
+ones.
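+
+The ordering above can be summarized in the following sketch. It is purely
+illustrative: only the step functions named in the list above exist in kubelet,
+while the `reconciler` struct, its flags and `reconstructVolumes()` are
+simplified pseudocode.
+
+```go
+// Illustrative ordering of the proposed reconciler loop; the real
+// implementation is in reconciler_new.go (linked above).
+package sketch
+
+type reconciler struct {
+	dswPopulated       bool // set once DSWP has read all Pod sources
+	devicePathsUpdated bool // set once node.status.volumesAttached was processed
+	volumesReported    bool
+	orphansCleaned     bool
+}
+
+func (rc *reconciler) reconstructVolumes()             {} // scan /var/lib/kubelet/pods/*, add to ASW as uncertain
+func (rc *reconciler) mountOrAttachVolumes()           {}
+func (rc *reconciler) updateReconstructedDevicePaths() {}
+func (rc *reconciler) unmountVolumes()                 {}
+func (rc *reconciler) unmountDetachDevices()           {}
+func (rc *reconciler) cleanOrphanVolumes()             {}
+
+func (rc *reconciler) run() {
+	rc.reconstructVolumes() // populate ASW first, everything *uncertain*
+
+	for {
+		rc.mountOrAttachVolumes() // 1. safe even before DSW is fully populated
+
+		rc.updateReconstructedDevicePaths() // 2. no-op until node.status is readable
+
+		if !rc.volumesReported {
+			// 3. (only once) add reconstructed volumes to node.status.volumesInUse
+			//    (the actual reporting is omitted in this sketch)
+			rc.volumesReported = true
+		}
+
+		if rc.dswPopulated && rc.devicePathsUpdated {
+			// 4. unmounting is safe only when kubelet knows which volumes
+			//    are really not needed any longer
+			rc.unmountVolumes()
+			rc.unmountDetachDevices()
+			if !rc.orphansCleaned {
+				rc.cleanOrphanVolumes()
+				rc.orphansCleaned = true
+			}
+		}
+		// (sleep / wakeup handling omitted)
+	}
+}
+```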
 
 ### Old VolumeManager startup
 
-When kubelet starts:
-
-1. VolumeManager starts populating DSW from PodManager, who reads Pods from
-   static pods and/or the API server.
-
-While DSW populator runs in the another thread, VolumeManager starts its
-reconciliation loop. Note that ASW is empty at this point + DSW is being
-populated. It starts comparing ASW and DSW and periodically following calls:
+When kubelet starts, VolumeManager starts the DSWP and the reconciler
+[in parallel](https://github.com/kubernetes/kubernetes/blob/575616cc72dbfdd070ead81ec29c0d4f00226487/pkg/kubelet/volumemanager/volume_manager.go#L288-L292).
+[The reconciler](https://github.com/kubernetes/kubernetes/blob/44b72d034852eb6da8916c82ce722af604b196c5/pkg/kubelet/volumemanager/reconciler/reconciler.go#L33-L45)
+then periodically does:
 
 1. `unmountVolumes()` - unmounts (`TearDown`) pod local volumes that are in
    ASW and are not in DSW. Since the ASW is initially empty, this call becomes
    useful only later.
 2. `mountOrAttachVolumes()` - mounts (and attaches, if necessary) volumes that
    are in DSW, but not in ASW. Initially, all volumes in DSW are mounted /
    attached, because ASW is empty. This is actually how ASW gets populated.
 3. `unmountDetachDevices()` - unmounts (`UnmountDevice`) global volume mounts
    of volumes that are in ASW and are not in DSW.
-4. Only once after DSW is fully populated (i.e. it contains at least Pods that
-   were Running when kubelet started):
-   1. VolumeManager scans `/var/lib/kubelet/pods/*`
+4. Only once after DSW is fully populated:
+   1. VolumeManager calls `sync()`, which scans `/var/lib/kubelet/pods/*`
      and reconstructs **only** volumes that are not in DSW. (If a volume is in
     DSW, we expect that it reaches ASW via `mountOrAttachVolumes()` above.)
      * `devicePath` of reconstructed volumes is populated from
@@ -416,7 +437,23 @@ This can inform certain test coverage improvements that we want to do before
 extending the production code to implement this enhancement.
 -->
 
-- ``: `` - ``
+All files are in `k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/`,
+coverage data taken on 2023-01-25.
+
+The old reconciler + reconstruction:
+- `reconciler.go`: `77.1%`
+- `reconciler_new.go`: `73.3%`
+- `reconstruct.go`: `75.7%`
+
+The new reconciler + reconstruction:
+- `reconciler_new.go`: `73.3%`
+- `reconstruct_new.go`: `66.2%`
+
+Common code:
+- `reconciler_common.go`: `86.2%`
+- `reconstruct_common.go`: `75.8%`
+
+TODO: check why the new coverage is actually lower.
 
 ##### Integration tests
 
@@ -428,7 +465,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
 https://storage.googleapis.com/k8s-triage/index.html
 -->
 
-- :
+None.
 
 ##### e2e tests
 
@@ -442,71 +479,40 @@ https://storage.googleapis.com/k8s-triage/index.html
 We expect no non-infra related flakes in the last month as a GA graduation
 criteria.
 -->
 
-- :
-
-### Graduation Criteria
-
 ### Upgrade / Downgrade Strategy
 
@@ -522,6 +528,9 @@ enhancement:
   cluster required to make on upgrade, in order to make use of the enhancement?
 -->
 
+The feature is enabled by a single feature gate on kubelet and does not require
+any special upgrade / downgrade handling.
+
 ### Version Skew Strategy
 
+The feature affects only how kubelet starts. It has no implications for other
+Kubernetes components or other kubelets. Therefore, we don't see any issues
+with any version skew.
+
 ## Production Readiness Review Questionnaire
 
-- [ ] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name:
-  - Components depending on the feature gate:
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: NewVolumeManagerReconstruction
+  - Components depending on the feature gate: kubelet
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
@@ -596,6 +609,8 @@ Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
 
+It changes how kubelet starts. It has no visible effect on any API object.
+
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
+The feature can be disabled without any issues.
+
 ###### What happens if we reenable the feature if it was previously rolled back?
 
+Nothing interesting.
+
 ###### Are there any tests for feature enablement/disablement?
 
+We have unit tests that run with the feature both disabled and enabled.
+The feature affects only kubelet startup and we don't change the format of data
+present in the OS (mount table, content of `/var/lib/kubelet/pods/`), so we
+don't have automated tests that start kubelet with the feature enabled and then
+disable it, or vice versa.
+
 ### Rollout, Upgrade and Rollback Planning
 
+In theory, if the new reconstruction is buggy, kubelet either does not come up
+at all (crashes, hangs) or does not unmount volumes that it should unmount.
+
 ###### What specific metrics should inform a rollback?
 
+TODO. VolumeManager startup, especially volume reconstruction, does not have
+any metrics.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
+TODO
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
+No.
+
 ### Monitoring Requirements
 
+TODO: whole chapter.
+
 ###### How can an operator determine if the feature is in use by workloads?
 
+Not applicable - the feature only changes how kubelet starts and is not used by
+workloads directly.
+
 ### Scalability
 
+No.
+
 ###### Will enabling / using this feature result in introducing new API types?
 
+No.
+
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
+No.
+
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
+No.
+
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
+Kubelet startup could be slower, but that would be a bug. In theory, the old
+and the new VolumeManager startup do the same things, just in a different
+order.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
+No.
+
 ### Troubleshooting