From d0ddbb027448a45e166a7ede6c3535fd85f5966a Mon Sep 17 00:00:00 2001
From: Jan Safranek
Date: Wed, 25 Jan 2023 17:46:56 +0100
Subject: [PATCH] Wednesday changes

---
 .../3756-volume-reconstruction/README.md | 285 +++++++++++-------
 1 file changed, 170 insertions(+), 115 deletions(-)

diff --git a/keps/sig-storage/3756-volume-reconstruction/README.md b/keps/sig-storage/3756-volume-reconstruction/README.md
index ed6a71eff306..4d7ce5a2dd31 100644
--- a/keps/sig-storage/3756-volume-reconstruction/README.md
+++ b/keps/sig-storage/3756-volume-reconstruction/README.md
@@ -187,12 +187,38 @@ not needed any longer (all Pods that used them were deleted).
 VolumeManager keeps two caches:
 * *DesiredStateOfWorld* (DSW) contains volumes that should be mounted.
 * *ActualStateOfWorld* (ASW) contains currently mounted volumes.
-
-VolumeManager then compares these two caches and tries to move ASW towards DSW.
-
-Both caches exist only in memory and are lost when kubelet process dies. It's
-relatively easy to populate DSW - just list all Pods from the API server
-(and static pods and whatnot) and collect their volumes.
+  A volume in ASW can be marked as:
+  * Globally mounted - it is mounted in `/var/lib/kubelet/volumes//...`.
+    * This mount is optional and depends on volume plugin / CSI driver
+      capabilities. If it's supported, each volume has only a single global
+      mount.
+  * Mounted into a Pod local directory - it is mounted in
+    `/var/lib/kubelet/pods//volumes/...`. Each pod that uses a volume
+    gets its own local mount, because each pod has a different ``.
+    If the volume plugin / CSI driver supports the global mount mentioned
+    above, each pod local mount is typically a bind-mount of the global mount.
+
+  In addition, both global and local mounts can be marked as *uncertain*, when
+  kubelet is not 100% sure whether the volume is fully mounted there.
+  Typically, this happens when a CSI driver times out NodeStage / NodePublish
+  calls and kubelet can't be sure if the CSI driver has finished mounting the
+  volume *after* the timeout. Kubelet then needs to call NodeStage /
+  NodePublish again if the volume is still needed by some Pods, or call
+  NodeUnstage / NodeUnpublish if all Pods that needed the volume were deleted.
+
+VolumeManager runs two separate goroutines:
+* the *[reconciler](https://github.com/kubernetes/kubernetes/blob/44b72d034852eb6da8916c82ce722af604b196c5/pkg/kubelet/volumemanager/reconciler/reconciler.go#L47-L69)*,
+  which periodically compares ASW and DSW and tries to move ASW towards DSW.
+* the *DesiredStateOfWorldPopulator* (DSWP), which
+  [periodically lists Pods from PodManager and adds them to DSW](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L175-L189).
+  The DSWP is marked as `hasAddedPods=true` ("fully populated") only after
+  it has read all Pods from files (static pods) **and** the API server (i.e.
+  [`sourcesReady.AllReady` returns `true` here](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L150-L159)).
+
+Both the ASW and DSW caches exist only in memory and are lost when the kubelet
+process dies. It's relatively easy to populate DSW - just list all Pods from
+the API server and static pods and collect their volumes. Populating ASW is
+complicated and is actually the source of several problems that we want to
+address in this KEP.
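+
+The following sketch is purely illustrative - the types below are not the real
+kubelet types, which live in `pkg/kubelet/volumemanager` - but it shows how
+the two caches and the *uncertain* state relate to each other:
+
+```go
+// Simplified, illustrative model of the two VolumeManager caches.
+package main
+
+import "fmt"
+
+type mountState int
+
+const (
+	notMounted mountState = iota
+	uncertain  // e.g. NodeStage / NodePublish timed out, result unknown
+	mounted
+)
+
+// desiredVolume is one DSW entry: "this pod needs this volume mounted".
+type desiredVolume struct {
+	volumeName string
+	podUID     string
+}
+
+// actualVolume is one ASW entry: what kubelet believes is mounted on the node.
+type actualVolume struct {
+	globalMount mountState            // the optional global mount, at most one per volume
+	podMounts   map[string]mountState // pod UID -> the pod local mount of this volume
+}
+
+// reconcile moves ASW towards DSW: anything desired but not certainly mounted
+// is (re)mounted; anything mounted but no longer desired would be unmounted
+// (the unmount half is omitted here for brevity).
+func reconcile(dsw []desiredVolume, asw map[string]*actualVolume) {
+	for _, d := range dsw {
+		v, ok := asw[d.volumeName]
+		if !ok || v.podMounts[d.podUID] != mounted {
+			fmt.Printf("SetUp %s for pod %s (and MountDevice first, if supported)\n",
+				d.volumeName, d.podUID)
+		}
+	}
+}
+
+func main() {
+	asw := map[string]*actualVolume{
+		"pvc-1": {globalMount: mounted, podMounts: map[string]mountState{"pod-a": uncertain}},
+	}
+	dsw := []desiredVolume{{volumeName: "pvc-1", podUID: "pod-a"}}
+	reconcile(dsw, asw) // pod-a's mount is uncertain, so SetUp is called again
+}
+```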
 
 *Volume reconstruction* is a process where kubelet tries to create a single
 valid `PersistentVolumeSpec` or `VolumeSpec` for a volume from the OS.
@@ -203,10 +229,11 @@ the volume (i.e. to call `volumePlugin.TearDown` + `UnmountDevice` calls).
 
 Today, kubelet populates VolumeManager's DSW first, from static Pods and
 pods received from the API server. ASW is populated from the OS
-after DSW is complete and **only volumes missing in DSW are added there**.
-In other words, kubelet reconstructs only the volumes for Pods that were running
-before kubelet started, but they were delered when kubelet started. (If the pod
-was Running, its volumes would be in DSW).
+after DSW is fully populated (`hasAddedPods==true`) and **only volumes missing
+in DSW are added there**. In other words, kubelet reconstructs only the volumes
+for Pods that were running, but were deleted from the API server before kubelet
+started. (If the pod is still in the API server and Running, its volumes would
+be in DSW.)
 
 We assumed that this was enough, because if a volume is in DSW, the
 VolumeManager will try to mount the volume, and it will eventually reach ASW.
@@ -230,15 +257,15 @@ nitty-gritty.
 
 We propose to reverse the kubelet startup process.
 
 1. Quickly reconstruct ASW from the OS and add **all** found volumes to ASW
-   when kubelet starts. "Quickly" means the process should look only at the OS
-   and files/directories in `/var/lib/kubelet/` and it should not
-   require the API server or any network calls. Esp. the API server may
+   when kubelet starts as *uncertain*. "Quickly" means the process should look
+   only at the OS and files/directories in `/var/lib/kubelet/pods` and it
+   should not require the API server or any network calls. Notably, the API server may
    not be available at this stage of kubelet startup.
-2. Once all mounted volumes are in ASW, start populating DSW from static Pods
-   and Pods received from the API server.
-3. When connection to the API server becomes available, complete information
-   in ASW with data from the API server (e.g. from `node.status`). This
-   typically happens in parallel to the previous step.
+2. In parallel to 1., start the DSWP and populate DSW from the API server and
+   static pods.
+3. When the connection to the API server becomes available, complete the
+   reconstructed information in ASW with data from the API server (e.g. from
+   `node.status`). This typically happens in parallel to the previous step.
 
 Benefits:
 
@@ -265,9 +292,12 @@ to split the feature, we propose feature gate `NewVolumeManagerReconstruction`.
 
 (This is not a new story, we want to keep this behavior)
 
 As a cluster admin, I want kubelet to resume where it stopped when it was
-restarted or its machine was rebooted. It must be able to recognize what
-happened in the meantime and either unmount any volumes of Pods that were
-deleted in the API server or mount volumes for newly created Pods.
+restarted or its machine was rebooted, so I don't need to clean up / unmount
+any volumes manually.
+
+It must be able to recognize what happened in the meantime and either unmount
+any volumes of Pods that were deleted in the API server or mount volumes for
+newly created Pods.
 
 ### Notes/Constraints/Caveats (Optional)
 
@@ -291,35 +321,29 @@ that is starting.
 
 We don't know what other kubelet configurations people use, so we decided to
 write a KEP and move the new VolumeManager startup behind a feature gate.
 
-
 ## Design Details
 
 ### Proposed VolumeManager startup
 
-When kubelet starts:
-
-1. VolumeManager starts populating DSW from PodManager, who reads Pods from
-   static pods and/or the API server.
-
-2. In parallel to 1., VolumeManager scans `/var/lib/kubelet/pods/*` and
-   reconstructs *all* found volumes and adds them to ASW as *uncertain*. I.e.
-   depending on DSW, the VolumeManager will either call `volumePlugin.SetUp()`
-   to make sure the volume is fully mounted or it will call
-   `volumePlugin.TearDown()` to make sure the volume is unmounted. Only
-   information that is available in the Pod directory on the disk are
-   reconstructed into ASW at this point.
-   * Since the volume reconstruction can be imperfect and can miss `devicePath`,
-     VolumeManager adds all reconstructed volumes to `volumesNeedDevicePath`
-     array, to finish their reconstruction from `node.status.volumesAttached`
-   * All volumes that failed reconstruction are added to
-     `volumesFailedReconstruction` list.
-
-
-After **ASW** is populated, VolumeManager starts its reconciliation loop, i.e.
-starts comparing ASW and DSW and periodically calls:
+When kubelet starts, VolumeManager starts the DSWP and the reconciler
+[in parallel](https://github.com/kubernetes/kubernetes/blob/575616cc72dbfdd070ead81ec29c0d4f00226487/pkg/kubelet/volumemanager/volume_manager.go#L288-L292).
+
+However, before it starts reconciling DSW and ASW, the reconciler scans
+`/var/lib/kubelet/pods/*`, reconstructs **all** found volumes and adds them to
+ASW as *uncertain*. Only information that is available in the Pod directory on
+the disk is reconstructed into ASW at this point.
+* Since the volume reconstruction can be imperfect and can miss `devicePath`,
+  VolumeManager adds all reconstructed volumes to the `volumesNeedDevicePath`
+  array, to finish their reconstruction from `node.status.volumesAttached`
+  later.
+* All volumes that failed reconstruction are added to the
+  `volumesFailedReconstruction` list.
+
+After **ASW** is populated, the reconciler starts its
+[reconciliation loop](https://github.com/kubernetes/kubernetes/blob/cca3d557e6ff7f265eca8517d7c4fa719077c8d1/pkg/kubelet/volumemanager/reconciler/reconciler_new.go#L33-L69):
 
 1. `mountOrAttachVolumes()` - mounts (and attaches, if necessary) volumes that
    are in DSW, but not in ASW. This can happen even before DSW is fully
-   populated, because ASW is fully populated at this point.
+   populated.
 
 2. `updateReconstructedDevicePaths()` - once kubelet gets connection to the API
    server and reads its own `Node.status`, volumes in `volumesNeedDevicePath`
@@ -329,11 +353,11 @@ starts comparing ASW and DSW and periodically calls:
 
 3. (Only once): Add all reconstructed volumes to `node.status.volumesInUse`.
 
-4. Only after DSW was fully populated (i.e. it contains at least Pods that were
-   Running when kubelet started), **and** `devicePaths` were populated from
+4. Only after DSW has been fully populated (i.e. VolumeManager can tell whether
+   a volume is really needed or not), **and** `devicePaths` were populated from
    `node.status`, VolumeManager can start unmounting volumes and calls:
-   1. `unmountVolumes()` - unmounts (`TearDown`) pod local volumes that are in
-      ASW and are not in DSW.
+   1. `unmountVolumes()` - unmounts (`TearDown`) pod local volume mounts that
+      are in ASW and are not in DSW.
    2. `unmountDetachDevices()` - unmounts (`UnmountDevice`) global volume mounts
      of volumes that are in ASW and are not in DSW.
    3. `cleanOrphanVolumes()` - tries to clean up `volumesFailedReconstruction`.
@@ -341,25 +365,23 @@ volume, because kubelet failed to reconstruct the volume spec from
      `/var/lib/kubelet/pods//volumes/xyz`. Kubelet at least tries to unmount
      the directory and clean up any orphan files there.
+      This happens only once; `volumesFailedReconstruction` is cleared
+      afterwards.
 
 Note that e.g. `mountOrAttachVolumes` can call `volumePlugin.MountDevice` /
-`SetUp()` on a reconstructed volume (because it's *uncertain*) and finally
-update ASW, while the VolumeManager is still waiting for the API server to
-update `devicePath` of the same volume in ASW (step 2. above).
-We made sure that `updateReconstructedDevicePaths` will update the
-`devicePath` only for *uncertain* volumes, not to overwrite the *certain* ones.
+`SetUp()` on a reconstructed volume (because it was added to ASW as *uncertain*)
+and finally update ASW, while the VolumeManager is still waiting for the API
+server to update `devicePath` of the same volume in ASW (step 2. above). We made
+sure that `updateReconstructedDevicePaths()` updates the `devicePath` only for
+volumes that are still *uncertain*, so that it does not overwrite the *certain*
+ones.
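+
+The ordering above can be summarized in the following sketch. It is purely
+illustrative: only the step functions named in the list above exist in kubelet,
+while the `reconciler` struct, its flags and `reconstructVolumes()` are
+simplified pseudocode.
+
+```go
+// Illustrative ordering of the proposed reconciler loop; the real
+// implementation is in reconciler_new.go (linked above).
+package sketch
+
+type reconciler struct {
+	dswPopulated       bool // set once DSWP has read all Pod sources
+	devicePathsUpdated bool // set once node.status.volumesAttached was processed
+	volumesReported    bool
+	orphansCleaned     bool
+}
+
+func (rc *reconciler) reconstructVolumes()             {} // scan /var/lib/kubelet/pods/*, add to ASW as uncertain
+func (rc *reconciler) mountOrAttachVolumes()           {}
+func (rc *reconciler) updateReconstructedDevicePaths() {}
+func (rc *reconciler) unmountVolumes()                 {}
+func (rc *reconciler) unmountDetachDevices()           {}
+func (rc *reconciler) cleanOrphanVolumes()             {}
+
+func (rc *reconciler) run() {
+	rc.reconstructVolumes() // populate ASW first, everything *uncertain*
+
+	for {
+		rc.mountOrAttachVolumes() // 1. safe even before DSW is fully populated
+
+		rc.updateReconstructedDevicePaths() // 2. no-op until node.status is readable
+
+		if !rc.volumesReported {
+			// 3. (only once) add reconstructed volumes to node.status.volumesInUse
+			//    (the actual reporting is omitted in this sketch)
+			rc.volumesReported = true
+		}
+
+		if rc.dswPopulated && rc.devicePathsUpdated {
+			// 4. unmounting is safe only when kubelet knows which volumes
+			//    are really not needed any longer
+			rc.unmountVolumes()
+			rc.unmountDetachDevices()
+			if !rc.orphansCleaned {
+				rc.cleanOrphanVolumes()
+				rc.orphansCleaned = true
+			}
+		}
+		// (sleep / wakeup handling omitted)
+	}
+}
+```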
 
 ### Old VolumeManager startup
 
-When kubelet starts:
-
-1. VolumeManager starts populating DSW from PodManager, who reads Pods from
-   static pods and/or the API server.
-
-While DSW populator runs in the another thread, VolumeManager starts its
-reconciliation loop. Note that ASW is empty at this point + DSW is being
-populated. It starts comparing ASW and DSW and periodically following calls:
+When kubelet starts, VolumeManager starts the DSWP and the reconciler
+[in parallel](https://github.com/kubernetes/kubernetes/blob/575616cc72dbfdd070ead81ec29c0d4f00226487/pkg/kubelet/volumemanager/volume_manager.go#L288-L292).
+[The reconciler](https://github.com/kubernetes/kubernetes/blob/44b72d034852eb6da8916c82ce722af604b196c5/pkg/kubelet/volumemanager/reconciler/reconciler.go#L33-L45)
+then periodically does:
 
 1. `unmountVolumes()` - unmounts (`TearDown`) pod local volumes that are in
    ASW and are not in DSW. Since the ASW is initially empty, this call becomes
    useful only later.
 2. `mountOrAttachVolumes()` - mounts (and attaches, if necessary) volumes that
    are in DSW, but not in ASW. Initially, all volumes in DSW are mounted /
    attached, because ASW is empty. This is actually how ASW gets populated.
 3. `unmountDetachDevices()` - unmounts (`UnmountDevice`) global volume mounts
    of volumes that are in ASW and are not in DSW.
-4. Only once after DSW is fully populated (i.e. it contains at least Pods that
-   were Running when kubelet started):
-   1. VolumeManager scans `/var/lib/kubelet/pods/*`
+4. Only once after DSW is fully populated:
+   1. VolumeManager calls `sync()`, which scans `/var/lib/kubelet/pods/*`
      and reconstructs **only** volumes that are not in DSW. (If a volume is in
     DSW, we expect that it reaches ASW via `mountOrAttachVolumes()` above.)
      * `devicePath` of reconstructed volumes is populated from
@@ -416,7 +437,23 @@ This can inform certain test coverage improvements that we want to do before
 extending the production code to implement this enhancement.
 -->
 
-- ``: `` - ``
+All files are in `k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/`,
+coverage data taken on 2023-01-25.
+
+The old reconciler + reconstruction:
+- `reconciler.go`: `77.1%`
+- `reconciler_new.go`: `73.3%`
+- `reconstruct.go`: `75.7%`
+
+The new reconciler + reconstruction:
+- `reconciler_new.go`: `73.3%`
+- `reconstruct_new.go`: `66.2%`
+
+Common code:
+- `reconciler_common.go`: `86.2%`
+- `reconstruct_common.go`: `75.8%`
+
+TODO: check why the new coverage is actually lower.
 
 ##### Integration tests
 
@@ -428,7 +465,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
 https://storage.googleapis.com/k8s-triage/index.html
 -->
 
-- :
+None.
 
 ##### e2e tests
 
@@ -442,71 +479,40 @@ https://storage.googleapis.com/k8s-triage/index.html
 We expect no non-infra related flakes in the last month as a GA graduation
 criteria.
 -->
 
-- :
-
-### Graduation Criteria
-
 ### Upgrade / Downgrade Strategy
 
@@ -522,6 +528,9 @@ enhancement:
   cluster required to make on upgrade, in order to make use of the enhancement?
 -->
 
+The feature is enabled by a single feature gate on kubelet and does not require
+any special upgrade / downgrade handling.
+
 ### Version Skew Strategy
 
+The feature affects only how kubelet starts. It has no implications for other
+Kubernetes components or other kubelets. Therefore, we don't see any issues
+with any version skew.
+
 ## Production Readiness Review Questionnaire
 
-- [ ] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name:
-  - Components depending on the feature gate:
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: NewVolumeManagerReconstruction
+  - Components depending on the feature gate: kubelet
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
@@ -596,6 +609,8 @@ Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
 
+It changes how kubelet starts. It has no visible effect on any API object.
+
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
+The feature can be disabled without any issues.
+
 ###### What happens if we reenable the feature if it was previously rolled back?
 
+Nothing interesting.
+
 ###### Are there any tests for feature enablement/disablement?
 
+We have unit tests that run with the feature both disabled and enabled.
+The feature affects only kubelet startup and we don't change the format of data
+present in the OS (mount table, content of `/var/lib/kubelet/pods/`), so we
+don't have automated tests that start kubelet with the feature enabled and then
+disable it, or vice versa.
+
 ### Rollout, Upgrade and Rollback Planning
 
+In theory, if the new reconstruction is buggy, kubelet either does not come up
+at all (crashes, hangs) or does not unmount volumes that it should unmount.
+
 ###### What specific metrics should inform a rollback?
 
+TODO. VolumeManager startup, especially volume reconstruction, does not have
+any metrics.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
+TODO
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
+No.
+
 ### Monitoring Requirements
 
+TODO: whole chapter.
+
 ###### How can an operator determine if the feature is in use by workloads?
 
+Not applicable - the feature only changes how kubelet starts and is not used by
+workloads directly.
+
 ### Scalability
 
+No.
+
 ###### Will enabling / using this feature result in introducing new API types?
 
+No.
+
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
+No.
+
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
+No.
+
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
+Kubelet startup could be slower, but that would be a bug. In theory, the old
+and the new VolumeManager startup do the same things, just in a different
+order.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
+No.
+
 ### Troubleshooting