KEP-3756: Add volume reconstruction KEP #3763
Provisional PR: kubernetes/kubernetes#115268
The overall design looks good to me. The proposed mechanism was already reviewed as part of the SELinux KEP review, so it should be good to go. lgtm
We needed to add
[a complex workaround](https://github.com/kubernetes/kubernetes/pull/110670)
to actually unmount a volume if it's initially in DSW, but user deletes all
Pods that need it before the volume reaches ASW.
Since this feature goes directly to beta, we will have to test whether the new mechanism fixes the bug that PR #110670 fixes, or at least coexists with it without the two stepping on each other's toes.
Do we have an e2e / integration test for #110670?
No, I don't think I wrote any e2e test for this one. We only have unit tests.
We still have the unit test, I think.
/assign @sunnylovestiramisu
I filled in all mandatory chapters for beta (notice a new "Observability" chapter proposing some metrics), please let me know if I missed anything.
There are also some TODO items; ideas would be welcome. (I'm going to dive into unit test coverage.)
The lower unit test coverage of the new reconciler is now explained in the KEP; it will be fine once the new VolumeManager is enabled by default.
I also added a CI flake investigation.
Both are for the old reconstruction code, we don't have a job that enables
alpha features + runs `[Disruptive]` tests.

Recent results:
Maybe the details of the test failures would be better tracked in a bug.
/approve
/lgtm
* `orphaned_volumes_cleanup_errors_total`: nr. of reports
  like `orphaned pod "<uid>" found, but XYZ failed`
  ([example](https://github.com/kubernetes/kubernetes/blob/4fac7486d41c033d6bba9dfeda2356e8189035cd/pkg/kubelet/kubelet_volumes.go#L215)).
  These messages can be a symptom of failed reconstruction (e.g.
  [#105536](https://github.com/kubernetes/kubernetes/issues/105536)).
  Note that kubelet logs this periodically and bumping this metric periodically
  would not be useful.
  [`cleanupOrphanedPodDirs`](https://github.com/kubernetes/kubernetes/blob/4fac7486d41c033d6bba9dfeda2356e8189035cd/pkg/kubelet/kubelet_volumes.go#L168)
  needs to be changed to collect errors found during
  one `/var/lib/kubelet/pods/` check and report collected "nr of errors during
  the last housekeeping sweep (every 2 seconds)".
  * TODO: do we want to have a label to distinguish each error reason,
    e.g. "Pod found, but volumes are still mounted on disk" from say
    "orphaned pod %q found, but error occurred during reading of
    volume-subpaths dir from disk"?
To expand on the TODO item with an example: `cleanupOrphanedPodDirs` can fail with:
* Orphaned pod found, but volumes are not cleaned up
* orphaned pod %q found, but error occurred during reading the pod dir from disk
* Orphaned pod found, but failed to remove volumes subdir
* orphaned pod %q found, but error occurred when trying to remove subdir %q: %v
* orphaned pod %q found, but error occurred when trying to remove the pod directory: %v
* orphaned pod %q found, but failed to rmdir() volume at path %v: %v
* orphaned pod %q found, but error occurred during reading of volume-subpaths dir from disk: %v
* orphaned pod %q found, but error occurred when trying to remove the volumes dir: %v

And potentially many more. Do we want to have a `reason` label with some enumerated values that would allow users to distinguish the errors from each other, or is it useless?
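To make the question concrete, here is a rough sketch of what a bounded `reason` label could look like, again in plain Go with no metrics library. The type `cleanupReason`, its constant values, and `errorsByReason` are all hypothetical names chosen for illustration; the KEP does not define them.

```go
package main

import "fmt"

// cleanupReason is a hypothetical, fixed set of label values. Keeping the
// set small and enumerated avoids putting pod UIDs or paths into labels.
type cleanupReason string

const (
	reasonVolumesNotCleanedUp cleanupReason = "volumes_not_cleaned_up"
	reasonReadPodDir          cleanupReason = "read_pod_dir_failed"
	reasonRemoveSubdir        cleanupReason = "remove_subdir_failed"
	reasonRemovePodDir        cleanupReason = "remove_pod_dir_failed"
)

// errorsByReason counts errors per reason for one sweep.
type errorsByReason map[cleanupReason]int

// inc records one error with the given reason.
func (e errorsByReason) inc(r cleanupReason) { e[r]++ }

// total sums all reasons, which is the value an unlabeled metric would report.
func (e errorsByReason) total() int {
	sum := 0
	for _, n := range e {
		sum += n
	}
	return sum
}

func main() {
	counts := errorsByReason{}
	counts.inc(reasonVolumesNotCleanedUp)
	counts.inc(reasonVolumesNotCleanedUp)
	counts.inc(reasonRemovePodDir)
	fmt.Println(counts[reasonVolumesNotCleanedUp], counts[reasonRemovePodDir], counts.total())
}
```

The trade-off is that every new failure message in `cleanupOrphanedPodDirs` would need a matching enum value, while an unlabeled metric only needs the total.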
Thanks for the PRR updates. Approving PRR.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, jsafrane, msau42
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/lgtm
One-line PR description: Add volume reconstruction KEP.
Issue link: Robust VolumeManager reconstruction after kubelet restart #3756
Note to reviewers: We're going directly to beta, because we had alpha phase in #1710 and we realized that VolumeManager rework is useful outside of the SELinux feature.
/sig storage