From 75926dd3dbd679ce0c80d6101c76a642c3fbd36c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 17 May 2024 18:28:59 +0800 Subject: [PATCH 01/17] initial version of "StatefulSet Support for Updating Volume Claim Template" --- .../README.md | 952 ++++++++++++++++++ .../kep.yaml | 50 + 2 files changed, 1002 insertions(+) create mode 100644 keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md create mode 100644 keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md new file mode 100644 index 00000000000..6a84f33f5b7 --- /dev/null +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -0,0 +1,952 @@ + +# KEP-NNNN: StatefulSet Support for Updating Volume Claim Template + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Updated Reconciliation Logic](#updated-reconciliation-logic) + - [What PVC is capatible](#what-pvc-is-capatible) + - [Collected PVC Status](#collected-pvc-status) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) + - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers) + - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider) + - [Story 4: Shinking the PV by Re-creating PVC](#story-4-shinking-the-pv-by-re-creating-pvc) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing 
updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [extensively validate the updated volumeClaimTemplate](#extensively-validate-the-updated-volumeclaimtemplate) + - [Only support for updating volumeClaimTemplate.spec.resources.requests.storage](#only-support-for-updating-volumeclaimtemplatespecresourcesrequestsstorage) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Kubernetes does not support the modification of the `volumeClaimTemplate` of a StatefulSet currently. +This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplate`, +automatically updating the associated PersistentVolumeClaim objects in-place if applicable. 
+Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName` +fields can be updated in-place. +For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy. +All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates +to honor any dependencies between them. + +## Motivation + + + +Currently there are very few things that users can do to update the volumes of +their existing StatefulSet deployments. +They can only expand the volumes, or modify them with VolumeAttributesClass +by updating individual PersistentVolumeClaim objects as an ad-hoc operation. +When the StatefulSet scales up, the new PVC(s) will be created with the old +config and this again needs manual intervention. +Modifying immutable parameters, shinking, or even switch to another +storage provider is not possible currently. +This brings many headaches in a continuously evolving environment. + +### Goals + + +* Allow users to update the `volumeClaimTemplate` of a `StatefulSet` in place. +* Automatically update the associated PersistentVolumeClaim objects in-place if applicable. +* Support updating PersistentVolumeClaim objects with `OnDelete` strategy. +* Coordinate updates to `Pod` and PersistentVolumeClaim objects. +* Provide accurate status and error messages to users when the update fails. + +### Non-Goals + + +* Support automatic rolling update of PersistentVolumeClaim. +* Validate the updated `volumeClaimTemplate` as how PVC update does. +* Update ephemeral volumes. + + +## Proposal + + +1. Change API server to allow any updates to `volumeClaimTemplate` of a StatefulSet. + +2. Modify StatefulSet controller to add PVC reconciliation logic. + +3. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to + specify how to coordinate the update of PVCs and Pods. Possible values are: + - `OnDeleteAsync`: the default value, preserve the current behavior. 
+ - `OnDeleteLockStep`: update PVCs first, then update Pods. See below for details. + +4. Collect the status of managed PVCs, and show them in the StatefulSet status. + +### Updated Reconciliation Logic + +How to update PVCs: +1. If `volumeClaimTemplate` and actual PVC only differ in mutable fields + (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently), + update the PVC in-place to the extent possible. + Do not perform the update that will be rejected by API server, such as + decreasing the storage size below its current status. + Note that decrease the size can help recover from a failed expansion if + `RecoverVolumeExpansionFailure` feature gate is enabled. + +2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), + do nothing. But when recreating a Pod and the corresponding PVC is deleting, + wait for the deletion then create a new PVC with the current template + together with the new Pod. + +When to update PVCs: +1. Before recreate the pod, additionally check that the PVC is + [capatible](#what-pvc-is-capatible) with the new `volumeClaimTemplate`. + If not, update the PVC after old Pod deleted, before creating new pod, + or if update is not possible: + - If `volumeClaimUpdateStrategy` is `OnDeleteLockStep`, + wait for the user to delete the old PVC manually before delete the old pod. + - If `volumeClaimUpdateStrategy` is `OnDeleteAsync`, + the diff is ignored and the pod recreation proceeds. + +2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, + The PVCs should be updated just like Pods would. A replica is considered ready + if all its volumes are capatible with the new `volumeClaimTemplate`. + `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. + e.g.: + - If `.spec.updateStrategy.type` is `RollingUpdate`, + update the PVCs in the order from the largest ordinal to the smallest. 
+ Only proceed to the next ordinal when all the PVCs of the previous ordinal + are capatible with the new `volumeClaimTemplate`. + - If `.spec.updateStrategy.type` is `OnDelete`, + Only update the PVC when the Pod is deleted. + + +### What PVC is capatible + +TODO + +### Collected PVC Status + +TODO + +### User Stories (Optional) + + + +#### Story 1: Batch Expand Volumes + +TODO + +#### Story 2: Migrating Between Storage Providers + +TODO + +#### Story 3: Migrating Between Different Implementations of the Same Storage Provider + +TODO + +#### Story 4: Shinking the PV by Re-creating PVC + +TODO + +### Notes/Constraints/Caveats (Optional) + + + +`volumeClaimUpdateStrategy` is introduce to keep capability of current deployed workloads. +StatefulSet currently accepts and uses existing PVCs that is not created by the controller, +So the `volumeClaimTemplate` and PVC can differ even before this enhancement. +Some users may choose to keep the PVCs of different replicas different. +We should not block the Pod updates for them. + +If `volumeClaimUpdateStrategy` is `OnDeleteAsync`, +then if the template and PVC differs other than mutable fields, and it is not deleting, +the PVC is not considered as managed by the StatefulSet. + +However, a workload may rely on some features provided by a specific PVC, +So we should provide a way to coordinate the update. +That's why we also need `OnDeleteLockStep`. + +We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template. +In a stable state, most operations are possible, and we are not actively fixing something. + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: StatefulSetUpdateVolumeClaimTemplate + - Components depending on the feature gate: + - kube-apiserver + - kube-controller-manager +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + +If the PVC capacity is smaller than that in the template, +the PVC will be expanded immediately after the feature is enbled. +This should be rare, the user must have created the PVC before the StatefulSet for this to happen. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? 
+ + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+### extensively validate the updated `volumeClaimTemplate`
+
+[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplate`,
+e.g., preventing a decrease of the storage size, or preventing expansion if the storage class does not support it.
+However, this has several drawbacks:
+* Not reverting the `volumeClaimTemplate` when rolling back the StatefulSet is confusing.
+* This can be a barrier when recovering from a failed update.
+* The validation is racy, especially when recovering from a failed expansion.
+  We still need to consider most abnormal cases even if we do those validations.
+* This does not match the pattern of existing behaviors.
+  That is, the controller should take the expected state, and retry as needed to reach that state.
+  For example, StatefulSet will not reject an invalid `serviceAccountName`.
+* `volumeClaimTemplate` is also used when creating new PVCs, so even if the existing PVCs cannot be updated,
+  a user may still want to affect new PVCs.
+
+### Only support for updating `volumeClaimTemplate.spec.resources.requests.storage`
+
+[KEP-0661] only enables expanding the volume. However, because the StatefulSet can take pre-existing PVCs,
+we still need to consider what to do when the template and PVC don't match.
+The complexity of this proposal will not decrease much if we only support expanding the volume.
+
+By enabling arbitrary updates to the `volumeClaimTemplate`,
+we just acknowledge and officially support this use case.
+
+[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml b/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml
new file mode 100644
index 00000000000..b922d003c9c
--- /dev/null
+++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml
@@ -0,0 +1,50 @@
+title: StatefulSet Support for Updating Volume Claim Template
+kep-number: NNNN
+authors:
+  - "@huww98"
+owning-sig: sig-storage
+participating-sigs:
+  - sig-apps
+status: provisional
+creation-date: 2024-05-17
+reviewers:
+  - "@kow3ns"
+  - "@gnufied"
+  - "@msau42"
+  - "@xing-yang"
+approvers:
+  - "@kow3ns"
+  - "@xing-yang"
+
+see-also:
+  - "/keps/sig-storage/1790-recover-resize-failure"
+  - "/keps/sig-storage/3751-volume-attributes-class"
+replaces:
+  - "https://github.com/kubernetes/enhancements/pull/2842" # Previous attempt on 0661
+  - "https://github.com/kubernetes/enhancements/pull/3412" # Previous attempt on 0661
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.31"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone: + alpha: "v1.31" + beta: "v1.32" + stable: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: StatefulSetUpdateVolumeClaimTemplate + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: [] From 30ea829c01cf443138dde514b4dea11b29b7cd12 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 17 May 2024 20:31:59 +0800 Subject: [PATCH 02/17] what is capatible --- .../README.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 6a84f33f5b7..36cf07c680b 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -289,7 +289,13 @@ When to update PVCs: ### What PVC is capatible -TODO +A PVC is capatible with the template if: +- All the immutable fields match exactly; and +- `metadata.labels` and `metadata.annotations` of PVC is a superset of the template; and +- `status.capacity.storage` of PVC is greater than or equal to + the `spec.resources.requests.storage` of the template; and +- `status.currentVolumeAttributesClassName` of PVC is equal to + the `spec.volumeAttributesClassName` of the template. ### Collected PVC Status @@ -369,6 +375,10 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> +We can use Server Side Apply to update the PVCs in-place, +so that we will not interfere with the user's manual changes, +e.g. to `metadata.labels` and `metadata.annotations`. 
+
+### Test Plan
-`volumeClaimUpdateStrategy` is introduce to keep capability of current deployed workloads.
+`volumeClaimSyncStrategy` is introduced to keep compatibility with currently deployed workloads.
 StatefulSet currently accepts and uses existing PVCs that is not created by the controller,
 So the `volumeClaimTemplate` and PVC can differ even before this enhancement.
 Some users may choose to keep the PVCs of different replicas different.
 We should not block the Pod updates for them.
-If `volumeClaimUpdateStrategy` is `OnDeleteAsync`,
-then if the template and PVC differs other than mutable fields, and it is not deleting,
+If `volumeClaimSyncStrategy` is `Async`,
+then if the template and the PVC differ, and the PVC is not being deleted,
 the PVC is not considered as managed by the StatefulSet.
 However, a workload may rely on some features provided by a specific PVC,
 So we should provide a way to coordinate the update.
-That's why we also need `OnDeleteLockStep`.
+That's why we also need `LockStep`.
 We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template.
 In a stable state, most operations are possible, and we are not actively fixing something.
@@ -612,9 +618,10 @@ well as the [existing list] of feature gates.
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
-->
-If the PVC capacity is smaller than that in the template,
-the PVC will be expanded immediately after the feature is enbled.
-This should be rare, the user must have created the PVC before the StatefulSet for this to happen.
+If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values),
+the behavior of StatefulSet controller is almost the same as before.
+Except that if the PVC is deleting when performing rolling update, the controller will wait for the deletion
+before creating the new Pod.
This may bring additional delay if the PVC deletion is somehow blocked.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

From df261df589a75811868f64fede3047a7be96fa6d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Sun, 19 May 2024 17:42:20 +0800
Subject: [PATCH 04/17] User stories

---
 .../README.md | 31 ++++++++++++++++---
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md
index 5e12fbc1df1..ff795cb6563 100644
--- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md
+++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md
@@ -85,6 +85,7 @@ tags, and then generate with `hack/update-toc.sh`.
  - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers)
  - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider)
  - [Story 4: Shinking the PV by Re-creating PVC](#story-4-shinking-the-pv-by-re-creating-pvc)
+  - [Story 5: Asymmetric Replicas](#story-5-asymmetric-replicas)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
@@ -318,19 +319,41 @@ bogged down.

#### Story 1: Batch Expand Volumes

-TODO
+We run a CI/CD system, and end-to-end automation is desired.
+To expand the volumes managed by a StatefulSet,
+we can just use the same pipeline that we already use to update the Pods.
+All the existing test, review, approval, and rollback processes can be reused.

#### Story 2: Migrating Between Storage Providers

-TODO
+We decide to switch from home-grown local storage to the storage provided by a cloud provider.
We cannot afford any downtime, so we don't want to delete and recreate the StatefulSet.
+Our app can automatically rebuild the data in the new storage from other replicas.
+So we update the `volumeClaimTemplate` of the StatefulSet,
+delete the PVC and Pod of one replica, let the controller re-create them,
+and then monitor the rebuild process.
+Once the rebuild completes successfully, we proceed to the next replica.

#### Story 3: Migrating Between Different Implementations of the Same Storage Provider

-TODO
+Our storage provider has a new version that provides new features, but cannot be upgraded in place.
+We can prepare some new PersistentVolumes using the new version, but referencing the same disks
+from the provider as the in-use PVs.
+Then the same update process as in Story 2 can be used.
+Although the PVCs are recreated, the data is preserved, so no rebuild is needed.

#### Story 4: Shinking the PV by Re-creating PVC

-TODO
+After running our app for a while, we optimize the data layout and reduce the required storage size.
+Now we want to shrink the PVs to save cost.
+The same process as in Story 2 can be used.
+
+#### Story 5: Asymmetric Replicas
+
+The replicas of our StatefulSet are not identical, so we still want to update
+each PVC manually and separately.
+Possibly we also update the `volumeClaimTemplate` for new replicas,
+but we don't want the controller to interfere with the existing replicas.
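Putting the stories together, a StatefulSet using this enhancement might look like the following sketch. The `volumeClaimUpdateStrategy` and `volumeClaimSyncStrategy` fields and their values are proposals of this KEP, not part of the current apps/v1 API:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  serviceName: web
  selector:
    matchLabels:
      app: web
  volumeClaimUpdateStrategy: InPlace   # proposed field; alternative: OnDelete
  volumeClaimSyncStrategy: LockStep    # proposed field; alternative: Async
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.k8s.io/nginx-slim:0.8
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi   # e.g. raised from 10Gi; rolled out per the strategy above
```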
### Notes/Constraints/Caveats (Optional) From 2e1de4e6f8fddf676b40667c624134037180e499 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 18:01:02 +0800 Subject: [PATCH 05/17] we are already waiting for PVC deletion --- .../NNNN-stateful-set-update-claim-template/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index ff795cb6563..a177a86d766 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -269,7 +269,11 @@ How to update PVCs: 2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), do nothing. But when recreating a Pod and the corresponding PVC is deleting, wait for the deletion then create a new PVC with the current template - together with the new Pod. + together with the new Pod (already implemented). + When to update PVCs: 1. Before recreate the pod, additionally check that the PVC is @@ -641,10 +645,9 @@ well as the [existing list] of feature gates. Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> +No. If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values), the behavior of StatefulSet controller is almost the same as before. -Except that if the PVC is deleting when performing rolling update, the controller will wait for the deletion -before creating the new Pod. This may bring additional delay if the PVC deletion is somehow blocked. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? 
From a5e98fa254ac977de9f9e07584e8a3b69cf1cbb5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 19:19:07 +0800 Subject: [PATCH 06/17] status --- .../README.md | 47 +++++++++++++++---- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index a177a86d766..803eb8437f1 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -78,7 +78,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Non-Goals](#non-goals) - [Proposal](#proposal) - [Updated Reconciliation Logic](#updated-reconciliation-logic) - - [What PVC is capatible](#what-pvc-is-capatible) + - [What PVC is compatible](#what-pvc-is-compatible) - [Collected PVC Status](#collected-pvc-status) - [User Stories (Optional)](#user-stories-optional) - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) @@ -266,7 +266,7 @@ How to update PVCs: Note that decrease the size can help recover from a failed expansion if `RecoverVolumeExpansionFailure` feature gate is enabled. -2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), +2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible), do nothing. But when recreating a Pod and the corresponding PVC is deleting, wait for the deletion then create a new PVC with the current template together with the new Pod (already implemented). @@ -277,7 +277,7 @@ Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller cre When to update PVCs: 1. Before recreate the pod, additionally check that the PVC is - [capatible](#what-pvc-is-capatible) with the new `volumeClaimTemplate`. + [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. 
If not, update the PVC after old Pod deleted, before creating new pod, or if update is not possible: - If `volumeClaimSyncStrategy` is `LockStep`, @@ -287,20 +287,20 @@ When to update PVCs: 2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, The PVCs should be updated just like Pods would. A replica is considered ready - if all its volumes are capatible with the new `volumeClaimTemplate`. + if all its volumes are compatible with the new `volumeClaimTemplate`. `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. e.g.: - If `.spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. Only proceed to the next ordinal when all the PVCs of the previous ordinal - are capatible with the new `volumeClaimTemplate`. + are compatible with the new `volumeClaimTemplate`. - If `.spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. -### What PVC is capatible +### What PVC is compatible -A PVC is capatible with the template if: +A PVC is compatible with the template if: - All the immutable fields match exactly; and - `metadata.labels` and `metadata.annotations` of PVC is a superset of the template; and - `status.capacity.storage` of PVC is greater than or equal to @@ -310,7 +310,23 @@ A PVC is capatible with the template if: ### Collected PVC Status -TODO +For each PVC in the template: +- compatible: the number of PVCs that are compatible with the template. + These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`. +- updating: the number of PVCs that are being updated in-place. +- overSized: the number of PVCs that are over-sized. +- totalCapacity: the sum of `status.capacity` of all the PVCs. + +Some fields in the `status` are also updated to reflect the staus of the PVCs: +- readyReplicas: in addition to pods, also consider the PVCs status. 
A PVC is not ready if: + - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; + - `volumeClaimSyncStrategy` is `LockStep` and the PVC is not compatible with the template; +- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` +- currentRevision, updateRevision, currentReplicas, updatedReplicas + are updated to reflect the status of PVCs. + +With these changes, user can still use `kubectl rollout status` to monitor the update process, +both for in-place update and for the PVCs that need manual intervention. ### User Stories (Optional) @@ -382,9 +398,12 @@ However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`. -We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template. +We consider a StatefulSet in stable state if all the managed PVCs are compatible with the current template. In a stable state, most operations are possible, and we are not actively fixing something. +The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplate`, +so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. + ### Risks and Mitigations +When the `volumeClaimSyncStrategy` is set to `LockStep`, keeping PVCs that are +incompatible with the template is dangerous. This will block the Pod from being +recreated, and the workload will be unavailable if some Pods are evicted. +We should document this clearly and report the replica as not ready in the status +to warn the user. +this should only happen when the user manually updates the PVC, +or the `volumeClaimSyncStrategy` is updated to `LockStep` while the PVC is not compatible. + +TODO: Recover from failed in-place update (insufficient storage, etc.) +What else is needed in addition to revert the StatefulSet spec? 
## Design Details From 4c1c619fe866cfa801ed228122d3999e32e67e3a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 22:32:25 +0800 Subject: [PATCH 07/17] check compatible before advancing updatedReplicas --- .../README.md | 23 +++++++------------ 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 803eb8437f1..38b98a92173 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -276,25 +276,27 @@ Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller cre --> When to update PVCs: -1. Before recreate the pod, additionally check that the PVC is +1. Before advancing `status.updatedReplicas` to the next replica, + additionally check that the PVCs of the next replica are [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. If not, update the PVC after old Pod deleted, before creating new pod, or if update is not possible: + - If `volumeClaimSyncStrategy` is `LockStep`, - wait for the user to delete/update the old PVC manually before delete the old pod. + wait for the user to delete/update the old PVC manually. - If `volumeClaimSyncStrategy` is `Async`, - the diff is ignored and the pod recreation proceeds. + the diff is ignored and the normal rolling update proceeds. 2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, The PVCs should be updated just like Pods would. A replica is considered ready if all its volumes are compatible with the new `volumeClaimTemplate`. - `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. + `spec.ordinals` and `spec.updateStrategy.rollingUpdate.partition` are also respected. 
e.g.: - - If `.spec.updateStrategy.type` is `RollingUpdate`, + - If `spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. Only proceed to the next ordinal when all the PVCs of the previous ordinal are compatible with the new `volumeClaimTemplate`. - - If `.spec.updateStrategy.type` is `OnDelete`, + - If `spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. @@ -320,7 +322,6 @@ For each PVC in the template: Some fields in the `status` are also updated to reflect the status of the PVCs: - readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; - - `volumeClaimSyncStrategy` is `LockStep` and the PVC is not compatible with the template; - availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` - currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs. @@ -417,14 +418,6 @@ How will UX be reviewed, and by whom? Consider including folks who also work outside the SIG or subproject. --> -When the `volumeClaimSyncStrategy` is set to `LockStep`, keeping PVCs that are -incompatible with the template is dangerous. This will block the Pod from being -recreated, and the workload will be unavailable if some Pods are evicted. -We should document this clearly and report the replica as not ready in the status -to warn the user. -this should only happen when the user manually updates the PVC, -or the `volumeClaimSyncStrategy` is updated to `LockStep` while the PVC is not compatible. - TODO: Recover from failed in-place update (insufficient storage, etc.) What else is needed in addition to reverting the StatefulSet spec?
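For concreteness, a StatefulSet using both proposed knobs could look like the sketch below. The field names follow this KEP's proposal and the exact API shape is still subject to review; unrelated fields (Pod template, `serviceName`, `selector`) are omitted for brevity:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  # Proposed field: how PVC updates are applied.
  # OnDelete (default) keeps today's behavior; InPlace additionally patches
  # mutable PVC fields (size, volumeAttributesClassName, labels, annotations).
  volumeClaimUpdateStrategy: InPlace
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Proposed field: Async (default) keeps Pod and PVC rollouts independent;
      # LockStep brings the PVCs of a replica to a compatible state before its Pod.
      volumeClaimSyncStrategy: LockStep
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi  # e.g. bumped from 10Gi to trigger an in-place expansion
```

With this spec, editing `storage` would roll the expansion out replica by replica, from the largest ordinal to the smallest, the same way a Pod template change is rolled out today.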
From 15ae1f64338df289f8015dbea3e0427456591ffc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Wed, 22 May 2024 20:10:24 +0800 Subject: [PATCH 08/17] add Kubernetes API Changes section --- .../README.md | 52 +++++++++++-------- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 38b98a92173..81d53fc3adc 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -77,9 +77,9 @@ tags, and then generate with `hack/update-toc.sh`. - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal) + - [Kubernetes API Changes](#kubernetes-api-changes) - [Updated Reconciliation Logic](#updated-reconciliation-logic) - [What PVC is compatible](#what-pvc-is-compatible) - - [Collected PVC Status](#collected-pvc-status) - [User Stories (Optional)](#user-stories-optional) - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers) - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider) @@ -242,17 +242,42 @@ nitty-gritty. 2. Modify StatefulSet controller to add PVC reconciliation logic. -3. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to +3. Collect the status of managed PVCs, and show them in the StatefulSet status. + +### Kubernetes API Changes + +Changes to StatefulSet `spec`: + +1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to specify how to coordinate the update of PVCs and Pods. Possible values are: - `OnDelete`: the default value, only update the PVC when the old PVC is deleted. - `InPlace`: update the PVC in-place if possible. Also includes the `OnDelete` behavior. -4. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy` +2.
Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy` to specify how to update PVCs and Pods. Possible values are: - `Async`: the default value, preserve the current behavior. - `LockStep`: update PVCs first, then update Pods. See below for details. -5. Collect the status of managed PVCs, and show them in the StatefulSet status. +Changes to StatefulSet `status`: + +Additionally collect the status of managed PVCs, and show them in the StatefulSet status. + +For each PVC in the template: +- compatible: the number of PVCs that are compatible with the template. + These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`. +- updating: the number of PVCs that are being updated in-place. +- overSized: the number of PVCs that are over-sized. +- totalCapacity: the sum of `status.capacity` of all the PVCs. + +Some fields in the `status` are also updated to reflect the status of the PVCs: +- readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: + - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; +- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` +- currentRevision, updateRevision, currentReplicas, updatedReplicas + are updated to reflect the status of PVCs. + +With these changes, users can still use `kubectl rollout status` to monitor the update process, +both for in-place update and for the PVCs that need manual intervention. ### Updated Reconciliation Logic @@ -310,25 +335,6 @@ A PVC is compatible with the template if: - `status.currentVolumeAttributesClassName` of PVC is equal to the `spec.volumeAttributesClassName` of the template. -### Collected PVC Status - -For each PVC in the template: -- compatible: the number of PVCs that are compatible with the template. - These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
-- updating: the number of PVCs that are being updated in-place. -- overSized: the number of PVCs that are over-sized. -- totalCapacity: the sum of `status.capacity` of all the PVCs. - -Some fields in the `status` are also updated to reflect the staus of the PVCs: -- readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: - - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; -- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` -- currentRevision, updateRevision, currentReplicas, updatedReplicas - are updated to reflect the status of PVCs. - -With these changes, user can still use `kubectl rollout status` to monitor the update process, -both for in-place update and for the PVCs that need manual intervention. - ### User Stories (Optional) -# KEP-NNNN: StatefulSet Support for Updating Volume Claim Template +# KEP-4650: StatefulSet Support for Updating Volume Claim Template +3. Use either current or updated revision of the `volumeClaimTemplate` to create/update the PVC, + just like Pod template. + When to update PVCs: -1. Before advancing `status.updatedReplicas` to the next replica, +1. If `volumeClaimSyncStrategy` is `LockStep`, + before advancing `status.updatedReplicas` to the next replica, additionally check that the PVCs of the next replica are [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. - If not, update the PVC after old Pod deleted, before creating new pod, - or if update is not possible: - - - If `volumeClaimSyncStrategy` is `LockStep`, - wait for the user to delete/update the old PVC manually. - - If `volumeClaimSyncStrategy` is `Async`, - the diff is ignored and the normal rolling update proceeds. - -2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, - The PVCs should be updated just like Pods would. 
A replica is considered ready - if all its volumes are compatible with the new `volumeClaimTemplate`. - `spec.ordinals` and `spec.updateStrategy.rollingUpdate.partition` are also respected. + If not, and we are not going to update it in-place automatically, + wait for the user to delete/update the old PVC manually. + +2. When doing a rolling update, a replica is considered ready if the Pod is ready + and all its volumes are not being updated in-place. + Wait for a replica to be ready for at least `minReadySeconds` before proceeding to the next replica. + +3. Whenever we check for Pod updates, also check for PVC updates. e.g.: - If `spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. - If `spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. + +4. When updating the PVC in-place, if we also re-create the Pod, + update the PVC after the old Pod is deleted, together with creating the new Pod. + Otherwise, if the Pod is not changed, update the PVC only. + +Failure cases: don't leave too many PVCs being updated in-place. We expect to update the PVCs in order. + +- If the PVC update fails, we should block the update process. + If the Pod is also deleted (by controller or manually), don't block the creation of the new Pod. + We should retry and report events for this. + The events and status should look like those when the Pod creation fails. + +- While waiting for the PVC to reach the compatible state, + we should update the status, just like what we do when waiting for the Pod to be ready. + We should block the update process if the PVC is never compatible.
+ +- If the `volumeClaimTemplate` is updated again when the previous rollout is blocked, + similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), + user may need to manually deal with the blocking PVCs (update or delete them). ### What PVC is compatible @@ -348,7 +364,7 @@ bogged down. We're running a CI/CD system and the end-to-end automation is desired. To expand the volumes managed by a StatefulSet, -we can just use the same pipeline that we are already using to updating the Pod. +we can just use the same pipeline that we are already using to update the Pod. All the test, review, approval, and rollback process can be reused. #### Story 2: Migrating Between Storage Providers @@ -377,8 +393,8 @@ The same process as Story 2 can be used. #### Story 5: Asymmetric Replicas -The replicas of our StatefulSet are not identical, so we still want to update -each PVC manually and separately. +The storage requirements of different replicas are not identical, +so we still want to update each PVC manually and separately. Possibly we also update the `volumeClaimTemplate` for new replicas, but we don't want the controller to interfere with the existing replicas. @@ -391,6 +407,10 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> +When designing the `InPlace` update strategy, we update the PVC in the same way we re-create the Pod. +I.e., we update the PVC whenever we would re-create the Pod; +we wait for the PVC to be compatible whenever we would wait for the Pod to be ready. + `volumeClaimSyncStrategy` is introduced to keep compatibility with currently deployed workloads. StatefulSet currently accepts and uses existing PVCs that are not created by the controller, So the `volumeClaimTemplate` and PVC can differ even before this enhancement. Some users may choose to keep the PVCs of different replicas different.
We should not block the Pod updates for them. If `volumeClaimSyncStrategy` is `Async`, -then if the template and PVC differs, and the PVC is not being deleted, -the PVC is not considered as managed by the StatefulSet. +we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplate`, +as we do currently. +Of course, we report this in the status of the StatefulSet. However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`. -We consider a StatefulSet in stable state if all the managed PVCs are compatible with the current template. -In a stable state, most operations are possible, and we are not actively fixing something. - The StatefulSet controller should also keep the current and updated revision of the `volumeClaimTemplate`, so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. ### Risks and Mitigations @@ -994,7 +1012,8 @@ information to express the idea and why it was not acceptable. e.g., prevent decreasing the storage size, preventing expansion if the storage class does not support it. However, this has several drawbacks: * Not reverting the `volumeClaimTemplate` when rolling back the StatefulSet is confusing, -* This can be a barrier when recovering from a failed update. +* The validation can be a barrier when recovering from a failed update. + If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size. * The validation is racy, especially when recovering from failed expansion. We still need to consider most abnormal cases even if we do those validations. * This does not match the pattern of existing behaviors.
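The core decision the controller makes in the reconciliation logic above is whether a PVC is compatible with its template. The rules from the "What PVC is compatible" section can be sketched as a small predicate. This is an illustrative simplification over plain Go types (the `claim` struct and its fields are hypothetical stand-ins); the real controller would compare `v1.PersistentVolumeClaim` objects field by field:

```go
package main

import "fmt"

// claim is a simplified stand-in for the few PVC/template fields the
// compatibility rules compare. It is not a real Kubernetes API type.
type claim struct {
	labels          map[string]string
	annotations     map[string]string
	capacityGiB     int // PVC: status.capacity.storage / template: unused
	requestedGiB    int // template: spec.resources.requests.storage
	volumeAttrClass string
	storageClass    string // stands in for the immutable fields
}

// isSuperset reports whether every key/value pair in want is present in got.
func isSuperset(got, want map[string]string) bool {
	for k, v := range want {
		if got[k] != v {
			return false
		}
	}
	return true
}

// compatible applies the KEP's rules: immutable fields match exactly,
// labels/annotations of the PVC are a superset of the template,
// actual capacity is at least the requested size, and the current
// volume attributes class equals the requested one.
func compatible(pvc, tmpl claim) bool {
	if pvc.storageClass != tmpl.storageClass {
		return false
	}
	if !isSuperset(pvc.labels, tmpl.labels) || !isSuperset(pvc.annotations, tmpl.annotations) {
		return false
	}
	if pvc.capacityGiB < tmpl.requestedGiB {
		return false
	}
	return pvc.volumeAttrClass == tmpl.volumeAttrClass
}

func main() {
	tmpl := claim{requestedGiB: 10, storageClass: "fast",
		labels: map[string]string{"app": "db"}}
	pvc := claim{capacityGiB: 20, storageClass: "fast",
		labels: map[string]string{"app": "db", "extra": "ok"}}
	fmt.Println(compatible(pvc, tmpl)) // over-sized with extra labels: true
	pvc.storageClass = "slow"
	fmt.Println(compatible(pvc, tmpl)) // immutable field differs: false
}
```

Note that an over-sized PVC counts as compatible (and is surfaced via the proposed `overSized` status field) rather than blocking the rollout, matching the "greater than or equal to" rule.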
From 80064340207676e8765ad0208875b26824d9d3dc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Thu, 23 May 2024 11:53:49 +0800 Subject: [PATCH 11/17] volumeClaimTemplates --- .../README.md | 47 ++++++++++--------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md b/keps/sig-storage/4650-stateful-set-update-claim-template/README.md index 2fa59d523dc..ba924b25cac 100644 --- a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/4650-stateful-set-update-claim-template/README.md @@ -107,8 +107,8 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - - [extensively validate the updated volumeClaimTemplate](#extensively-validate-the-updated-volumeclaimtemplate) - - [Only support for updating volumeClaimTemplate.spec.resources.requests.storage](#only-support-for-updating-volumeclaimtemplatespecresourcesrequestsstorage) + - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates) + - [Only support for updating storage size](#only-support-for-updating-storage-size) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -175,8 +175,8 @@ updates. [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md --> -Kubernetes does not support the modification of the `volumeClaimTemplate` of a StatefulSet currently. -This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplate`, +Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently. +This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`, automatically updating the associated PersistentVolumeClaim objects in-place if applicable. 
Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName` fields can be updated in-place. @@ -211,7 +211,7 @@ This brings many headaches in a continuously evolving environment. List the specific goals of the KEP. What is it trying to achieve? How will we know that this has succeeded? --> -* Allow users to update the `volumeClaimTemplate` of a `StatefulSet` in place. +* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place. * Automatically update the associated PersistentVolumeClaim objects in-place if applicable. * Support updating PersistentVolumeClaim objects with `OnDelete` strategy. * Coordinate updates to `Pod` and PersistentVolumeClaim objects. @@ -224,7 +224,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> * Support automatic rolling update of PersistentVolumeClaim. -* Validate the updated `volumeClaimTemplate` as how PVC update does. +* Validate the updated `volumeClaimTemplates` as how PVC update does. * Update ephemeral volumes. @@ -238,7 +238,7 @@ implementation. What is the desired outcome and how do we measure success?. The "Design Details" section below is for the real nitty-gritty. --> -1. Change API server to allow any updates to `volumeClaimTemplate` of a StatefulSet. +1. Change API server to allow any updates to `volumeClaimTemplates` of a StatefulSet. 2. Modify StatefulSet controller to add PVC reconciliation logic. @@ -283,7 +283,7 @@ both for in-place update and for the PVCs that need manual intervention. How to update PVCs: 1. If `volumeClaimUpdateStrategy` is `InPlace`, - and if `volumeClaimTemplate` and actual PVC only differ in mutable fields + and if `volumeClaimTemplates` and actual PVC only differ in mutable fields (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently), update the PVC in-place to the extent possible. 
Do not perform the update that will be rejected by API server, such as @@ -299,14 +299,14 @@ Tested on Kubernetes v1.28, and I can see this event: Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller create Pod test-rwop-0 in StatefulSet test-rwop failed error: pvc data-test-rwop-0 is being deleted --> -3. Use either current or updated revision of the `volumeClaimTemplate` to create/update the PVC, +3. Use either current or updated revision of the `volumeClaimTemplates` to create/update the PVC, just like Pod template. When to update PVCs: 1. If `volumeClaimSyncStrategy` is `LockStep`, before advancing `status.updatedReplicas` to the next replica, additionally check that the PVCs of the next replica are - [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. + [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`. If not, and we are not going to update it in-place automatically, wait for the user to delete/update the old PVC manually. @@ -336,7 +336,7 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd We should update status, just like what we do when waiting for Pod to be ready. We should block the update process if the PVC is never compatible. -- If the `volumeClaimTemplate` is updated again when the previous rollout is blocked, +- If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), user may need to manually deal with the blocking PVCs (update or delete them). @@ -372,7 +372,7 @@ All the test, review, approval, and rollback process can be reused. We decide to switch from home-made local storage to the storage provided by a cloud provider. We can not afford any downtime, so we don't want to delete and recreate the StatefulSet. Our app can automatically rebuild the data in the new storage from other replicas. 
-So we update the `volumeClaimTemplates` of the StatefulSet, delete the PVC and Pod of one replica, let the controller re-create them, then monitor the rebuild process. Once the rebuild completes successfully, we proceed to the next replica. @@ -395,7 +395,7 @@ The same process as Story 2 can be used. The storage requirements of different replicas are not identical, so we still want to update each PVC manually and separately. -Possibly we also update the `volumeClaimTemplate` for new replicas, +Possibly we also update the `volumeClaimTemplates` for new replicas, but we don't want the controller to interfere with the existing replicas. ### Notes/Constraints/Caveats (Optional) @@ -413,12 +413,12 @@ we wait for the PVC to be compatible whenever we would wait for the Pod to be re `volumeClaimSyncStrategy` is introduced to keep compatibility with currently deployed workloads. StatefulSet currently accepts and uses existing PVCs that are not created by the controller, -So the `volumeClaimTemplate` and PVC can differ even before this enhancement. +So the `volumeClaimTemplates` and PVC can differ even before this enhancement. Some users may choose to keep the PVCs of different replicas different. We should not block the Pod updates for them. If `volumeClaimSyncStrategy` is `Async`, -we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplate`, +we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplates`, as we do currently. Of course, we report this in the status of the StatefulSet. However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`.
-The StatefulSet controller should also keep the current and updated revision of the `volumeClaimTemplate`, +The StatefulSet controller should also keep the current and updated revision of the `volumeClaimTemplates`, so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. ### Risks and Mitigations @@ -1006,12 +1006,12 @@ What other approaches did you consider, and why did you rule them out? These do not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. --> -### extensively validate the updated `volumeClaimTemplate` +### Extensively validate the updated `volumeClaimTemplates` -[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplate`. +[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`. e.g., prevent decreasing the storage size, preventing expansion if the storage class does not support it. However, this has several drawbacks: -* Not reverting the `volumeClaimTemplate` when rolling back the StatefulSet is confusing, +* Not reverting the `volumeClaimTemplates` when rolling back the StatefulSet is confusing, * The validation can be a barrier when recovering from a failed update. If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size. * The validation is racy, especially when recovering from failed expansion. We still need to consider most abnormal cases even if we do those validations. * This does not match the pattern of existing behaviors. That is, the controller should take the expected state, retry as needed to reach that state. For example, StatefulSet will not reject an invalid `serviceAccountName`.
-* `volumeClaimTemplate` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, +* `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, a user may still want to affect new PVCs. -### Only support for updating `volumeClaimTemplate.spec.resources.requests.storage` +### Only support for updating storage size -[KEP-0661] only enables expanding the volume. However, because the StatefulSet can take pre-existing PVCs, +[KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`. +However, because the StatefulSet can take pre-existing PVCs, we still need to consider what to do when template and PVC don't match. The complexity of this proposal will not decrease much if we only support expanding the volume. -By enabling arbitrary updating to the `volumeClaimTemplate`, +By enabling arbitrary updating to the `volumeClaimTemplates`, we just acknowledge and officially support this use case. 
[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412 From 9ef734c70a3879daad7cc31fcbccc7002557f5ac Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sat, 25 May 2024 23:40:32 +0800 Subject: [PATCH 12/17] Change the owning-sig to sig-apps --- .../4650-stateful-set-update-claim-template/README.md | 0 .../4650-stateful-set-update-claim-template/kep.yaml | 4 ++-- 2 files changed, 2 insertions(+), 2 deletions(-) rename keps/{sig-storage => sig-apps}/4650-stateful-set-update-claim-template/README.md (100%) rename keps/{sig-storage => sig-apps}/4650-stateful-set-update-claim-template/kep.yaml (97%) diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md similarity index 100% rename from keps/sig-storage/4650-stateful-set-update-claim-template/README.md rename to keps/sig-apps/4650-stateful-set-update-claim-template/README.md diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml similarity index 97% rename from keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml rename to keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index 97160bbc9d2..3a9e5ebf8db 100644 --- a/keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -2,9 +2,9 @@ title: StatefulSet Support for Updating Volume Claim Template kep-number: 4650 authors: - "@huww98" -owning-sig: sig-storage +owning-sig: sig-apps participating-sigs: - - sig-app + - sig-storage status: provisional creation-date: 2024-05-17 reviewers: From 12808040df50319187e562fb35c2866e61507500 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Mon, 17 Jun 2024 19:44:33 +0800 Subject: [PATCH 13/17] update for comments Production Readiness review, etc. 
--- keps/prod-readiness/sig-apps/4650.yaml | 3 ++ .../README.md | 43 +++++++++++++++---- .../kep.yaml | 6 +-- 3 files changed, 40 insertions(+), 12 deletions(-) create mode 100644 keps/prod-readiness/sig-apps/4650.yaml diff --git a/keps/prod-readiness/sig-apps/4650.yaml b/keps/prod-readiness/sig-apps/4650.yaml new file mode 100644 index 00000000000..31adc0d5d14 --- /dev/null +++ b/keps/prod-readiness/sig-apps/4650.yaml @@ -0,0 +1,3 @@ +kep-number: 4650 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index ba924b25cac..68c67c31edf 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -201,8 +201,8 @@ They can only expand the volumes, or modify them with VolumeAttributesClass by updating individual PersistentVolumeClaim objects as an ad-hoc operation. When the StatefulSet scales up, the new PVC(s) will be created with the old config and this again needs manual intervention. -Modifying immutable parameters, shinking, or even switch to another -storage provider is not possible currently. +Modifying immutable parameters, shrinking, or even switching to another +storage provider is not currently possible. This brings many headaches in a continuously evolving environment. ### Goals @@ -678,12 +678,6 @@ well as the [existing list] of feature gates. - Components depending on the feature gate: - kube-apiserver - kube-controller-manager -- [ ] Other - - Describe the mechanism: - - Will enabling / disabling the feature require downtime of the control - plane? - - Will enabling / disabling the feature require downtime or reprovisioning - of a node? ###### Does enabling the feature change any default behavior? @@ -691,7 +685,9 @@ well as the [existing list] of feature gates. 
Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> -No. +The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server while it was previously rejected. + +Otherwise, no. If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values), the behavior of the StatefulSet controller is almost the same as before. @@ -707,9 +703,17 @@ feature. NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. --> +Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs now, +disabling this feature gate should not leave any inconsistent state. + +If the `volumeClaimTemplates` is updated, then the feature is disabled and the StatefulSet is rolled back, +the `volumeClaimTemplates` will be kept as the latest version, and their history will be lost. ###### What happens if we reenable the feature if it was previously rolled back? +If the `volumeClaimUpdateStrategy` is already set to `InPlace`, re-enabling the feature +will kick off the update process immediately. + ###### Are there any tests for feature enablement/disablement? +We will add unit tests for the StatefulSet controller with and without the feature gate, +with `volumeClaimUpdateStrategy` set to `InPlace` and `OnDelete` respectively. ### Rollout, Upgrade and Rollback Planning
--> +- PATCH StatefulSet + - kubectl or other user agents +- PATCH PersistentVolumeClaim + - 1 per updated PVC in the StatefulSet (number of updated claim template * replica) + - StatefulSet controller (in KCM) + - triggered by the StatefulSet spec update +- PATCH StatefulSet status + - 1-2 per updated PVC in the StatefulSet (number of updated claim template * replica) + - StatefulSet controller (in KCM) + - triggered by the StatefulSet spec update and PVC status update ###### Will enabling / using this feature result in introducing new API types? @@ -895,6 +911,7 @@ Describe them, providing: - Supported number of objects per cluster - Supported number of objects per namespace (for namespace-scoped objects) --> +No ###### Will enabling / using this feature result in any new calls to the cloud provider? @@ -903,6 +920,7 @@ Describe them, providing: - Which API(s): - Estimated increase: --> +Not directly. The cloud provider may be called when the PVCs are updated. ###### Will enabling / using this feature result in increasing size or count of the existing API objects? @@ -912,6 +930,9 @@ Describe them, providing: - Estimated increase in size: (e.g., new annotation of size 32B) - Estimated amount of new objects: (e.g., new Object X for every existing Pod) --> +StatefulSet: +- `spec`: 2 new enum fields, ~10B +- `status`: 4 new integer fields, ~10B ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? @@ -923,6 +944,7 @@ Think about adding additional work or introducing new steps in between [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> +No. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? 
@@ -935,6 +957,8 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
+The logic of the StatefulSet controller is more complex, so more CPU will be used.
+TODO: measure the actual increase.

 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

@@ -947,6 +971,7 @@ If any of the resources can be exhausted, how this is mitigated with the existin
 Are there any tests that were run/should be run to understand performance characteristics better
 and validate the declared limits?
 -->
+No.

 ### Troubleshooting

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml
index 3a9e5ebf8db..89587d8f26f 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml
@@ -2,6 +2,7 @@ title: StatefulSet Support for Updating Volume Claim Template
 kep-number: 4650
 authors:
   - "@huww98"
+  - "@vie-serendipity"
 owning-sig: sig-apps
 participating-sigs:
   - sig-storage
@@ -12,6 +13,7 @@ reviewers:
   - "@gnufied"
   - "@msau42"
   - "@xing-yang"
+  - "@soltysh"
 approvers:
   - "@kow3ns"
   - "@xing-yang"
@@ -33,9 +35,7 @@ latest-milestone: "v1.31"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha: "v1.31"
-  beta: "v1.32"
-  stable: "v1.33"
+  alpha: "v1.32"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled

From d36c2ac672de769139047764b1a3d1213c614954 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Tue, 18 Jun 2024 00:19:57 +0800
Subject: [PATCH 14/17] some clarifications

---
 .../README.md | 55 ++++++++++++-------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 68c67c31edf..24f3cd3e236 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -177,9 +177,9 @@ updates.
 Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently.
 This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`,
-automatically updating the associated PersistentVolumeClaim objects in-place if applicable.
-Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName`
-fields can be updated in-place.
+automatically patching the associated PersistentVolumeClaim objects if applicable.
+Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations`
+can be patched.
 For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy.
 All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates
 to honor any dependencies between them.
@@ -211,8 +211,8 @@ This brings many headaches in a continuously evolving environment.
 List the specific goals of the KEP. What is it trying to achieve? How will we
 know that this has succeeded?
 -->
-* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place.
-* Automatically update the associated PersistentVolumeClaim objects in-place if applicable.
+* Allow users to update the `volumeClaimTemplates` of a `StatefulSet`.
+* Automatically patch the associated PersistentVolumeClaim objects if applicable, without interrupting the running Pods.
 * Support updating PersistentVolumeClaim objects with `OnDelete` strategy.
 * Coordinate updates to `Pod` and PersistentVolumeClaim objects.
 * Provide accurate status and error messages to users when the update fails.
@@ -223,8 +223,8 @@ know that this has succeeded?
 What is out of scope for this KEP? Listing non-goals helps to focus discussion
 and make progress.
 -->
-* Support automatic rolling update of PersistentVolumeClaim.
-* Validate the updated `volumeClaimTemplates` as how PVC update does.
+* Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.
+* Validate the updated `volumeClaimTemplates` as how PVC patch does.
 * Update ephemeral volumes.
@@ -251,7 +251,7 @@ Changes to StatefulSet `spec`:
 1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
    specify how to coordinate the update of PVCs and Pods. Possible values are:
    - `OnDelete`: the default value, only update the PVC when the the old PVC is deleted.
-   - `InPlace`: update the PVC in-place if possible. Also includes the `OnDelete` behavior.
+   - `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.
 2. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
    to specify how to update PVCs and Pods. Possible values are:
@@ -265,7 +265,7 @@ Additionally collect the status of managed PVCs, and show them in the StatefulSe
 For each PVC in the template:
 - compatible: the number of PVCs that are compatible with the template.
   These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
-- updating: the number of PVCs that are being updated in-place.
+- updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
 - overSized: the number of PVCs that are over-sized.
 - totalCapacity: the sum of `status.capacity` of all the PVCs.
@@ -277,7 +277,7 @@ Some fields in the `status` are also updated to reflect the staus of the PVCs:
 are updated to reflect the status of PVCs.

 With these changes, user can still use `kubectl rollout status` to monitor the update process,
-both for in-place update and for the PVCs that need manual intervention.
+both for automated patching and for the PVCs that need manual intervention.

 ### Updated Reconciliation Logic

 How to update PVCs:
 1. If `volumeClaimUpdateStrategy` is `InPlace`, and if `volumeClaimTemplates` and actual PVC
    only differ in mutable fields (`spec.resources.requests.storage`,
    `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently),
-   update the PVC in-place to the extent possible.
-   Do not perform the update that will be rejected by API server, such as
-   decreasing the storage size below its current status.
-   Note that decrease the size can help recover from a failed expansion if
-   `RecoverVolumeExpansionFailure` feature gate is enabled.
+   patch the PVC to the extent possible.
+   - `spec.resources.requests.storage` is patched to max(template spec, PVC status).
+     - Do not decrease the storage size below its current status.
+       Note that decreasing the size in the PVC spec can help recover from a failed expansion if
+       the `RecoverVolumeExpansionFailure` feature gate is enabled.
+   - `spec.volumeAttributesClassName` is patched to the template value.
+   - `metadata.labels` and `metadata.annotations` are patched with server side apply.
 2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible), do nothing.
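As an editorial illustration (not part of the patch above), the size rule — the request is patched to max(template spec, PVC status), so the controller never lowers the request below what the volume already provides — can be sketched as follows. `desired_storage` is a hypothetical helper, not actual controller code:

```python
def desired_storage(template_request_gib: int, status_capacity_gib: int) -> int:
    """Return the storage request to patch onto the PVC.

    The patched value is max(template spec, PVC status capacity), so a
    template that asks for less than the volume already has never shrinks
    the request below the provisioned size.
    """
    return max(template_request_gib, status_capacity_gib)


# Template shrunk to 10Gi while the volume is already 20Gi: keep 20Gi.
print(desired_storage(10, 20))  # 20
# Template grown to 30Gi while the volume is 20Gi: patch to 30Gi, triggering expansion.
print(desired_storage(30, 20))  # 30
```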
 But when recreating a Pod and the corresponding PVC is deleting,
@@ -307,7 +309,7 @@ When to update PVCs:
    before advancing `status.updatedReplicas` to the next replica,
    additionally check that the PVCs of the next replica are
    [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`.
-   If not, and we are not going to update it in-place automatically,
+   If not, and if we are not going to patch it automatically,
    wait for the user to delete/update the old PVC manually.
 2. When doing rolling update, A replica is considered ready if the Pod is ready
@@ -321,7 +323,7 @@ When to update PVCs:
    - If `spec.updateStrategy.type` is `OnDelete`,
      Only update the PVC when the Pod is deleted.
-4. When updating the PVC in-place, if we also re-create the Pod,
+4. When patching the PVC, if we also re-create the Pod,
    update the PVC after old Pod deleted, together with creating new pod.
    Otherwise, if pod is not changed, update the PVC only.
@@ -454,7 +456,7 @@ required) or even code snippets. If there's any ambiguity about HOW your
 proposal will be implemented, this is the place to discuss them.
 -->
-We can use Server Side Apply to update the PVCs in-place,
+We can use Server Side Apply to patch the PVCs,
 so that we will not interfere with the user's manual changes,
 e.g. to `metadata.labels` and `metadata.annotations`.
@@ -1050,12 +1052,25 @@ However, this have saveral drawbacks:
 ### Only support for updating storage size

 [KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`.
-However, because the StatefulSet can take pre-existing PVCs,
+However,
+1. because the StatefulSet can take pre-existing PVCs,
 we still need to consider what to do when template and PVC don't match.
 The complexity of this proposal will not decrease much if we only support expanding the volume.
-By enabling arbitrary updating to the `volumeClaimTemplates`,
 we just acknowledge and officially support this use case.
+1. We have VAC now, which is expected to go to beta soon.
+And can be patched to existing PVC. We should also support patching VAC
+by updating `volumeClaimTemplates`.
+
+### Patch PVCs regardless of the immutable fields
+
+We propose to patch the PVCs only when the immutable fields match.
+
+If only expansion is supported, patching regardless of the immutable fields can be a logical choice.
+But this KEP also integrates with VAC. VAC is closely coupled with storage class.
+Only patching VAC if storage class matches is a very logical choice.
+And we'd better follow the same operation model for all mutable fields.

 [KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412

From 70bd32ff06eb9f53d7d2a42a2ee2e9b1eb6c4162 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Fri, 12 Jul 2024 10:55:29 +0800
Subject: [PATCH 15/17] Remove volumeClaimSyncStrategy. Don't allow editing
 immutable PVC fields.

---
 .../README.md | 137 ++++++++----------
 1 file changed, 58 insertions(+), 79 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 24f3cd3e236..7d69552f181 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -82,10 +82,8 @@ tags, and then generate with `hack/update-toc.sh`.
   - [What PVC is compatible](#what-pvc-is-compatible)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes)
-    - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers)
-    - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider)
-    - [Story 4: Shinking the PV by Re-creating PVC](#story-4-shinking-the-pv-by-re-creating-pvc)
-    - [Story 5: Asymmetric Replicas](#story-5-asymmetric-replicas)
+    - [Story 2: Shrinking the PV by Re-creating PVC](#story-2-shrinking-the-pv-by-re-creating-pvc)
+    - [Story 3: Asymmetric Replicas](#story-3-asymmetric-replicas)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
@@ -108,7 +106,9 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
   - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates)
-  - [Only support for updating storage size](#only-support-for-updating-storage-size)
+  - [Support for updating arbitrary fields in volumeClaimTemplates](#support-for-updating-arbitrary-fields-in-volumeclaimtemplates)
+  - [Patch PVCs regardless of the immutable fields](#patch-pvcs-regardless-of-the-immutable-fields)
+- [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -176,11 +176,10 @@ updates.
 -->
 Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently.
-This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`,
+This enhancement proposes to support modifications to the `volumeClaimTemplates`,
 automatically patching the associated PersistentVolumeClaim objects if applicable.
 Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations`
 can be patched.
-For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy.
 All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates
 to honor any dependencies between them.
@@ -201,8 +200,6 @@ They can only expand the volumes, or modify them with VolumeAttributesClass
 by updating individual PersistentVolumeClaim objects as an ad-hoc operation.
 When the StatefulSet scales up, the new PVC(s) will be created with the old config
 and this again needs manual intervention.
-Modifying immutable parameters, shrinking, or even switching to another
-storage provider is not currently possible.
 This brings many headaches in a continuously evolving environment.

 ### Goals
@@ -211,8 +208,8 @@ This brings many headaches in a continuously evolving environment.
 List the specific goals of the KEP. What is it trying to achieve? How will we
 know that this has succeeded?
 -->
-* Allow users to update the `volumeClaimTemplates` of a `StatefulSet`.
-* Automatically patch the associated PersistentVolumeClaim objects if applicable, without interrupting the running Pods.
+* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`.
+* Automatically patch the associated PersistentVolumeClaim objects, without interrupting the running Pods.
 * Support updating PersistentVolumeClaim objects with `OnDelete` strategy.
 * Coordinate updates to `Pod` and PersistentVolumeClaim objects.
 * Provide accurate status and error messages to users when the update fails.
@@ -226,6 +223,7 @@ and make progress.
 * Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.
 * Validate the updated `volumeClaimTemplates` as how PVC patch does.
 * Update ephemeral volumes.
+* Patch PVCs that are different from the template, e.g. StatefulSet adopts the pre-existing PVCs.

 ## Proposal
@@ -238,7 +236,11 @@ implementation. What is the desired outcome and how do we measure success?. The
 "Design Details" section below is for the real nitty-gritty.
 -->
-1. Change API server to allow any updates to `volumeClaimTemplates` of a StatefulSet.
+1. Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
+   * `labels`
+   * `annotations`
+   * `resources.requests.storage`
+   * `volumeAttributesClassName`
 2. Modify StatefulSet controller to add PVC reconciliation logic.
@@ -248,15 +250,10 @@ nitty-gritty.

 Changes to StatefulSet `spec`:

-1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
-   specify how to coordinate the update of PVCs and Pods. Possible values are:
-   - `OnDelete`: the default value, only update the PVC when the the old PVC is deleted.
-   - `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.
-
-2. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
-   to specify how to update PVCs and Pods. Possible values are:
-   - `Async`: the default value, preseve the current behavior.
-   - `LockStep`: update PVCs first, then update Pods. See below for details.
+Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
+specify how to coordinate the update of PVCs and Pods. Possible values are:
+- `OnDelete`: the default value, only update the PVC when the old PVC is deleted.
+- `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.
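As an editorial illustration (not part of the patch above), a StatefulSet using the field proposed in this hunk might look like the following sketch. The manifest is hypothetical: field placement and defaults follow this KEP's text and could change during review:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  volumeClaimUpdateStrategy: InPlace   # proposed field; default would be OnDelete
  selector:
    matchLabels:
      app: web
  serviceName: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.k8s.io/nginx-slim:0.8
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi   # editing this would now be accepted and patched to the PVCs
```

With `InPlace`, raising `storage` here would be patched onto each managed PVC; with the default `OnDelete`, existing PVCs would only pick up the new template when they are deleted and re-created.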
 Changes to StatefultSet `status`:
@@ -264,9 +261,9 @@ Additionally collect the status of managed PVCs, and show them in the StatefulSe
 For each PVC in the template:
 - compatible: the number of PVCs that are compatible with the template.
-  These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
+  These replicas will not be blocked on Pod recreation.
 - updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
-- overSized: the number of PVCs that are over-sized.
+- overSized: the number of PVCs that are larger than the template.
 - totalCapacity: the sum of `status.capacity` of all the PVCs.

 Some fields in the `status` are also updated to reflect the staus of the PVCs:
@@ -305,9 +302,8 @@ Warning  FailedCreate  3m58s (x7 over 3m58s)  statefulset-controller  cre
 just like Pod template.

 When to update PVCs:
-1. If `volumeClaimSyncStrategy` is `LockStep`,
-   before advancing `status.updatedReplicas` to the next replica,
-   additionally check that the PVCs of the next replica are
+1. before advancing `status.updatedReplicas` to the next replica,
+   check that the PVCs of the next replica are
    [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`.
    If not, and if we are not going to patch it automatically,
    wait for the user to delete/update the old PVC manually.
@@ -369,31 +365,19 @@ To expand the volumes managed by a StatefulSet,
 we can just use the same pipeline that we are already using to update the Pod.
 All the test, review, approval, and rollback process can be reused.

-#### Story 2: Migrating Between Storage Providers
+#### Story 2: Shrinking the PV by Re-creating PVC

-We decide to switch from home-made local storage to the storage provided by a cloud provider.
+After running our app for a while, we optimize the data layout and reduce the required storage size.
+Now we want to shrink the PVs to save cost.
 We can not afford any downtime, so we don't want to delete and recreate the StatefulSet.
+We also don't have the infrastructure to migrate between two StatefulSets.
 Our app can automatically rebuild the data in the new storage from other replicas.
 So we update the `volumeClaimTemplates` of the StatefulSet,
 delete the PVC and Pod of one replica, let the controller re-create them,
 then monitor the rebuild process.
 Once the rebuild completes successfully, we proceed to the next replica.

-#### Story 3: Migrating Between Different Implementations of the Same Storage Provider
-
-Our storage provider has a new version that provides new features, but can not be upgraded in-place.
-We can prepare some new PersistentVolumes using the new version, but referencing the same disk
-from the provider as the in-use PVs.
-Then the same update process as Story 2 can be used.
-Although the PVCs are recreated, the data is preserved, so no rebuild is needed.
-
-#### Story 4: Shinking the PV by Re-creating PVC
-
-After running our app for a while, we optimize the data layout and reduce the required storage size.
-Now we want to shrink the PVs to save cost.
-The same process as Story 2 can be used.
-
-#### Story 5: Asymmetric Replicas
+#### Story 3: Asymmetric Replicas

 The storage requirement of different replicas are not identical,
 so we still want to update each PVC manually and separately.
@@ -413,23 +397,8 @@ When designing the `InPlace` update strategy, we update the PVC like how we re-c
 i.e.
 we update the PVC whenever we would re-create the Pod;
 we wait for the PVC to be compatible whenever we would wait for the Pod to be ready.

-`volumeClaimSyncStrategy` is introduce to keep capability of current deployed workloads.
-StatefulSet currently accepts and uses existing PVCs that is not created by the controller,
-So the `volumeClaimTemplates` and PVC can differ even before this enhancement.
-Some users may choose to keep the PVCs of different replicas different.
-We should not block the Pod updates for them.
-
-If `volumeClaimSyncStrategy` is `Async`,
-we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplates`,
-as what we do currently.
-Of course, we report this in the status of the StatefulSet.
-
-However, a workload may rely on some features provided by a specific PVC,
-So we should provide a way to coordinate the update.
-That's why we also need `LockStep`.
-
 The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplates`,
-so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated.
+so that a StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated.

 ### Risks and Mitigations
@@ -690,7 +659,7 @@ automations, so be extremely careful here.
 The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server while it is previously rejected.

 Otherwise No.
-If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values),
+If `volumeClaimUpdateStrategy` is `OnDelete` (the default value),
 the behavior of StatefulSet controller is almost the same as before.

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -1038,29 +1007,29 @@ information to express the idea and why it was not acceptable.
 [KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`.
 e.g., prevent decreasing the storage size, preventing expand if the storage class does not support it.
 However, this have saveral drawbacks:
-* Not reverting the `volumeClaimTemplates` when rollback the StatefulSet is confusing,
-* The validation can be a barrier when recovering from a failed update.
-  If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size.
-* The validation is racy, especially when recovering from failed expansion.
-  We still need to consider most abnormal cases even we do those validations.
-* This does not match the pattern of existing behaviors.
-  That is, the controller should take the expected state, retry as needed to reach that state.
-  For example, StatefulSet will not reject a invalid `serviceAccountName`.
+* If we disallow decreasing, we make the editing a one-way road.
+  If a user edited it and then found it was a mistake, there is no way back.
+  The StatefulSet will be broken forever. If this happens, the updates to Pods will also be blocked. This is not acceptable.
+* To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake.
+  We are forced to do far more validation in the API server, which is very complex and fragile (please see KEP-0661).
+  For example: check the StorageClass `allowVolumeExpansion`, check each PVC's storage class and size,
+  basically duplicating all the validations we have done for PVC.
+  And even if we do all the validations, there are still race conditions and async failures that are impossible to catch.
+  I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
+* Validation means we should disallow rollback of storage size. If we enable it later, it can surprise users, if it is not called a breaking change.
+* The validation conflicts with the RecoverVolumeExpansionFailure feature, although it is still alpha.
 * `volumeClaimTemplates` is also used when creating new PVCs,
   so even if the existing PVCs cannot be updated, a user may still want to affect new PVCs.
+* It violates the high-level design.
+  The template describes a desired final state, rather than an immediate instruction.
+  A lot of things can happen externally after we update the template.
+  For example, I have an IaaS platform, which tries to kubectl apply one updated StatefulSet + one new StorageClass to the cluster to trigger the expansion of PVs.
+  We don't want to reject it just because the StorageClass is applied after the StatefulSet.

-### Only support for updating storage size
+### Support for updating arbitrary fields in `volumeClaimTemplates`

-[KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`.
-However,
-1. because the StatefulSet can take pre-existing PVCs,
-we still need to consider what to do when template and PVC don't match.
-The complexity of this proposal will not decrease much if we only support expanding the volume.
-By enabling arbitrary updating to the `volumeClaimTemplates`,
-we just acknowledge and officially support this use case.
-1. We have VAC now, which is expected to go to beta soon.
-And can be patched to existing PVC. We should also support patching VAC
-by updating `volumeClaimTemplates`.
+No technical limitations. Just that we want to be careful and keep the changes small, so that we can move faster.
+This is just an extra validation in the API server. We may remove it later if we find it is not needed.

 ### Patch PVCs regardless of the immutable fields
@@ -1072,6 +1041,16 @@ Only patching VAC if storage class matches is a very logical choice.
 And we'd better follow the same operation model for all mutable fields.

+## Support for automatically skip not managed PVCs
+
+Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`.
+If it is set to `Async`, then we skip patching the PVCs that are not managed by the StatefulSet (e.g. the StorageClass does not match).
+
+The rules to determine which PVCs are managed are a little bit tricky.
+We have to check each field, and determine what to do for each field.
+
+And still, we want to keep the changes small.
+
 [KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412

 ## Infrastructure Needed (Optional)

From 763b35fd9272edff361162227377c4670f79d8ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Thu, 22 Aug 2024 13:58:10 +0800
Subject: [PATCH 16/17] update with the implementation

---
 .../README.md | 37 ++++++++++++++++---
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 7d69552f181..fb47a510470 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -107,8 +107,10 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Alternatives](#alternatives)
   - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates)
   - [Support for updating arbitrary fields in volumeClaimTemplates](#support-for-updating-arbitrary-fields-in-volumeclaimtemplates)
-  - [Patch PVCs regardless of the immutable fields](#patch-pvcs-regardless-of-the-immutable-fields)
-- [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs)
+  - [Patch PVC size regardless of the immutable fields](#patch-pvc-size-regardless-of-the-immutable-fields)
+  - [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs)
+  - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
+  - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -395,7 +397,7 @@ This might be a good place to talk about core concepts and how they relate.
 When designing the `InPlace` update strategy, we update the PVC like how we re-c
 i.e.
 we update the PVC whenever we would re-create the Pod;
-we wait for the PVC to be compatible whenever we would wait for the Pod to be ready.
+we wait for the PVC to be compatible whenever we would wait for the Pod to be available.

 The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplates`,
 so that a StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated.
@@ -429,6 +431,9 @@ We can use Server Side Apply to patch the PVCs,
 so that we will not interfere with the user's manual changes,
 e.g. to `metadata.labels` and `metadata.annotations`.

+New invariants established about PVCs:
+If a Pod has the revision A label, all its PVCs either do not exist yet or are updated to revision A.
+
 ### Test Plan

 Will add unit tests for the StatefulSet controller with and without the feature gate,
-`volumeClaimUpdateStrategy` set to `InPlace` and `OnDelete` respectively.
+`volumeClaimUpdatePolicy` set to `InPlace` and `OnDelete` respectively.

 ### Rollout, Upgrade and Rollback Planning