Add StatefulSet Volume Expansion Kep #660

Closed
118 changes: 118 additions & 0 deletions keps/sig-apps/20181220-statefulset-volume-expansion.md
@@ -0,0 +1,118 @@
---
title: StatefulSet Volume Expansion
authors:
- "@sidakm"
owning-sig: sig-apps
participating-sigs:
- sig-storage
reviewers:
- "@janetkuo"
- "@gnufied"
approvers:
- "@kow3ns"
editor: TBD
creation-date: 2018-12-20
last-updated: 2019-08-02
status: provisional
see-also:
- https://github.com/kubernetes/enhancements/issues/531
- https://github.com/kubernetes/enhancements/pull/737
- https://github.com/kubernetes/enhancements/issues/284
replaces:
- n/a
superseded-by:
- n/a
---

# StatefulSet Volume Expansion

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Proposal](#proposal)
* [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

The purpose of this enhancement is to allow for the expansion of persistent volume claims created by StatefulSets. This entails propagating increases to storage requests in `StatefulSets.volumeClaimTemplates` to associated persistent volume claims.

## Motivation
In Kubernetes v1.11 the persistent volume expansion feature was promoted to beta. This allowed users to expand volumes by editing storage requests in persistent volume claim objects.

Kubernetes creates a persistent volume claim for each `volumeClaimTemplate` in the `volumeClaimTemplates` component of a StatefulSet, once for every pod, and each claim is bound to a persistent volume. However, it is not possible to expand these persistent volumes by editing the source `volumeClaimTemplate` in the StatefulSet object. The user must therefore modify each pod's persistent volume claims individually, increasing their storage requests, to expand the underlying persistent volumes. This has to be repeated each time the number of replicas in the StatefulSet object is increased, since new persistent volumes are created with the original storage request specified in the `volumeClaimTemplates` component.

It would be simpler, and closer to what users expect, for changes to storage requests in the `volumeClaimTemplates` component of a StatefulSet to propagate to all associated persistent volume claims.

Relevant Issues:

* https://github.com/kubernetes/kubernetes/issues/71477
* https://github.com/kubernetes/kubernetes/issues/72198

### Goals

Allow for increases to storage requests in the `volumeClaimTemplates` component of a StatefulSet to propagate to all associated persistent volume claims.
Contributor

What other fields on the stateful set PVC template would we potentially propagate in the future?

Author

Ah, I guess since we are loosening validation it may make sense to incorporate other potential fields. I could see propagating changes to the volumeClaimTemplates' ObjectMeta, specifically labels and annotations, being useful.


## Proposal

### Implementation Details/Notes/Constraints

The apiserver will allow increases to storage requests in the `volumeClaimTemplates` component of a StatefulSet. Additionally, for each `volumeClaimTemplate` being expanded, it will be necessary to check whether its associated StorageClass has volume expansion enabled. This will be achieved by updating the `PersistentVolumeClaimResize` admission controller to also validate updates to `volumeClaimTemplates` within a StatefulSet. Specifically, the admission controller will now also check that every `volumeClaimTemplate` in a StatefulSet that is being resized is associated with a StorageClass that has `allowVolumeExpansion = true`.
Member

> This will be achieved by updating the PersistentVolumeClaimResize admission controller to incorporate validating updates to volumeClaimTemplates within a StatefulSet.

This is somewhat unusual... we don't typically look at all template objects in admission (admission plugins that look at pods don't typically look at replicaset/statefulset/deployment/job/cronjob templates, pvc admission plugins don't typically look at statefulset templates).

Do we currently gate statefulset creation on existence of the referenced storageclass?

Member

This also means that the operations of updating a storageclass to be resizeable and requesting a size increase in a statefulset are order-dependent. Is that what we want?

What is the behavior if we allowed requesting a size increase in a statefulset that could not be fulfilled because the storage class was not resizeable? (even with the admission check, it's possible to modify a resizable storageclass and mark it not resizeable, so the statefulset controller still needs to handle the case where resizing the PVC is not allowed)

Author

@SidakM-zz, Sep 19, 2019

> Do we currently gate statefulset creation on existence of the referenced storageclass?

No, the creation of a statefulset is not gated on the existence of the referenced storageclass.

> This also means that the operations of updating a storageclass to be resizeable and requesting a size increase in a statefulset are order-dependent. Is that what we want?

I think what we want (similar to what you mentioned above) is to try our best to ensure that the statefulset controller does not allow a volumeClaimTemplate storage request to increase if we are certain it will fail. So we want to at least make sure that the storageClass is marked as resizeable before we accept the storage request increase.

> What is the behavior if we allowed requesting a size increase in a statefulset that could not be fulfilled because the storage class was not resizeable?

Now, if the statefulset controller encounters the case where the PVC resize request could not be fulfilled, either at the admission layer (i.e., the storageclass was marked not resizeable between the statefulset spec update and the resize call to the pvc) or due to the resize controller failing, the documented workaround in the KEP is the suggested option. (Note this is based on the documented workaround to recover from a failing volume expansion in general.) So in this case deleting and recreating the statefulset would suffice (pvcs remain, so no data loss). A rollback would be possible if the volumeClaimTemplate was not in the controller revision, as you mentioned below.

Member

There will always be a chance that the storageclass field does not reflect the true capability of the volume plugin. For in-tree drivers this is easy, and each plugin can be queried. For CSI plugins, we will have to rely on the CSIDriver object, but we weren't considering making CSIDriver a necessity for CSI volume expansion of stateful sets.

So the worst this could do is - it will cause all PVCs managed by stateful set to be stuck in "Resizing" condition. The resize will not actually be retried because resize controller knows that plugin itself does not support resize. The PVCs themselves will be usable and nothing stops them from being used in pod/statefulsets if they have "Resizing" condition.

Member

But I see how this interacts with @liggitt's comment below and could prevent a user from rolling back to a previous revision of the StatefulSet.

Author

@SidakM-zz, Oct 15, 2019

As mentioned above and in the comment below, including the requested volume size in the revision prevents the statefulset from being rolled back (decreases to volume requests are rejected at the api). However, the controller logic for updating pods is centred around the revision name. Thus if the storage request is not included in the revision, the update is not triggered.

Off the top of my head we could probably change another field in the spec which would also be included in the revision and could be rolled back. This would be changed along with any increase to a storage request. This way the controller update logic would remain unchanged and rollbacks would ignore storageRequests. On the other hand including a new field just for this purpose doesn't seem to be a good idea...

We could also mark the pod for termination if the requested size (in the current set) > the actual size (for the pvc). This would be another condition (along with a change in revision) that would trigger a pod to be deleted for update. Not sure if this would break any assumptions, since updates seem to be based around changes to revision names (i.e., with this change pods could be deleted by the controller if such a mismatch is detected, even when an update wasn't explicitly triggered).

@janetkuo @gnufied thoughts?


During the StatefulSet update process, the StatefulSet controller will detect an update to a `volumeClaimTemplate` by comparing the updated and current revision of the StatefulSet. This requires the `VolumeClaimTemplates` component of the StatefulSet to be recorded in the StatefulSet's `ControllerRevision` object.
Member

@liggitt, Sep 17, 2019

as an alternative to putting the volumeTemplate size request increases in the controller history (where they block rollback), the reconcile loop that ensures PVCs exist could also compare requested size between the template and the existing PVC, and reconcile the existing PVC to request an increase. Was this discussed instead?

edit: The next paragraph seems to indicate this will be done anyway... it's not clear why the controller revision aspect is required.

Author

@SidakM-zz, Sep 19, 2019

In this context the "updated set" is actually just the set patched with the update ControllerRevision. If the volumeClaimTemplates object is not in the ControllerRevision, the update revision name will not differ from the revision name marked on each pod (in the pods' labels) and thus the controller will not terminate the pod for an update.
We want the pod to be terminated to update its associated PVCs due to the reasons described in the KEP (to support both online and offline expansion).

So in essence the KEP is suggesting to add the volumeClaimTemplates to the controller revision to ensure the pod terminates in accordance with the existing control flow for statefulset updates, after which we resize the PVC while the pod is down and recreate the pod once the pvc resize succeeds.

If we don't want to include the volumeClaimTemplates in the revision, we could probably modify the update control flow to still cause termination of the pod (hopefully without any side effects), but it was suggested much earlier above that the volumeClaimTemplates be added to the ControllerRevision specifically for the case of rollback.
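The dependency described in this reply, that a pod is only restarted when its revision label differs from the update revision, can be illustrated with a minimal sketch. The function names, the spec layout, and the use of SHA-256 are assumptions made for illustration; the real controller derives revision names differently.

```python
import hashlib
import json


def revision_name(set_name, spec, include_volume_claims):
    """Derive a revision name from the versioned parts of the spec.

    Hypothetical helper: the hashing scheme is an assumption, not the
    controller's actual algorithm.
    """
    versioned = {"template": spec["template"]}
    if include_volume_claims:
        versioned["volumeClaimTemplates"] = spec["volumeClaimTemplates"]
    payload = json.dumps(versioned, sort_keys=True).encode()
    return f"{set_name}-{hashlib.sha256(payload).hexdigest()[:10]}"


def pods_to_restart(pod_revisions, update_revision):
    """Pods whose revision label differs from the update revision."""
    return [pod for pod, rev in pod_revisions.items() if rev != update_revision]


old_spec = {
    "template": {"image": "db:1.0"},
    "volumeClaimTemplates": [{"name": "data", "storage": "10Gi"}],
}
# Only the storage request changes between the two specs.
new_spec = {
    "template": {"image": "db:1.0"},
    "volumeClaimTemplates": [{"name": "data", "storage": "20Gi"}],
}

for include in (False, True):
    current = revision_name("web", old_spec, include)
    update = revision_name("web", new_spec, include)
    pods = {"web-0": current, "web-1": current}
    print(include, pods_to_restart(pods, update))
# prints: False []
# prints: True ['web-0', 'web-1']
```

When the storage request is left out of the versioned data, the revision name does not change and no pod is selected for restart, which is the behaviour the reply is pointing at.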


While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased.
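As a rough sketch of that comparison, the snippet below finds the claims whose storage request is smaller than the request in their template. It relies on the StatefulSet naming convention for claims (`<template>-<statefulset>-<ordinal>`); the function names are hypothetical, and sizes are plain byte counts for brevity rather than Kubernetes quantities.

```python
GIB = 1024 ** 3


def pvc_name(template_name, set_name, ordinal):
    # StatefulSet claims are named <template>-<statefulset>-<ordinal>.
    return f"{template_name}-{set_name}-{ordinal}"


def pvcs_needing_expansion(set_name, replicas, template_requests, pvc_requests):
    """Return {claim name: new size} for claims smaller than their template.

    template_requests maps template name -> requested bytes.
    pvc_requests maps existing claim name -> requested bytes.
    Claims that do not exist yet are skipped; they would be created at
    the template size anyway.
    """
    patches = {}
    for template_name, wanted in template_requests.items():
        for ordinal in range(replicas):
            name = pvc_name(template_name, set_name, ordinal)
            current = pvc_requests.get(name)
            if current is not None and wanted > current:
                patches[name] = wanted
    return patches


existing = {"data-web-0": 20 * GIB, "data-web-1": 10 * GIB}
print(pvcs_needing_expansion("web", 2, {"data": 20 * GIB}, existing))
# prints: {'data-web-1': 21474836480}
```

Only increases produce a patch here; a template request that is smaller than the existing claim is ignored, since shrinking is not supported.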

Not all volumes support online control-plane expansion so this design aims to support both online and offline control-plane volume expansion.

Member

@gnufied, Feb 5, 2019

So if I understand this correctly, the stateful set controller will ALWAYS delete and re-create pods? If the ExpandInUsePersistentVolumes feature is enabled, it will not wait for FileSystemResizePending, and it could delete and re-create the pods at any point in time? This sounds racy: we could delete the pod while a file system resize is ongoing on a node? I know we have some checks in place to ensure we don't allow pod deletion while an operation is ongoing at the kubelet, but it is better to avoid it?

Author

Yes, as @kow3ns mentioned above, the workload controllers don't do in-place updates for any other container or Pod modifications today. More specifically, the StatefulSet controller will always delete and recreate the pod when the update and current revisions of the statefulset are different. So until this functionality is built in general to allow for stable in-place updates of pods (and associated resources), an efficient workaround would be necessary for the issue you mentioned above.

What we could do is, when we recognize that the specific volume type supports online expansion and ExpandInUsePersistentVolumes is enabled, "wait" for expansion to complete before letting the controller delete the pod (just like we do here for FileSystemResizePending during offline expansion). This won't break the current flow of when and how pods are deleted and recreated by the controller, and would still avoid the issue of the pod being deleted during file system resize.

Member

Yeah, I think so. It makes sense to wait for resizing to finish before restarting the pod. Let's update the spec to reflect this.

The functionality provided by this enhancement will be gated by the `StatefulSetVolumeExpansion` feature gate.

For the initial version of this feature we will simply be expanding all volumes when the associated pod is offline. Below is the outline of how a resize will be propagated to each PVC:

1. User updates a storage request in the `volumeClaimTemplates` component of a StatefulSet
2. Apiserver validation validates that `StatefulSetVolumeExpansion` is enabled and that the storage request has not been decreased.
3. The `PersistentVolumeClaimResize` admission controller verifies that the associated StorageClass for the `volumeClaimTemplate` being updated has `allowVolumeExpansion = true`
4. Each PVC will be resized after the associated pod has been deleted by the StatefulSet controller. The controller will wait for expansion to complete before the pod is recreated. The controller will also wait for FileSystemExpansion to complete before concluding.
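Steps 2 and 3 above can be sketched as a single validation function. This is a simplified illustration rather than the apiserver's code: `parse_quantity` and `validate_resize` are hypothetical names, the quantity parser handles only whole numbers with a handful of common suffixes, and the StorageClass is modelled as a plain dictionary.

```python
import re

# Assumption: a small subset of Kubernetes quantity suffixes, integers only.
_SUFFIXES = {
    "": 1,
    "Ki": 2 ** 10, "Mi": 2 ** 20, "Gi": 2 ** 30, "Ti": 2 ** 40,
    "k": 10 ** 3, "M": 10 ** 6, "G": 10 ** 9, "T": 10 ** 12,
}


def parse_quantity(text):
    """Convert a quantity such as '10Gi' into a byte count."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if not match or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {text!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


def validate_resize(feature_gate_enabled, old_request, new_request, storage_class):
    """Return a list of rejection reasons; an empty list means accepted."""
    old_bytes = parse_quantity(old_request)
    new_bytes = parse_quantity(new_request)
    if new_bytes == old_bytes:
        return []  # Nothing changed, nothing to validate.
    errors = []
    # Step 2: feature gate and "no decrease" checks.
    if not feature_gate_enabled:
        errors.append("StatefulSetVolumeExpansion is disabled")
    if new_bytes < old_bytes:
        errors.append("storage request may not be decreased")
    # Step 3: the StorageClass must allow expansion.
    if storage_class is None or not storage_class.get("allowVolumeExpansion", False):
        errors.append("StorageClass does not allow volume expansion")
    return errors


print(validate_resize(True, "10Gi", "20Gi", {"allowVolumeExpansion": True}))
# prints: []
print(validate_resize(True, "10Gi", "5Gi", {"allowVolumeExpansion": True}))
# prints: ['storage request may not be decreased']
```

As the review discussion notes, passing this check does not guarantee that the resize succeeds later, so the controller still has to handle a failed expansion in step 4.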

If volume expansion continues to fail in step 4, we will document the following workaround for users. Note this is based on the documented workaround to recover from a failing volume expansion.

1. Delete the StatefulSet. Note PVCs and PVs are preserved.
2. For the PV associated with the offending PVC, edit it to have the `Retain` reclaim policy.
3. Delete the offending PVC.
4. Recreate the PVC with the old spec and rebind to the same PV.
5. Recreate the StatefulSet with the old spec.

Note this issue can occur after some portion of the volumes have already been successfully resized. So after following the workaround the user can update the StatefulSet again to attempt to expand the remaining volumes.

#### Optimization for volumes that can be expanded online

Note this is an optimization that may be added in later versions of this feature.

Prerequisite: The StatefulSet controller must be able to determine if a volume supports online expansion.

If a volume is deemed to support online expansion it will be expanded before the pod is terminated for the update. The StatefulSet controller will wait for the file system resize to complete on all such volumes before terminating the associated pod.

Termination of the pod after a successful online expansion is unnecessary and incurs cost for the user. e.g., for our Elasticsearch implementation, we migrate data off of a pod before terminating it in the pre-shutdown hook. An additional sentence which describes the impediment to leaving the pod online would be helpful, so that consideration of a future enhancement will not need to retrace all the steps.

Yes, the ExpandInUsePersistentVolumes feature gate allows expanding of a Volume in use by a Pod.

It works like a charm ;-)

Otherwise, the volume will be expanded offline as outlined above after the pod is terminated and before it is recreated.

This design minimizes the time a StatefulSet pod is unavailable if all volumes support online expansion while still supporting volumes that can only be expanded offline.
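The ordering this section describes can be sketched as a plan for a single pod. This is only an illustration under stated assumptions: `plan_pod_update` is a hypothetical name, the per-volume "supports online expansion" flag stands in for the prerequisite above, and the real controller would act on API objects rather than return strings.

```python
def plan_pod_update(pod, volumes):
    """Return the ordered steps for updating one pod.

    volumes maps claim name -> True if the volume supports online expansion.
    """
    online = sorted(name for name, ok in volumes.items() if ok)
    offline = sorted(name for name, ok in volumes.items() if not ok)

    steps = []
    # Online-capable volumes are expanded while the pod is still running.
    for name in online:
        steps.append(f"expand {name} (online)")
        steps.append(f"wait for file system resize on {name}")
    steps.append(f"delete {pod}")
    # The remaining volumes are expanded while the pod is down.
    for name in offline:
        steps.append(f"expand {name} (offline)")
        steps.append(f"wait for expansion of {name}")
    steps.append(f"recreate {pod}")
    # Offline volumes finish their file system resize once the pod is back.
    for name in offline:
        steps.append(f"wait for file system resize on {name}")
    return steps


for step in plan_pod_update("web-0", {"data-web-0": True, "logs-web-0": False}):
    print(step)
```

If every volume supports online expansion, the window between "delete" and "recreate" contains no expansion work, which is where the reduction in unavailability comes from.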

### Risks and Mitigations

Since changes to `VolumeClaimTemplates` will be recorded in a StatefulSet's revisions, rolling back a volume expansion would imply that the client is attempting to shrink volumes, which is unsupported and would rightfully be rejected by the apiserver. However, this also means that a client may attempt to roll back another change made in a revision X but be prevented from doing so because a volume was expanded in a revision Y where Y >= X. The implications of this, and potential alternatives, should be discussed further.
Member

@liggitt, Sep 17, 2019

this needs to be resolved before this is implementable (doesn't need to block provisional merge)
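The rollback concern in this section can be made concrete with a small sketch, assuming each revision records both the pod template and the storage requests. The revision layout and the function names are hypothetical, and sizes are plain GiB integers for brevity.

```python
def shrinking_templates(live, target):
    """Names of volume claim templates whose request would decrease."""
    return sorted(
        name
        for name, size in live["storage"].items()
        if target["storage"].get(name, size) < size
    )


def can_roll_back(live, target):
    """A rollback is only accepted if no storage request shrinks."""
    return not shrinking_templates(live, target)


# Hypothetical history: revision 2 changes the image (X), revision 3 expands
# the volume (Y >= X). Storage sizes are in GiB.
revisions = {
    1: {"image": "db:1.0", "storage": {"data": 10}},
    2: {"image": "db:1.1", "storage": {"data": 10}},
    3: {"image": "db:1.1", "storage": {"data": 20}},
}

live = revisions[3]
# Undoing the image change means returning to revision 1, which also carries
# the smaller storage request.
print(can_roll_back(live, revisions[1]), shrinking_templates(live, revisions[1]))
# prints: False ['data']
```

The image change itself is harmless to undo, yet the rollback is refused because the target revision also carries the pre-expansion storage request.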


## Graduation Criteria

Move to Alpha after initial implementation and approvals.

Consider optimizing for volumes that support online expansion for the Beta.

## Implementation History

* Initial implementation [PR](https://github.com/kubernetes/kubernetes/pull/71384/files)