Add StatefulSet Volume Expansion KEP #660
Changes from all commits (f4f3a48)
What other fields on the stateful set PVC template would we potentially propagate in the future?
---
Ah, I guess since we are loosening validation it may make sense to incorporate other potential fields. I could see propagating changes to the volumeClaimTemplates' ObjectMeta, specifically labels and annotations, being useful.
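The propagation mentioned here could be sketched roughly as follows. This is a hypothetical illustration with simplified stand-in types (not the real `ObjectMeta` from `k8s.io/apimachinery`), showing one way the controller might merge template labels/annotations onto an existing PVC without clobbering keys set directly on the PVC by other actors:

```go
package main

import "fmt"

// ObjectMeta is a simplified stand-in for the Kubernetes type of the same
// name; only the fields relevant to this sketch are included.
type ObjectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// syncPVCMeta is a hypothetical helper: it copies the template's labels and
// annotations onto the PVC's metadata, overwriting matching keys but leaving
// unrelated keys on the PVC untouched.
func syncPVCMeta(template, pvc ObjectMeta) ObjectMeta {
	if pvc.Labels == nil {
		pvc.Labels = map[string]string{}
	}
	if pvc.Annotations == nil {
		pvc.Annotations = map[string]string{}
	}
	for k, v := range template.Labels {
		pvc.Labels[k] = v
	}
	for k, v := range template.Annotations {
		pvc.Annotations[k] = v
	}
	return pvc
}

func main() {
	tpl := ObjectMeta{Labels: map[string]string{"app": "db", "tier": "storage"}}
	pvc := ObjectMeta{Labels: map[string]string{"app": "db", "user-set": "keep"}}
	out := syncPVCMeta(tpl, pvc)
	fmt.Println(out.Labels["tier"], out.Labels["user-set"]) // prints "storage keep"
}
```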
---
This is somewhat unusual... we don't typically look at all template objects in admission (admission plugins that look at pods don't typically look at replicaset/statefulset/deployment/job/cronjob templates, pvc admission plugins don't typically look at statefulset templates).
Do we currently gate statefulset creation on existence of the referenced storageclass?
---
This also means that the operations of updating a storageclass to be resizeable and requesting a size increase in a statefulset are order-dependent. Is that what we want?
What is the behavior if we allowed requesting a size increase in a statefulset that could not be fulfilled because the storage class was not resizeable? (even with the admission check, it's possible to modify a resizable storageclass and mark it not resizeable, so the statefulset controller still needs to handle the case where resizing the PVC is not allowed)
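The order-dependence being discussed could look roughly like the following check at update validation time. This is a hedged sketch, not the actual apiserver code: quantities are plain `int64` bytes instead of `resource.Quantity`, the StorageClass lookup is stubbed out, and `validateStorageUpdate` is an invented name:

```go
package main

import (
	"errors"
	"fmt"
)

// StorageClass is a simplified stand-in; only the field relevant here exists.
type StorageClass struct {
	AllowVolumeExpansion bool
}

// validateStorageUpdate is a hypothetical admission-style check on a
// volumeClaimTemplate storage request change: decreases are always rejected,
// and increases are only accepted if the class currently allows expansion.
func validateStorageUpdate(oldReq, newReq int64, sc *StorageClass) error {
	switch {
	case newReq == oldReq:
		return nil
	case newReq < oldReq:
		return errors.New("decreasing a volumeClaimTemplate storage request is not allowed")
	case sc == nil || !sc.AllowVolumeExpansion:
		// Even with this check, the class can later be marked non-expandable,
		// so the statefulset controller must still tolerate a failed resize.
		return errors.New("storage class does not allow volume expansion")
	default:
		return nil
	}
}

func main() {
	sc := &StorageClass{AllowVolumeExpansion: true}
	fmt.Println(validateStorageUpdate(10, 20, sc)) // prints "<nil>": increase allowed
}
```

Note the comment in the middle case: this check only narrows the window, it cannot close it, which is exactly the race raised above.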
---
No, the creation of a statefulset is not gated on the existence of the referenced storageclass.
I think what we want (similar to what you mentioned above) is to try our best to ensure that the statefulset controller does not allow a volumeClaimTemplate storage request to increase if we are certain it will fail. So we want to at least make sure that the storageClass is marked as resizeable before we accept the storage request increase.
Now if the statefulset controller encounters the case where the PVC resize request could not be fulfilled, either at the admission layer (i.e. the storageclass was marked not resizeable between the statefulset spec update and the resize call to the PVC) or due to the resize controller failing, the workaround documented in the KEP is the suggested option. (Note this is based on the documented workaround for recovering from a failing volume expansion in general.) So in this case deleting and recreating the statefulset would suffice (the PVCs remain, so there is no data loss). A rollback would be possible if the volumeClaimTemplate was not in the controller revision, as you mentioned below.
---
There will always be a chance that the storageclass field does not reflect the true capability of the volume plugin. For in-tree drivers this is easy and each plugin can be queried. For CSI plugins, we will have to rely on the `CSIDriver` object, but we weren't considering making `CSIDriver` a necessity for CSI volume expansion of stateful sets. So the worst this could do is cause all PVCs managed by the stateful set to be stuck in the "Resizing" condition. The resize will not actually be retried, because the resize controller knows that the plugin itself does not support resize. The PVCs themselves will be usable, and nothing stops them from being used in pods/statefulsets while they have the "Resizing" condition.
---
But I see how this interacts with @liggitt's comment below and could prevent a user from rolling back to a previous revision of the StatefulSet.
---
As mentioned above and in the comment below, including the requested volume size in the revision prevents the statefulset from being rolled back (decreases to volume requests are rejected at the API). However, the controller logic for updating pods is centred around the revision name, so if the storage request is not included in the revision, the update is not triggered.
Off the top of my head, we could probably change another field in the spec which would also be included in the revision and could be rolled back. This field would be changed along with any increase to a storage request. This way the controller update logic would remain unchanged and rollbacks would ignore storage requests. On the other hand, including a new field just for this purpose doesn't seem like a good idea...
We could also mark the pod for termination if the requested size (in the current set) > the actual size (of the PVC). This would be another condition (along with a change in revision) that would trigger a pod to be deleted for update. Not sure if this would break any assumptions, since updates seem to be based around changes to revision names (i.e. with this change, pods could be deleted by the controller when such a mismatch is detected even if an update wasn't explicitly triggered).
@janetkuo @gnufied thoughts?
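The second alternative above (terminate on a size mismatch, not only on a revision change) can be sketched as a single predicate. Hypothetical names and simplified `int64` sizes; the real controller compares `resource.Quantity` values:

```go
package main

import "fmt"

// needsTermination is an illustrative sketch of the proposed trigger: a pod
// is deleted for update either when its revision label is stale, or when the
// current template requests more storage than the pod's PVC actually has.
func needsTermination(podRevision, updateRevision string, templateRequest, pvcCapacity int64) bool {
	if podRevision != updateRevision {
		return true // existing trigger: revision mismatch
	}
	return templateRequest > pvcCapacity // proposed additional trigger
}

func main() {
	// Same revision, but the template now asks for more storage than the PVC has.
	fmt.Println(needsTermination("rev-1", "rev-1", 20, 10)) // prints "true"
}
```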
---
as an alternative to putting the volumeTemplate size request increases in the controller history (where they block rollback), the reconcile loop that ensures PVCs exist could also compare requested size between the template and the existing PVC, and reconcile the existing PVC to request an increase. Was this discussed instead?
edit: The next paragraph seems to indicate this will be done anyway... it's not clear why the controller revision aspect is required.
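The alternative proposed here (reconcile the live PVC against the template instead of recording sizes in controller history) could look roughly like this. Simplified, hypothetical types and names; the real reconcile pass would issue a PVC update through the API:

```go
package main

import "fmt"

// PVC is a simplified stand-in: Request models
// spec.resources.requests.storage as plain bytes.
type PVC struct {
	Name    string
	Request int64
}

// reconcilePVCSize compares the template's storage request to the existing
// PVC and returns the patched PVC plus whether an update is needed. It never
// shrinks: decreases are rejected at the API anyway.
func reconcilePVCSize(templateRequest int64, pvc PVC) (PVC, bool) {
	if templateRequest > pvc.Request {
		pvc.Request = templateRequest
		return pvc, true // caller would issue the PVC update here
	}
	return pvc, false
}

func main() {
	pvc, changed := reconcilePVCSize(20, PVC{Name: "data-web-0", Request: 10})
	fmt.Println(pvc.Request, changed) // prints "20 true"
}
```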
---
In this context the "updated set" is actually just the set patched with the update ControllerRevision. If the volumeClaimTemplates object is not in the ControllerRevision, the update revision name will not differ from the revision name marked on each pod (in the pods' labels) and thus the controller will not terminate the pod for an update.
We want the pod to be terminated to update its associated PVCs due to the reasons described in the KEP (to support both online and offline expansion).
So in essence the KEP is suggesting adding the volumeClaimTemplates to the controller revision to ensure the pod terminates in accordance with the existing control flow for statefulset updates, after which we resize the PVC while the pod is down and recreate the pod once the PVC resize succeeds.
If we don't want to include the volumeClaimTemplates in the revision, we could probably modify the update control flow to still cause termination of the pod (hopefully without any side effects), but it was suggested much earlier above that the volumeClaimTemplates be added to the ControllerRevision specifically for the case of rollback.
---
So if I understand this correctly, the stateful set controller will ALWAYS delete and re-create pods? If the `ExpandInUsePersistentVolumes` feature is enabled, it will not wait for `FileSystemResizePending` and could delete and re-create the pods at any point in time? This sounds racy: we could delete the pod while a file system resize is ongoing on a node? I know we have some checks in place to ensure we don't allow pod deletion while an operation is ongoing at kubelet, but it is better to avoid it?

---
Yes, as @kow3ns mentioned above, the workloads controllers don't do in-place updates for any other container or Pod modifications today. More specifically, the StatefulSet controller will always delete and recreate the pod when the update and current revisions of the statefulset are different. So until this functionality is built in general to allow for stable in-place updates of pods (and associated resources), an efficient workaround would be necessary for the issue you mentioned above.
What we could do is, when we recognize that the specific volume type supports online expansion and `ExpandInUsePersistentVolumes` is enabled, "wait" for expansion to complete before letting the controller delete the pod (just like we do here with `FileSystemResizePending` during offline expansion). This won't break the current flow of when and how pods are deleted and recreated by the controller, and would still avoid the issue of the pod being deleted during file system resize.

---
Yeah I think so. It makes sense to wait for resizing to finish before restarting the pod. Let's update the spec to reflect this.
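The gate agreed on above could be sketched as follows. This is an illustrative sketch with simplified, invented types: `Capacity` models the PVC's reported status capacity in bytes, and `FileSystemResizePending` models the PVC condition of the same name:

```go
package main

import "fmt"

// PVCStatus is a simplified stand-in for the relevant parts of a PVC's status.
type PVCStatus struct {
	Capacity                int64
	FileSystemResizePending bool
}

// safeToDeletePod sketches when the controller may delete the pod during an
// expansion. Online: wait until reported capacity has caught up with the
// request, so we never kill a pod mid-resize. Offline: the pod must be
// deleted for kubelet to finish, but only once the control-plane half is
// done and FileSystemResizePending is set.
func safeToDeletePod(request int64, st PVCStatus, onlineExpansion bool) bool {
	if onlineExpansion {
		return st.Capacity >= request
	}
	return st.FileSystemResizePending
}

func main() {
	fmt.Println(safeToDeletePod(20, PVCStatus{Capacity: 20}, true)) // prints "true"
}
```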
---
Termination of the pod after a successful online expansion is unnecessary and incurs cost for the user. e.g., for our Elasticsearch implementation, we migrate data off of a pod before terminating it in the pre-shutdown hook. An additional sentence which describes the impediment to leaving the pod online would be helpful, so that consideration of a future enhancement will not need to retrace all the steps.
---
Yes, the `ExpandInUsePersistentVolumes` feature gate allows expanding a volume in use by a Pod. It works like a charm ;-)
---
this needs to be resolved before this is implementable (doesn't need to block provisional merge)