
Add KEP for recovering from volume expansion failure. #1516

Merged
merged 2 commits into kubernetes:master on May 19, 2020

Conversation

@gnufied gnufied (Member) commented Jan 28, 2020

Add a KEP for recovering from resize failure.

xref: #1790

/assign @msau42 @jsafrane @saad-ali

cc @kubernetes/sig-storage-api-reviews

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2020
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 28, 2020
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@johnbelamaric (Member)

Please note that to make 1.18 this KEP needs to merge TODAY, with implementable status and both a test plan and grad criteria, or you will need to file an exception: https://github.com/kubernetes/sig-release/blob/master/releases/EXCEPTIONS.md

@msau42 msau42 (Member) commented Jan 28, 2020

/assign @davidz627

@davidz627 davidz627 (Contributor) left a comment

Mostly looks good, but I have some questions. Also, I think some of the wording is imprecise, so it is sometimes hard to follow exactly why we're doing things, or which component is doing what at what time.


### Risks and Mitigations

- One risk, as mentioned above, is that if an expansion failed and the user successfully retried the expansion with a smaller value, the quota code will keep reporting the higher value. In practice this should be acceptable, since such expansion failures should be rare, and an admin can unblock the user by increasing the quota or rebuilding the PVC if needed. We will emit events on the PV and PVC to alert admins and users.
Contributor

Is there any follow-up we can do to automatically reconcile quota or fix this weird state, or will it just persist in the cluster "forever" until someone realizes something is wrong and has to go in and manually fix it?

Member Author

We will document how to fix the quota, but I do not think a mechanism to reconcile the quota automatically will be possible. Added something similar in the alternatives-considered section.

### Test Plan

* Basic unit tests for storage strategy of PVC and quota system.
* E2e tests using mock driver to cause failure on expansion and recovery.
Contributor

Should this be added to the external test suite so all CSI drivers can also be tested against this?

@davidz627 (Contributor)

FYI my working hours end around 7:00pm PST today. If you make an update before then I will be able to review it, otherwise I will not be able to lgtm this PR today (but someone else might)

@bswartz bswartz (Contributor) left a comment

This approach doesn't match what I expected. It seems to focus on quota tracking, but not on the problem of how to avoid getting the object into a state where the user is trying to shrink the volume.

I had expected the purpose served by the new field to be "the size that Kubernetes is trying to make the volume", which is allowed to decrease, but only if the controller confirms that expansion is not in progress. AllocatedResources could serve that purpose, but I would expect it to go down in cases where the resize failed and the user reduced the Resources request to a smaller number.


We do, however, have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as actual shrinking of the volume and (incorrectly) reduce the user's used storage size. Since we know the actual size of the volume only after performing the expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.

To solve the aforementioned problem, we propose adding a new field to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` whenever `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question. This does mean that if a user expanded a PVC to a size it failed to reach, and then retries expansion with a lower value, the used quota will still reflect the value from the failed expansion request.
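The quota arithmetic described above can be sketched as follows. This is a minimal illustration using plain int64 byte counts; the real controller operates on resource.Quantity values, and the helper names here are hypothetical:

```go
package main

import "fmt"

const gi = int64(1) << 30

// maxInt64 returns the larger of a and b.
func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// bumpAllocated models the proposed api-server rule: AllocatedResources
// is only allowed to increase, taking the new request's value when it
// exceeds the current value.
func bumpAllocated(allocated, requested int64) int64 {
	return maxInt64(allocated, requested)
}

// quotaUsage is what the quota controller would charge for the PVC:
// max(pvc.Spec.Resources, pvc.Spec.AllocatedResources).
func quotaUsage(requested, allocated int64) int64 {
	return maxInt64(requested, allocated)
}

func main() {
	allocated := bumpAllocated(10*gi, 100*gi) // user expands a 10Gi PVC to 100Gi
	// Expansion fails; user retries with 20Gi. AllocatedResources stays at 100Gi.
	allocated = bumpAllocated(allocated, 20*gi)
	fmt.Printf("charged quota: %dGi\n", quotaUsage(20*gi, allocated)/gi)
	// prints: charged quota: 100Gi
}
```

The ratchet is the key design point: because AllocatedResources never decreases, a failed oversized expansion keeps counting against quota even after the user lowers the request.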
Contributor

I think you mean: this field is only allowed to increase and will be set by the api-server to the value in pv.Spec.Capacity, not pvc.Spec.Resources. Otherwise this is useless. The former is the actual size of the volume; the latter is the size the user controls.

- User requests a size of 20Gi.
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
- Expansion succeeds, and `pvc.Status.Capacity` and `pv.Spec.Capacity` report the new size as `20Gi`.
- `pvc.Spec.AllocatedResources`, however, keeps reporting `100Gi`.
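At the end of the flow above, the PVC would look roughly like this (a hedged sketch of the proposed field shape, not actual API output; the metadata name is made up):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc        # hypothetical name
spec:
  resources:
    requests:
      storage: 20Gi        # size the user currently requests
  allocatedResources:
    storage: 100Gi         # high-water mark; only ever increases, used for quota
status:
  capacity:
    storage: 20Gi          # actual size after the successful expansion to 20Gi
```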
Contributor

I would expect AllocatedResources to decrease to 20Gi in this case.

Member Author

Why would you expect pvc.spec.allocatedResources to be 20Gi? The field has been added only to track quota usage, and quota has always been tracked by user-requested size, not by actual volume size. It is a known limitation of this KEP that it does not reduce used quota when the requested PVC size is reduced.

Contributor

Okay I understand now

Contributor

This can be very confusing for the user so it should be documented clearly on the Kubernetes doc website once this is implemented.

Member Author

I have added this to the graduation criteria.

@gnufied gnufied (Member Author) commented Jan 29, 2020

This approach doesn't match what I expected. It seems to focus on quota tracking, but not on the problem of how to avoid getting the object into a state where the user is trying to shrink the volume.

@bswartz I do not completely understand what you are trying to say. You seem to be saying we should try to avoid putting the PVC in a state where it will require shrinking, which would imply that if we know an expansion request is going to fail for capacity reasons, we should not attempt it. That might be a valid way to solve this problem if storage providers could tell us, without actually expanding the volume, whether the operation is going to succeed (something like a dry run?). Is that something that will be possible for different storage types?

I had expected the purpose served by the new field to be "the size that Kubernetes is trying to make the volume", which is allowed to decrease, but only if the controller confirms that expansion is not in progress. AllocatedResources could serve that purpose, but I would expect it to go down in cases where the resize failed and the user reduced the Resources request to a smaller number.

It is essentially racy to try to determine whether expansion is in progress. Expansion may not be in progress but could be waiting in the expand controller's request queue; the same thing could happen on the node. The purpose of pvc.spec.allocatedResources is not to determine whether an expansion request is in flight, but to track user-requested PVC sizes.

@gnufied gnufied force-pushed the recover-from-resize-failure branch from 7835a2b to aad8bba Compare January 29, 2020 03:51
@bswartz bswartz (Contributor) commented Jan 29, 2020

After talking to @gnufied in a Zoom meeting I understand why my proposal doesn't fix things, and while this proposal isn't perfect, it's the best alternative available. The main shortcoming of this proposal can be addressed with a doc change that explains the procedure for fixing quota issues after accidentally oversizing a volume.

@xing-yang (Contributor)

This KEP should also cover the following case:
kubernetes-csi/external-resizer#62

- The external-resizer should check whether the CSI driver supports the ONLINE/OFFLINE volume expansion plugin capability. Currently this capability is documented in the CSI spec but not checked by the external-resizer.
- If a CSI driver does not support the ONLINE volume expansion plugin capability, the external-resizer should not call the driver's ControllerExpandVolume RPC if the volume is used by a pod.
- If a CSI driver does not support the OFFLINE volume expansion plugin capability, the external-resizer should not call the driver's ControllerExpandVolume RPC if the volume is not used by a pod.
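The gating in the list above can be sketched as follows. This is an illustrative Go sketch: the type and function names are invented, and the real external-resizer would read the VolumeExpansion capability (ONLINE/OFFLINE) from the driver's GetPluginCapabilities response:

```go
package main

import "fmt"

// expansionSupport is a hypothetical stand-in for the ONLINE/OFFLINE
// VolumeExpansion plugin capability a CSI driver advertises.
type expansionSupport struct {
	online  bool
	offline bool
}

// shouldCallControllerExpand sketches the proposed gating: only issue
// the controller-side expansion RPC when the driver supports the
// expansion mode matching the volume's current state.
func shouldCallControllerExpand(caps expansionSupport, inUseByPod bool) bool {
	if inUseByPod {
		return caps.online
	}
	return caps.offline
}

func main() {
	offlineOnly := expansionSupport{offline: true}
	fmt.Println(shouldCallControllerExpand(offlineOnly, true))  // false: volume in use, no ONLINE support
	fmt.Println(shouldCallControllerExpand(offlineOnly, false)) // true: volume idle, OFFLINE supported
}
```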

@gnufied gnufied (Member Author) commented Feb 4, 2020

@xing-yang thanks for linking that issue, but I think it is a separate problem from the one we are trying to fix in this KEP. I have responded on the linked GitHub issue. I do agree it would be nice to fix that issue before volume expansion goes GA.

- While the storage backend is resizing the volume, the user requests a size of 20Gi by changing pvc.spec.resources.requests["storage"] = "20Gi".
- Quota controller sees no change in storage usage by the PVC because pvc.Spec.AllocatedResources is 100Gi.
- Expansion succeeds, and pvc.Status.Capacity and pv.Spec.Capacity report the new size as 100Gi, since that is what the volume plugin did.

Contributor

Isn't there a fourth case, where the expansion happened in the storage backend but the second step fails on the node (the volume filesystem expansion fails at pod spawn), while meanwhile the user shrank the requested size? Or did I miss anything?

Member Author

I can add that as a fourth case, but it does not affect the flow I was trying to illustrate, so I skipped it.

@gnufied gnufied force-pushed the recover-from-resize-failure branch from aad8bba to 6de0f20 Compare April 29, 2020 02:25
@neolit123 neolit123 removed sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels May 18, 2020
- "@xing-yang"
editor: TBD
creation-date: 2020-01-27
last-updated: 2020-01-27
Contributor

nit: This should be updated.

Member Author

Fixed, but that lost the lgtm. Can you add it back?

@xing-yang (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 18, 2020
@gnufied gnufied force-pushed the recover-from-resize-failure branch from 25f700c to f6b8df5 Compare May 18, 2020 19:59
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 18, 2020
@xing-yang (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2020

We do, however, have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as actual shrinking of the volume and (incorrectly) reduce the user's used storage size. Since we know the actual size of the volume only after performing the expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.

To solve the aforementioned problem, we propose adding a new field to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` whenever `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question.
Member

  1. Can you explain why pvc.Status.Capacity can't be used for quota allocation? Why doesn't that always reflect the actual capacity?
  2. Why is AllocatedResources under pvc.Spec and not pvc.Status?
  3. Is AllocatedResources the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.

Member Author

Can you explain why pvc.Status.Capacity can't be used for quota allocation? Why doesn't that always reflect the actual capacity?

pvc.Status.Capacity can't be used for tracking quota because pvc.Status.Capacity is calculated after binding happens, which could be as late as when the pod is started. This would allow a user to overcommit, because quota won't reflect an accurate value until the PVC is bound to a PV.

Why is AllocatedResources under pvc.Spec and not pvc.Status?

There are two reasons:

  1. AllocatedResources is not the volume size, but rather whatever the user has requested and the resize-controller was working to reconcile towards. It is possible that the user has since requested a smaller size, but that does not change the fact that the resize-controller has already tried to expand to AllocatedResources and might have partially succeeded. So AllocatedResources is the maximum user-requested size for this volume and does not reflect the actual volume size of the PV.

  2. Following https://github.com/kubernetes/enhancements/pull/1342/files (pointed out by Jordan), which uses spec.containers[i].ResourcesAllocated for tracking user requests, keeping this field in the PVC spec makes it somewhat consistent.

Is AllocatedResources the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.

There are two reasons. The first is to follow the convention of the vertical pod scaling KEP. The second is that AllocatedResources does mean this is the size the resize controller (or PV) has tried to reconcile towards. It is possible that the user reduced the requested size and the resize controller stopped reconciliation, but the expansion may already have succeeded, so AllocatedResources sounds like an okay name.

Member

This is good context. Can you add it to the doc? Otherwise LGTM

Member Author

Added to the KEP: one under alternatives considered, and another about naming.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
Also why we can't use pvc.status.capacity
@gnufied gnufied force-pushed the recover-from-resize-failure branch from d3de0c1 to b8d183e Compare May 19, 2020 21:36
@saad-ali (Member)

I'm still a bit hesitant about adding an AllocatedResources field to fix this issue, because I fear it can lead to user confusion (it's in the spec, after all). That said, I do think this is an important issue to fix, so let's add the field as alpha and reassess.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6d3b6b7 into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020