
Add KEP for recovering from volume expansion failure. #1516

Merged
merged 2 commits into kubernetes:master on May 19, 2020

Conversation

@gnufied gnufied (Member) commented Jan 28, 2020

Add a KEP for recovering from resize failure.

xref: #1790

/assign @msau42 @jsafrane @saad-ali

cc @kubernetes/sig-storage-api-reviews

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2020
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 28, 2020
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@johnbelamaric (Member)

Please note that to make 1.18 this KEP needs to merge TODAY, with implementable status and both a test plan and grad criteria, or you will need to file an exception: https://github.com/kubernetes/sig-release/blob/master/releases/EXCEPTIONS.md

@msau42 msau42 (Member) commented Jan 28, 2020

/assign @davidz627

@davidz627 davidz627 (Contributor) left a comment

Mostly looks good, but I have some questions. Also, I think some of the wording is imprecise, so it is sometimes hard to follow exactly why we're doing things, or which component is doing what at what time.


### Risks and Mitigations

- One risk, as mentioned above, is that if an expansion failed and the user successfully retried the expansion with a smaller value, the quota code will keep reporting the higher value. In practice this should be acceptable, since such expansion failures should be rare, and an admin can unblock the user by increasing the quota or rebuilding the PVC if needed. We will emit events on the PV and PVC to alert admins and users.
Contributor

Is there any follow-up we can do to automatically reconcile quota or fix this weird state, or will it just persist in the cluster "forever" until someone realizes something is wrong and has to go in and manually fix it?

Member Author

We will document how to fix the quota, but I do not think a mechanism to reconcile the quota automatically will be possible. Added something similar in the alternatives-considered section.

### Test Plan

* Basic unit tests for storage strategy of PVC and quota system.
* E2e tests using mock driver to cause failure on expansion and recovery.
Contributor

Should this be added to the external test suite so all CSI drivers can also be tested against this?

@davidz627 (Contributor)

FYI my working hours end around 7:00pm PST today. If you make an update before then I will be able to review it, otherwise I will not be able to lgtm this PR today (but someone else might)

@bswartz bswartz (Contributor) left a comment

This approach doesn't match what I expected. It seems to focus on quota tracking, but not on the problem of how to avoid getting the object into a state where the user is trying to shrink the volume.

I had expected the purpose served by the new field to be "the size that Kubernetes is trying to make the volume", which is allowed to decrease, but only if the controller confirms that expansion is not in progress. AllocatedResources could serve that purpose, but I would expect it to go down in cases where the resize failed and the user reduced the Resources request to a smaller number.


We do, however, have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as actual shrinking of the volume and (incorrectly) reduce the user's used storage size. Since we know the actual size of the volume only after performing the expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.

To solve the aforementioned problem, we propose adding a new field to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` whenever `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question. This does mean that if a user expanded a PVC to a size it failed to reach, and then retries expansion with a lower value, the used quota will still reflect the value from the failed expansion request.
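The quota arithmetic described above can be sketched as follows. This is a minimal illustration using plain int64 byte counts; the real controller operates on resource.Quantity values, and the helper names here are hypothetical:

```go
package main

import "fmt"

const gi = int64(1) << 30

// maxInt64 returns the larger of a and b.
func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// bumpAllocated models the proposed api-server rule: AllocatedResources
// is only allowed to increase, taking the new request's value when it
// exceeds the current value.
func bumpAllocated(allocated, requested int64) int64 {
	return maxInt64(allocated, requested)
}

// quotaUsage is what the quota controller would charge for the PVC:
// max(pvc.Spec.Resources, pvc.Spec.AllocatedResources).
func quotaUsage(requested, allocated int64) int64 {
	return maxInt64(requested, allocated)
}

func main() {
	allocated := bumpAllocated(10*gi, 100*gi) // user expands a 10Gi PVC to 100Gi
	// Expansion fails; user retries with 20Gi. AllocatedResources stays at 100Gi.
	allocated = bumpAllocated(allocated, 20*gi)
	fmt.Printf("charged quota: %dGi\n", quotaUsage(20*gi, allocated)/gi)
	// prints: charged quota: 100Gi
}
```

The ratchet is the key design point: because AllocatedResources never decreases, a failed oversized expansion keeps counting against quota even after the user lowers the request.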
Contributor

I think you mean: this field is only allowed to increase and will be set by the api-server to the value in pv.Spec.Capacity, not pvc.Spec.Resources. Otherwise this is useless. The former is the actual size of the volume; the latter is the size the user controls.

- User requests a size of 20Gi.
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
- Expansion succeeds, and `pvc.Status.Capacity` and `pv.Spec.Capacity` report the new size as `20Gi`.
- `pvc.Spec.AllocatedResources`, however, keeps reporting `100Gi`.
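At the end of the flow above, the PVC would look roughly like this (a hedged sketch of the proposed field shape, not actual API output; the metadata name is made up):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc        # hypothetical name
spec:
  resources:
    requests:
      storage: 20Gi        # size the user currently requests
  allocatedResources:
    storage: 100Gi         # high-water mark; only ever increases, used for quota
status:
  capacity:
    storage: 20Gi          # actual size after the successful expansion to 20Gi
```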
Contributor

I would expect AllocatedResources to decrease to 20Gi in this case.

Member Author

Why would you expect pvc.spec.allocatedResources to be 20Gi? The field has been added only to track quota usage, and quota has always been tracked by user-requested size, not by actual volume size. It is a known limitation of this KEP that it does not reduce used quota when the requested PVC size is reduced.

Contributor

Okay I understand now

Contributor

This can be very confusing for the user so it should be documented clearly on the Kubernetes doc website once this is implemented.

Member Author

I have added this to the graduation criteria.

@gnufied gnufied (Member Author) commented Jan 29, 2020

This approach doesn't match what I expected. It seems to focus on quota tracking, but not on the problem of how to avoid getting the object into a state where the user is trying to shrink the volume.

@bswartz I do not completely understand what you are trying to say. You seem to be saying we should try to avoid putting the PVC in a state where it will require shrinking, which would imply that if we know an expansion request is going to fail for capacity reasons, we should not attempt it. That might be a valid way to solve this problem if storage providers could tell us, without actually expanding the volume, whether the operation is going to succeed (something like a dry run?). Is that something that will be possible for different storage types?

I had expected the purpose served by the new field to be "the size that Kubernetes is trying to make the volume", which is allowed to decrease, but only if the controller confirms that expansion is not in progress. AllocatedResources could serve that purpose, but I would expect it to go down in cases where the resize failed and the user reduced the Resources request to a smaller number.

It is essentially racy to try to determine whether expansion is in progress. Expansion may not be in progress but could be waiting in the expand controller's request queue; the same thing could happen on the node. The purpose of pvc.spec.allocatedResources is not to determine whether an expansion request is in flight, but to track user-requested PVC sizes.

@gnufied gnufied force-pushed the recover-from-resize-failure branch from 7835a2b to aad8bba Compare January 29, 2020 03:51
@bswartz bswartz (Contributor) commented Jan 29, 2020

After talking to @gnufied in a Zoom meeting I understand why my proposal doesn't fix things, and while this proposal isn't perfect, it's the best alternative available. The main shortcoming of this proposal can be addressed with a doc change that explains the procedure for fixing quota issues after accidentally oversizing a volume.

@xing-yang (Contributor)

This KEP should also cover the following case:
kubernetes-csi/external-resizer#62

- The external-resizer should check whether the CSI driver supports the ONLINE/OFFLINE volume expansion plugin capability. Currently this capability is documented in the CSI spec but not checked by the external-resizer.
- If a CSI driver does not support the ONLINE volume expansion plugin capability, the external-resizer should not call the driver's ControllerExpandVolume RPC if the volume is used by a pod.
- If a CSI driver does not support the OFFLINE volume expansion plugin capability, the external-resizer should not call the driver's ControllerExpandVolume RPC if the volume is not used by a pod.
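The gating in the list above can be sketched as follows. This is an illustrative Go sketch: the type and function names are invented, and the real external-resizer would read the VolumeExpansion capability (ONLINE/OFFLINE) from the driver's GetPluginCapabilities response:

```go
package main

import "fmt"

// expansionSupport is a hypothetical stand-in for the ONLINE/OFFLINE
// VolumeExpansion plugin capability a CSI driver advertises.
type expansionSupport struct {
	online  bool
	offline bool
}

// shouldCallControllerExpand sketches the proposed gating: only issue
// the controller-side expansion RPC when the driver supports the
// expansion mode matching the volume's current state.
func shouldCallControllerExpand(caps expansionSupport, inUseByPod bool) bool {
	if inUseByPod {
		return caps.online
	}
	return caps.offline
}

func main() {
	offlineOnly := expansionSupport{offline: true}
	fmt.Println(shouldCallControllerExpand(offlineOnly, true))  // false: volume in use, no ONLINE support
	fmt.Println(shouldCallControllerExpand(offlineOnly, false)) // true: volume idle, OFFLINE supported
}
```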

@gnufied gnufied (Member Author) commented Feb 4, 2020

@xing-yang thanks for linking that issue, but I think it is a separate problem from the one we are trying to fix in this KEP. I have responded on the linked GitHub issue. I do agree it would be nice to fix that issue before volume expansion goes GA.

- While the storage backend is resizing the volume, the user requests a size of 20Gi by changing pvc.spec.resources.requests["storage"] = "20Gi".
- Quota controller sees no change in storage usage by the PVC because pvc.Spec.AllocatedResources is 100Gi.
- Expansion succeeds, and pvc.Status.Capacity and pv.Spec.Capacity report the new size as 100Gi, since that is what the volume plugin did.

Contributor

Isn't there a fourth case, where the expansion happened in the storage backend but the second step fails on the node (the volume filesystem expansion fails at pod spawn), while meanwhile the user shrank the requested size? Or did I miss anything?

Member Author

I can add that as a fourth case, but it does not affect the flow I was trying to illustrate, so I skipped it.

@gnufied gnufied force-pushed the recover-from-resize-failure branch from aad8bba to 6de0f20 Compare April 29, 2020 02:25
@neolit123 neolit123 removed sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels May 18, 2020
- "@xing-yang"
editor: TBD
creation-date: 2020-01-27
last-updated: 2020-01-27
Contributor

nit: This should be updated.

Member Author

Fixed, but that lost the lgtm. Can you add it back?

@xing-yang (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 18, 2020
@gnufied gnufied force-pushed the recover-from-resize-failure branch from 25f700c to f6b8df5 Compare May 18, 2020 19:59
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 18, 2020
@xing-yang (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2020

We do, however, have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as actual shrinking of the volume and (incorrectly) reduce the user's used storage size. Since we know the actual size of the volume only after performing the expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.

To solve the aforementioned problem, we propose adding a new field to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` whenever `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question.
Member

  1. Can you explain why pvc.Status.Capacity can't be used for quota allocation? Why doesn't that always reflect the actual capacity?
  2. Why is AllocatedResources under pvc.Spec and not pvc.Status?
  3. Is AllocatedResources the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.

Member Author

Can you explain why pvc.Status.Capacity can't be used for quota allocation? Why doesn't that always reflect the actual capacity?

pvc.Status.Capacity can't be used for tracking quota because pvc.Status.Capacity is calculated after binding happens, which could be as late as when the pod is started. This would allow a user to overcommit, because quota won't reflect an accurate value until the PVC is bound to a PV.

Why is AllocatedResources under pvc.Spec and not pvc.Status?

There are two reasons:

  1. AllocatedResources is not the volume size, but rather whatever the user has requested and the resize-controller was working to reconcile towards. It is possible that the user has since requested a smaller size, but that does not change the fact that the resize-controller has already tried to expand to AllocatedResources and might have partially succeeded. So AllocatedResources is the maximum user-requested size for this volume and does not reflect the actual volume size of the PV.

  2. Following https://github.com/kubernetes/enhancements/pull/1342/files (pointed out by Jordan), which uses spec.containers[i].ResourcesAllocated for tracking user requests, keeping this field in the PVC spec makes it somewhat consistent.

Is AllocatedResources the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.

There are two reasons. The first is to follow the convention of the vertical pod scaling KEP. The second is that AllocatedResources does mean this is the size the resize controller (or PV) has tried to reconcile towards. It is possible that the user reduced the requested size and the resize controller stopped reconciliation, but the expansion may already have succeeded, so AllocatedResources sounds like an okay name.

Member

This is good context. Can you add it to the doc? Otherwise LGTM

Member Author

Added to the KEP: one under alternatives considered, and another about naming.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
Also why we can't use pvc.status.capacity
@gnufied gnufied force-pushed the recover-from-resize-failure branch from d3de0c1 to b8d183e Compare May 19, 2020 21:36
@saad-ali (Member)

I'm still a bit hesitant about adding an AllocatedResources field to fix this issue, because I fear it can lead to user confusion (it's in the spec, after all). That said, I do think this is an important issue to fix, so let's add the field as alpha and reassess.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6d3b6b7 into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020