Add KEP for recovering from volume expansion failure. #1516
Conversation
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
Please note that to make 1.18 this KEP needs to merge TODAY, with |
/assign @davidz627
mostly looks good but I have some questions. Also I think some of the wording is imprecise so sometimes it's hard to follow exactly why we're doing things or what component is doing what at what time.
### Risks and Mitigations
- One risk, as mentioned above, is that if expansion failed and the user retried expansion (successfully) with a smaller value, the quota code will keep reporting the higher value. In practice this should be acceptable, since such expansion failures should be rare and an admin can unblock the user by increasing the quota or rebuilding the PVC if needed. We will emit events on the PV and PVC to alert admins and users.
Is there any follow-up we can do to automatically reconcile quota or fix this weird state, or will it just persist in the cluster "forever" until someone realizes something is wrong and has to go in and manually fix it?
We will document how to fix the quota, but I do not think a mechanism to reconcile the quota automatically will be possible. Added something similar in the alternatives considered section.
### Test Plan
* Basic unit tests for storage strategy of PVC and quota system.
* E2e tests using mock driver to cause failure on expansion and recovery.
Should this be added to the external test suite so all CSI drivers can also be tested against this?
FYI my working hours end around 7:00pm PST today. If you make an update before then I will be able to review it, otherwise I will not be able to lgtm this PR today (but someone else might).
This approach doesn't match what I expected. It seems to focus on quota tracking, but not on the problem of how to avoid getting the object into a state where the user is trying to shrink the volume.
I had expected the purpose served by the new field to be "the size that Kubernetes is trying to make the volume", and that it would be allowed to decrease, but only if the controller confirms that expansion is not in progress. AllocatedResources could serve that purpose, but I would expect it to go down in cases where the resize failed and the user reduced the Resources request to a smaller number.
We however do have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as an actual shrinking of the volume and (incorrectly) reduce the storage usage attributed to the user. Since we know the actual size of the volume only after performing expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.
To solve the aforementioned problem, we propose that a new field be added to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` as long as `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question. This does mean that if a user expanded a PVC to a size it failed to reach and then retries the expansion with a lower value, the used quota will still reflect the value from the failed expansion request.
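A minimal sketch of the proposed calculation, written against a pared-down stand-in type rather than the real `PersistentVolumeClaim` (the `AllocatedResources` field does not exist in the API yet, and `storageUsage` is a hypothetical helper, not the actual quota evaluator):

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// claim is a pared-down stand-in for a PVC, carrying only the two values the
// proposed quota calculation cares about.
type claim struct {
	RequestedStorage   resource.Quantity // pvc.Spec.Resources.Requests["storage"]
	AllocatedResources resource.Quantity // the proposed pvc.Spec.AllocatedResources["storage"]
}

// storageUsage returns the quantity charged against quota:
// max(pvc.Spec.Resources, pvc.Spec.AllocatedResources).
func storageUsage(c claim) resource.Quantity {
	if c.AllocatedResources.Cmp(c.RequestedStorage) > 0 {
		return c.AllocatedResources
	}
	return c.RequestedStorage
}
```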
I think you mean: this field is only allowed to increase and will be set by the api-server to the value in `pv.Spec.Capacity`, not `pvc.Spec.Resources`. Otherwise this is useless. The former is the actual size of the volume, the latter is the size the user controls.
- User requests size 20Gi.
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report the new size as `20Gi`.
- `pvc.Spec.AllocatedResources` however keeps reporting `100Gi`.
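A sketch of the ratcheting behaviour in the flow above, extending the pared-down `claim` type from the earlier sketch (`syncAllocatedResources` is a hypothetical name; the real logic would live in the api-server's PVC update path):

```go
// syncAllocatedResources models the proposed api-server behaviour:
// AllocatedResources only ever ratchets upwards towards the requested size.
func syncAllocatedResources(c *claim) {
	if c.RequestedStorage.Cmp(c.AllocatedResources) > 0 {
		// The request grew past the previous high-water mark: record it.
		c.AllocatedResources = c.RequestedStorage
	}
	// Otherwise (e.g. the request is lowered to 20Gi while AllocatedResources
	// is 100Gi) the field is left untouched, so quota keeps charging 100Gi.
}
```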
I would expect AllocatedResources to decrease to 20Gi in this case.
Why would you expect `pvc.spec.allocatedResources` to be 20Gi? The field has been added only to track quota usage, and quota has always been tracked by the user-requested size, not by the actual volume size. This is a known limitation of the KEP: it does not reduce used quota when the requested PVC size is reduced.
Okay I understand now
This can be very confusing for the user so it should be documented clearly on the Kubernetes doc website once this is implemented.
I have added this in the graduation criteria.
@bswartz I do not completely understand what you are trying to say. You seem to be saying we should try to avoid putting the PVC in a state where it will require shrinking, which would imply that if we know an expansion request is going to fail because of capacity reasons, we should not attempt it. It might be a valid way to solve this problem if storage providers could tell us, without actually expanding the volume, whether the operation is going to succeed (something like a dry-run operation?). Is this something that will be possible for different storage types?
It is essentially racy to try to determine whether expansion is in progress or not. Expansion may not be in progress but could be waiting in the expand controller's request queue. The same thing could happen on the node too. Purpose of
After talking to @gnufied in a Zoom meeting I understand why my proposal doesn't fix things, and while this proposal isn't perfect, it's the best alternative available. The main shortcoming of this proposal can be addressed with a doc change that explains the procedure for fixing quota issues after accidentally oversizing a volume.
This KEP should also cover the following case:
@xing-yang thanks for linking that issue, but I think that is a separate problem from the one we are trying to fix in this KEP. I have responded to the linked GitHub issue. I do agree it would be nice to fix the issue you linked above before volume expansion goes GA.
- While the storage backend is resizing the volume, the user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`.
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is 100Gi.
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report the new size as 100Gi, as that's what the volume plugin did.
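For illustration only, here is the end state of this case traced through the hypothetical `claim`, `syncAllocatedResources`, and `storageUsage` sketches from the earlier comments (not real Kubernetes code):

```go
func exampleCase() resource.Quantity {
	// The api-server ratcheted AllocatedResources to 100Gi when the user
	// first asked for the expansion.
	c := claim{
		RequestedStorage:   resource.MustParse("100Gi"),
		AllocatedResources: resource.MustParse("100Gi"),
	}

	// While the backend is still resizing, the user lowers the request.
	c.RequestedStorage = resource.MustParse("20Gi")
	syncAllocatedResources(&c) // no-op: 20Gi < 100Gi

	// Meanwhile the expansion completes at 100Gi, which pvc.Status.Capacity
	// and pv.Spec.Capacity will report. Quota usage, however, is still
	// derived from the claim's spec fields:
	return storageUsage(c) // 100Gi – unchanged despite the 20Gi request
}
```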
Isn't there a fourth case where the expansion happened in the storage backend but the second step fails on the node, i.e. the volume filesystem expansion fails on pod spawn, while meanwhile the user changed/shrank the size to a smaller capacity? Or did I miss anything?
I can add that as a fourth case, but it does not affect the flow I was trying to illustrate, so I skipped it.
- "@xing-yang" | ||
editor: TBD | ||
creation-date: 2020-01-27 | ||
last-updated: 2020-01-27 |
nit: This should be updated.
fixed. But lost lgtm. can you add it back?
/lgtm
/lgtm
We however do have a problem with quota calculation: if a previously issued expansion succeeded but was not recorded (or was only partially recorded) in the api-server, and the user then reduces the requested size of the PVC, the quota controller will treat it as an actual shrinking of the volume and (incorrectly) reduce the storage usage attributed to the user. Since we know the actual size of the volume only after performing expansion (either on the node or in the controller), allowing quota to be reduced on PVC size reduction would let a user abuse the quota system.
To solve the aforementioned problem, we propose that a new field be added to the PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to the value in `pvc.Spec.Resources` as long as `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. The quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC in question.
- Can you explain why `pvc.Status.Capacity` can't be used for quota allocation? Why doesn't that always reflect the actual capacity?
- Why is `AllocatedResources` under `pvc.Spec` and not `pvc.Status`?
- Is `AllocatedResources` the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.
> Can you explain why `pvc.Status.Capacity` can't be used for quota allocation? Why doesn't that always reflect the actual capacity?

`pvc.Status.Capacity` can't be used for tracking quota because it is calculated after binding happens, which could be when the pod is started. This would allow a user to overcommit, because quota won't reflect an accurate value until the PVC is bound to a PV.
> Why is AllocatedResources under pvc.Spec and not pvc.Status?

There are two reasons:

- `AllocatedResources` is not the volume size but more like whatever the user has requested and towards which the resize-controller was working to reconcile. It is possible that the user has requested a smaller size since then, but that does not change the fact that the resize-controller has already tried to expand to `AllocatedResources` and might have partially succeeded. So `AllocatedResources` is the maximum user-requested size for this volume and does not reflect the actual volume size of the PV.
- Following https://github.com/kubernetes/enhancements/pull/1342/files (pointed out by Jordan), which uses `spec.containers[i].ResourcesAllocated` for tracking user requests, keeping this in the PVC spec makes it somewhat consistent.

> Is AllocatedResources the best name? As I understand it, it will contain the max size requested (regardless of actual size), right? "Allocated" suggests that is the real size allocated on the backend.

There are two reasons. First, it follows the convention of the vertical pod scaling KEP. Second, `AllocatedResources` does mean this is the size the resize controller (or PV) has tried to reconcile towards. It is possible that the user reduced the requested size and the resize controller then stopped reconciliation, but it may already have succeeded, so `AllocatedResources` does sound like an okay name.
This is good context. Can you add it to the doc? Otherwise LGTM
Added to the KEP: one under alternatives considered and another about naming. Also why we can't use `pvc.status.capacity`.
I'm still a bit hesitant on adding a /approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gnufied, saad-ali
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Add a KEP for recovering from resize failure.
xref: #1790
/assign @msau42 @jsafrane @saad-ali
cc @kubernetes/sig-storage-api-reviews