Make EBS controllerexpansion idempotent #552
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied The full list of commands accepted by this bot can be found here. The pull request process is described here
Pull Request Test Coverage Report for Build 1219 (Coveralls)
/retest
bb38bad to a98b32a
/retest
/assign @bertinatto @jsafrane
pkg/cloud/cloud.go
Outdated
if latestMod != nil && modFetchError == nil {
	state := aws.StringValue(latestMod.ModificationState)
	if state == ec2.VolumeModificationStateModifying {
		return oldSizeGiB, fmt.Errorf("volume %q is still being expanded to size %d", volumeID, newSizeGiB)
In order to make this function idempotent, we shouldn't return an error if the volume modification is in progress, no?
It feels like it should return success if the volume is already being modified to newSizeGiB, but if the volume is being modified to a value different from newSizeGiB, then it's OK to return an error?
that doesn't seem right either, then it would return that resize has succeeded which is a lie.
attach/controllerpublish behaviour is to wait for the attach to complete if it's found to be in progress. The same could be done here
Yep - was just about to push a change that made the whole RPC call block until operation is finished.
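A minimal sketch of the decision debated in the comments above, assuming a simplified model of the latest volume modification. The `modification` type, the `action` values and `decide` are hypothetical names used for illustration only and are not the driver's code; the state strings mirror the EC2 modification states (`modifying`, `optimizing`, `completed`, `failed`).

```go
package main

import "fmt"

// modification is a hypothetical, simplified view of the most recent
// EC2 volume modification; it is not the AWS SDK type.
type modification struct {
	State         string // "modifying", "optimizing", "completed" or "failed"
	TargetSizeGiB int64
}

// action is a hypothetical result describing what an expand call should do next.
type action int

const (
	startResize action = iota // issue a new modification request
	waitForSame               // block until the in-flight resize finishes
	alreadyDone               // a resize to at least this size already took effect
)

// decide sketches an idempotent reaction to a repeated expand request.
// The returned action is meaningless when the error is non-nil.
func decide(latest *modification, newSizeGiB int64) (action, error) {
	if latest == nil || latest.State == "failed" {
		return startResize, nil
	}
	switch latest.State {
	case "modifying":
		// Same target: wait for it instead of failing or reporting success early.
		if latest.TargetSizeGiB == newSizeGiB {
			return waitForSame, nil
		}
		return startResize, fmt.Errorf("volume is being expanded to %d GiB, not %d GiB",
			latest.TargetSizeGiB, newSizeGiB)
	case "optimizing", "completed":
		if latest.TargetSizeGiB >= newSizeGiB {
			return alreadyDone, nil
		}
	}
	return startResize, nil
}
```

Under these assumptions, a retried request for the same size maps to `waitForSame` rather than to an error or to a premature success, which is the behaviour proposed in this thread.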
> that doesn't seem right either, then it would return that resize has succeeded which is a lie.
That's a good point, but currently we're doing the same thing in the first call (because it considers the volume resized once the operation is in the Optimizing state). So if Optimizing means success for the first call, it should also mean success for the subsequent calls.
> attach/controllerpublish behaviour is to wait for the attach to complete if it's found to be in progress. The same could be done here
+1
> Yep - was just about to push a change that made the whole RPC call block until operation is finished.
By waiting for the modification to be in the Completed state? If so, can/should we change the behaviour in-tree as well, to be consistent with the CSI driver?
Alright, modified so that each subsequent resize call will now wait for the previous resize to finish before returning.
> By waiting for the modification to be in the Completed state? If so, can/should we change the behaviour in-tree as well, to be consistent with the CSI driver?
No, not the "Completed" state. I reverted the code to the "optimizing" or "completed" check, because the main reason this was breaking was that describeVolume can return the updated size before the volume has really finished resizing. The "optimizing" vs "completed" state had no impact on the outcome.
Double checking: the doc says the size change can take a few seconds to take effect after the volume is in the optimizing state, and I guess it does not matter because even if the fs resize is attempted prematurely, it will just get retried and must eventually succeed.
"Size changes usually take a few seconds to complete and take effect after a volume is in the Optimizing state."
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-modifications.html
We don't return size in fs resize, it just blindly tries to resize to the full disk size, whatever it may be at the time. So is it possible that we won't fs expand to the intended size if we don't wait?
Shoot, you are right.
// Resize perform resize of file system
I am probably misreading the doc. The correct meaning must be that "optimizing" means the size change has already taken effect and only the performance change is still in progress.
> Size changes usually take a few seconds to complete and take effect after a volume is in the Optimizing state.
tbh - the wording here is imprecise in the EBS docs. We have been told by EBS engineers that the volume is "ready-to-use" after it enters the "optimizing" state, but if we wait for the "VolumeModificationComplete" state, the resize operation can take a really long time to complete. AFAICT there is no intermediate state between "optimizing" and "complete" (although the docs do imply that there is an intermediate state where size changes take effect a few seconds after the volume enters the "optimizing" state).
/assign @leakingtapan @wongma7
/retest
When returning success for volume expansion requests, we should verify both the volume size reported via DescribeVolume and any pending volume modification requests
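The check described in this commit message could be sketched as follows. As noted earlier in the thread, the reported size alone is not enough because it can be updated before the resize has finished, so the sketch also consults the latest modification state. `volumeView` and `expansionDone` are hypothetical names for illustration, not identifiers from the driver.

```go
package main

// volumeView is a hypothetical snapshot combining the two sources of truth:
// the size reported when describing the volume and the latest modification.
type volumeView struct {
	ReportedSizeGiB int64  // size reported by describing the volume
	HasModification bool   // whether any modification record exists
	ModState        string // state of the latest modification, if any
}

// expansionDone reports whether an expand request to newSizeGiB can be
// answered with success: the reported size must be large enough AND no
// modification may still be in the "modifying" (or "failed") state.
func expansionDone(v volumeView, newSizeGiB int64) bool {
	if v.ReportedSizeGiB < newSizeGiB {
		return false
	}
	if !v.HasModification {
		return true
	}
	return v.ModState == "optimizing" || v.ModState == "completed"
}
```

With this shape, a volume that already reports the new size but whose modification is still `modifying` is not yet treated as expanded.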
a98b32a to bbf2ce6
pkg/cloud/cloud.go
Outdated
m, err := c.getLatestVolumeModification(ctx, volumeID)
if err != nil {
	return 0, err
m, modFetchError := c.getLatestVolumeModification(ctx, volumeID)
Is checking the ModificationState immediately after calling ModifyVolume still necessary? Won't waitForVolumeSize do the same thing?
I have removed most of this code. I am still keeping some of the checks on the VolumeModification object after modification, so that we can return success immediately if the volume was previously expanded to the same size already.
9baf3f2 to 68686b1
lgtm
/retest
Now we are preventing such errors by checking volume modifications first
68686b1 to 0f5ece6
/retest
@bertinatto does this look good to you?
/lgtm
Fixes #498
This PR contains following fixes: