Make EBS controllerexpansion idempotent #552

gnufied · 2020-08-28T22:10:26Z

Fixes #498

This PR contains following fixes:

When describeVolume returns volume size to be greater than requested size, we also verify if volume has any modifications pending. I have observed that, AWS can set targetSize on volume even with pending volume modifications. This is the root cause of EBS resize flake.
When volume modifications check return volume resizing to be completed, we also verify volume size by describing volume. This is to ensure that - we minimize impact of eventual consistency by making two different API calls.

k8s-ci-robot · 2020-08-28T22:10:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [gnufied]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coveralls · 2020-08-28T22:13:44Z

Pull Request Test Coverage Report for Build 1219

44 of 60 (73.33%) changed or added relevant lines in 1 file are covered.
48 unchanged lines in 3 files lost coverage.
Overall coverage increased (+0.3%) to 80.27%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/cloud/cloud.go	44	60	73.33%

Files with Coverage Reduction	New Missed Lines	%
pkg/cloud/cloud.go	2	79.18%
pkg/util/util.go	4	89.74%
pkg/driver/controller.go	42	88.94%

Totals
Change from base Build 1209:	0.3%
Covered Lines:	1489
Relevant Lines:	1855

💛 - Coveralls

gnufied · 2020-08-28T23:38:12Z

/retest

gnufied · 2020-08-29T14:39:02Z

/retest

gnufied · 2020-08-30T21:21:52Z

/retest

gnufied · 2020-09-10T15:03:56Z

/retest

gnufied · 2020-09-10T15:04:09Z

/assign @bertinatto @jsafrane

pkg/cloud/cloud.go

bertinatto · 2020-09-11T12:53:04Z

pkg/cloud/cloud.go

+	if latestMod != nil && modFetchError == nil {
+		state := aws.StringValue(latestMod.ModificationState)
+		if state == ec2.VolumeModificationStateModifying {
+			return oldSizeGiB, fmt.Errorf("volume %q is still being expanded to size %d", volumeID, newSizeGiB)


In order to make this function idempotent, we shouldn't return an error if the volume modification is in progress, no?

It feels like it should return success if the volume is already being modified to newSizeGiB, but the volume is being modified to a value different than newSizeGiB, then it's OK to return an error?

that doesn't seem right either, then it would return that resize has succeeded which is a lie.

attach/controllerpublish behaviour is to wait for the attach to complete if it's found to be in progress. The same could be done here

Yep - was just about to push a change that made the whole RPC call block until operation is finished.

that doesn't seem right either, then it would return that resize has succeeded which is a lie.

That's a good point, but currently we're doing the same thing in the first call (because it considers the volume resized once the operation is in the Optimizing state). So if Optimizing mean success for the first call, it should also mean success for the subsequent calls.

attach/controllerpublish behaviour is to wait for the attach to complete if it's found to be in progress. The same could be done here

+1

Yep - was just about to push a change that made the whole RPC call block until operation is finished.

By waiting for the modification to be in the Completed state? If so, can/should we change the behaviour in-tree as well, to be consistent with the CSI driver?

Alright modified so that each subsequent resize call will now wait for previous resize to finish before returning.

By waiting for the modification to be in the Completed state? If so, can/should we change the behaviour in-tree as well, to be consistent with the CSI driver?

No not "Completed" state. I reverted the code to "optimizing" or "completed" check, because the main reason this was breaking was because describeVolume can return updated size before volume has really finished resizing. "optimized" vs "completed" state had no impact on the outcome.

double checking, doc says size can take a few seconds after volume is in optimizing state, and I guess it does not matter because even if fs resize is attempted prematurely, the fs resize will just get retried and must eventually succeed.

"Size changes usually take a few seconds to complete and take effect after a volume is in the Optimizing state. "
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-modifications.html

We don't return size in fs resize, it just blindly tries to resize to the full disk size, whatever it may be at the time. So is it possible that we won't fs expand to the intended size if we don't wait?

shoot you are right.

aws-ebs-csi-driver/vendor/k8s.io/kubernetes/pkg/util/resizefs/resizefs_linux.go

Line 38 in 9b33867

// Resize perform resize of file system

I am probably misreading the doc, the correct meaning must be that "optimizing" means the size change has already taken effect and only the performance change is still in progress.

Size changes usually take a few seconds to complete and take effect after a volume is in the Optimizing state.

tbh - the wording here is imprecise in EBS docs. We have been told by EBS engineers that volume is "ready-to-use
after it enters "optimizing" state but if we wait for "VolumeModificationComplete" state then, it can take really long time to complete resize operation and afaict there is no intermediate state between "optimizing" and "complete" (although docs do imply that there is a intermediate state where size changes take effect few seconds after volume enters "optimizing" state).

pkg/cloud/cloud.go

bertinatto · 2020-09-11T13:00:48Z

/assign @leakingtapan @wongma7

wongma7 · 2020-09-11T23:39:17Z

/retest

When returning successful for volume expansion requests we should verify both volume size reported via DescribeVolume and pending volume modifications requests

wongma7 · 2020-09-14T21:53:44Z

pkg/cloud/cloud.go

-		m, err := c.getLatestVolumeModification(ctx, volumeID)
-		if err != nil {
-			return 0, err
+		m, modFetchError := c.getLatestVolumeModification(ctx, volumeID)


is checking the ModificationState immediately after calling ModifyVolume still necessary, won't waitForVolumeSize do the same thing?

I have removed most of this code. Still keeping some of the check in VolumeModification object after modification, so as we can return success immediately if volume was previously expnaded to same size already..

wongma7 · 2020-09-21T21:33:02Z

lgtm
/retest

gnufied · 2020-09-22T14:23:50Z

/retest

Now we are preventing such errors by checking volume modifications first

gnufied · 2020-09-22T18:52:38Z

/retest

gnufied · 2020-09-22T19:37:37Z

/retest

wongma7 · 2020-09-22T22:18:41Z

@bertinatto this look good to u?

bertinatto

/lgtm

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 28, 2020

k8s-ci-robot requested review from ddebroy and leakingtapan August 28, 2020 22:10

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 28, 2020

gnufied mentioned this pull request Aug 28, 2020

Attach volume fails after resizing completed #498

Closed

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 1, 2020

gnufied force-pushed the make-controller-expansion-idempotent branch 2 times, most recently from bb38bad to a98b32a Compare September 9, 2020 15:32

k8s-ci-robot assigned bertinatto and jsafrane Sep 10, 2020

bertinatto reviewed Sep 11, 2020

View reviewed changes

k8s-ci-robot assigned leakingtapan and wongma7 Sep 11, 2020

gnufied added 3 commits September 14, 2020 16:06

Make EBS controllerexpansion idempotent

56e368c

Verify pending volume modifications and size both

be154b5

When returning successful for volume expansion requests we should verify both volume size reported via DescribeVolume and pending volume modifications requests

Wait for modifications to complete

bbf2ce6

gnufied force-pushed the make-controller-expansion-idempotent branch from a98b32a to bbf2ce6 Compare September 14, 2020 20:29

wongma7 reviewed Sep 14, 2020

View reviewed changes

wongma7 approved these changes Sep 16, 2020

View reviewed changes

gnufied force-pushed the make-controller-expansion-idempotent branch from 9baf3f2 to 68686b1 Compare September 21, 2020 19:22

Remove code that handled resize progress

0f5ece6

Now we are preventing such errors by checking volume modifications first

gnufied force-pushed the make-controller-expansion-idempotent branch from 68686b1 to 0f5ece6 Compare September 22, 2020 18:35

bertinatto reviewed Sep 23, 2020

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2020

k8s-ci-robot merged commit e95bb1d into kubernetes-sigs:master Sep 23, 2020

gnufied mentioned this pull request Oct 1, 2020

Bug 1879192: Make EBS volume expansion idempotent openshift/aws-ebs-csi-driver#171

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make EBS controllerexpansion idempotent #552

Make EBS controllerexpansion idempotent #552

gnufied commented Aug 28, 2020 •

edited

Loading

k8s-ci-robot commented Aug 28, 2020

coveralls commented Aug 28, 2020 •

edited

Loading

gnufied commented Aug 28, 2020

gnufied commented Aug 29, 2020

gnufied commented Aug 30, 2020

gnufied commented Sep 10, 2020

gnufied commented Sep 10, 2020

bertinatto Sep 11, 2020

wongma7 Sep 12, 2020

gnufied Sep 12, 2020

bertinatto Sep 14, 2020

gnufied Sep 14, 2020 •

edited

Loading

wongma7 Sep 14, 2020

msau42 Sep 14, 2020

wongma7 Sep 14, 2020

gnufied Sep 15, 2020 •

edited

Loading

bertinatto commented Sep 11, 2020

wongma7 commented Sep 11, 2020

wongma7 Sep 14, 2020

gnufied Sep 21, 2020

wongma7 commented Sep 21, 2020

gnufied commented Sep 22, 2020

gnufied commented Sep 22, 2020

gnufied commented Sep 22, 2020

wongma7 commented Sep 22, 2020

bertinatto left a comment

Make EBS controllerexpansion idempotent #552

Make EBS controllerexpansion idempotent #552

Conversation

gnufied commented Aug 28, 2020 • edited Loading

k8s-ci-robot commented Aug 28, 2020

coveralls commented Aug 28, 2020 • edited Loading

Pull Request Test Coverage Report for Build 1219

💛 - Coveralls

gnufied commented Aug 28, 2020

gnufied commented Aug 29, 2020

gnufied commented Aug 30, 2020

gnufied commented Sep 10, 2020

gnufied commented Sep 10, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnufied Sep 14, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnufied Sep 15, 2020 • edited Loading

Choose a reason for hiding this comment

bertinatto commented Sep 11, 2020

wongma7 commented Sep 11, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wongma7 commented Sep 21, 2020

gnufied commented Sep 22, 2020

gnufied commented Sep 22, 2020

gnufied commented Sep 22, 2020

wongma7 commented Sep 22, 2020

bertinatto left a comment

Choose a reason for hiding this comment

gnufied commented Aug 28, 2020 •

edited

Loading

coveralls commented Aug 28, 2020 •

edited

Loading

gnufied Sep 14, 2020 •

edited

Loading

gnufied Sep 15, 2020 •

edited

Loading