detect Retry-After during async “does resource exist?” flow #2688

jackfrancis · 2022-09-30T17:08:46Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds one more point of Retry-After HTTP response detection during async resource reconciliation. The CreateResource async service method (which is in fact a "create or update resource" flow) first issues a GET against the Azure API to determine whether the resource already exists. That GET may return an HTTP 429 response (or any other response with Retry-After header data), so we now check for it and tell the parent controller to requeue the request no earlier than the Retry-After value Azure specified.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

squashed commits
includes documentation
adds unit tests

Release note:

detect Retry-After during async “does resource exist?” flow

CecileRobertMichon · 2022-09-30T17:20:18Z

CreateResource async service method (which is in fact a "create or update resource" flow)

Feel free to rename it (here if you don't plan on backporting, or in another PR if this one is meant to be backported). I honestly can't remember why it's called "CreateResource" (either ReconcileResource or CreateOrUpdateResource probably makes more sense).

azure/services/async/async.go

jackfrancis · 2022-09-30T17:22:13Z

CreateResource async service method (which is in fact a "create or update resource" flow)

Feel free to rename it (here if you don't plan on backporting, or in another PR if this one is meant to be backported). I honestly can't remember why it's called "CreateResource" (either ReconcileResource or CreateOrUpdateResource probably makes more sense).

I tagged this as a feature, so I don't plan to backport it. I think I'd prefer to do a rename in a separate PR. That way we can spend some time w/ the community getting feedback on a better name as distinct from landing this change (which should be reviewed on its own merits).

CecileRobertMichon · 2022-09-30T17:23:33Z

azure/services/async/async.go

+			ret = 1 * time.Minute
+		}
+	}
+	return


could we return the Default requeue time instead of 0 if no retry after is found? Like above https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2688/files#diff-0e657fbf13cf152e97cf8871a3baf550199d64f99f94316e7e1b9eeb5d6cc8e4R202

So the idea here would be that if we receive a strongly typed autorest.DetailedError error, then we should always classify that as a transient error w/ default requeue?

I think that would make sense, since we don't want to hammer the Azure APIs if an unexpected error occurs, what do you think?

I've moved all of this into the getRetryAfterFromError func, and now the GET in the flow of CreateResource always returns a transient azure.ReconcileError.

Is that correct? That's a change from the present, where non-404 errors are always returned as vanilla errors (err != nil && !azure.ResourceNotFound(err)). Is there a chance that changing this to a transient azure.ReconcileError will result in misclassification of non-transient errors?

azure/services/async/async.go

CecileRobertMichon · 2022-10-04T21:54:27Z

azure/services/async/async.go

+		if detailedError.Response != nil {
+			// If we have Retry-After HTTP header data for any reason, prefer it
+			if retryAfter := detailedError.Response.Header.Get("Retry-After"); retryAfter != "" {
+				// This handles the case where Retry-After data is in the form of units of seconds


are we expecting both scenarios to be possible (units of seconds and absolute time)? Same question with autorest.DetailedError vs not? Are there any SDK contracts for those error responses / headers that we can follow? Seems strange that we have to handle multiple possibilities, it's like we're guessing what the response might look like instead of expecting a specific format/unit/type.

are we expecting both scenarios to be possible (units of seconds and absolute time)

As far as I can tell from research, the spec is overloaded to expect both value type flavors, so I think it's the best practice to deal with both types wherever we parse Retry-After HTTP header data.

Same question with autorest.DetailedError vs not

The autorest.DetailedError error implementation is one that we definitely know about from the usage above in the async flow, based on the specific SDK API we're currently re-using. In that sense you could say that this helper function is sort of tightly coupled to the particular implementation of capz at this point in time (initially this foo was inline in the CreateResource func but I split it out for code maintenance reasons). So tl;dr we only have access to Retry-After data in this particular case via the err response from the underlying service Get implementations, and we only know how to get at it (as of right now) if the err is of "type" autorest.DetailedError. Hope that makes sense!

CecileRobertMichon · 2022-10-04T21:56:38Z

azure/services/async/async.go

+				}
+				// If we didn't find Retry-After HTTP header data but the response type is 429,
+				// we'll have to come up with our sane default.
+			} else if detailedError.Response.StatusCode == http.StatusTooManyRequests {


same here, is that a real scenario or are we doing that just in case?

This is definitely a real scenario, as an autorest.DetailedError error has a strongly typed HTTP response object in it, which gives us access to the HTTP response status code.

The larger idea here is that not all API responses are well behaving, and if we get an HTTP 429 without Retry-After data we can at least trust that the HTTP 429 indicates we're sending too many requests, which is (IMO) sufficient justification for telling the controller to requeue the next request a little later in time than it would otherwise.

azure/defaults.go

CecileRobertMichon · 2022-10-05T22:18:02Z

/lgtm
/approve

k8s-ci-robot · 2022-10-05T22:18:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jackfrancis · 2022-10-05T22:52:03Z

Yay here's a PR where UT coverage data is gathered and displayed as part of the single "-test" job (no extra "-coverage" job).

cc @mboersma @CecileRobertMichon (thanks again @fabriziopandini)

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 30, 2022

k8s-ci-robot requested a review from alexeldeib September 30, 2022 17:08

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2022

k8s-ci-robot requested a review from mboersma September 30, 2022 17:08

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 30, 2022

CecileRobertMichon reviewed Sep 30, 2022

azure/services/async/async.go (outdated)

CecileRobertMichon reviewed Sep 30, 2022

jackfrancis force-pushed the retry-after-create-resource-get branch from 40c87eb to a9c9715 Compare September 30, 2022 17:30

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2022

jackfrancis force-pushed the retry-after-create-resource-get branch 2 times, most recently from b33ee7b to d897e8b Compare September 30, 2022 18:13

CecileRobertMichon reviewed Sep 30, 2022

azure/services/async/async.go (outdated)

jackfrancis force-pushed the retry-after-create-resource-get branch 4 times, most recently from ea7b62f to df4a3b4 Compare September 30, 2022 21:01

jackfrancis added this to the v1.6 milestone Oct 4, 2022

CecileRobertMichon reviewed Oct 4, 2022

azure/services/async/async.go (outdated)

CecileRobertMichon reviewed Oct 4, 2022

azure/services/async/async.go (outdated)

CecileRobertMichon reviewed Oct 4, 2022

jackfrancis force-pushed the retry-after-create-resource-get branch from df4a3b4 to 8fa5c9d Compare October 4, 2022 21:59

CecileRobertMichon reviewed Oct 5, 2022

azure/defaults.go (outdated)

detect Retry-After during async “does resource exist?” flow

bcee198

jackfrancis force-pushed the retry-after-create-resource-get branch from 8fa5c9d to bcee198 Compare October 5, 2022 22:10

k8s-ci-robot assigned CecileRobertMichon Oct 5, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2022

k8s-ci-robot merged commit 0da9473 into kubernetes-sigs:main Oct 5, 2022

jackfrancis deleted the retry-after-create-resource-get branch October 5, 2022 23:21

jackfrancis mentioned this pull request Nov 3, 2022

Retry-After data should be evaluated in Azure API responses #2674

Closed
