-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
detect Retry-After during async “does resource exist?” flow #2688
detect Retry-After during async “does resource exist?” flow #2688
Conversation
Feel free to rename it (here if you don't plan or backporting, or another PR if this one is meant to be backported), I honestly can't remember why it's called "CreateResource" (either |
I tagged this as a feature, so I don't plan to backport it. I think I'd prefer to do a rename in a separate PR. That way we can spend some time w/ the community getting feedback on a better name as distinct from landing this change (which should be reviewed on its own merits). |
azure/services/async/async.go
Outdated
ret = 1 * time.Minute | ||
} | ||
} | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we return the Default requeue time instead of 0 if no retry after is found? Like above https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2688/files#diff-0e657fbf13cf152e97cf8871a3baf550199d64f99f94316e7e1b9eeb5d6cc8e4R202
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea here would be that if we receive a strongly typed autorest.DetailedError
error, then we should always classify that as a transient error w/ default requeue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would make sense, since we don't want to hammer the Azure APIs if an unexpected error occurs, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved all of this into the getRetryAfterFromError
func, and now the GET in the flow of CreateResource
always returns a transient azure.ReconcileError
.
Is that correct? That's a change from the present, where non-404 errors are always returned as vanilla errors (err != nil && !azure.ResourceNotFound(err)
. Is there a chance that changing this to transient azure.ReconcileError
will result in a mis-classification of non-transient errors?
40c87eb
to
a9c9715
Compare
b33ee7b
to
d897e8b
Compare
ea7b62f
to
df4a3b4
Compare
if detailedError.Response != nil { | ||
// If we have Retry-After HTTP header data for any reason, prefer it | ||
if retryAfter := detailedError.Response.Header.Get("Retry-After"); retryAfter != "" { | ||
// This handles the case where Retry-After data is in the form of units of seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we expecting both scenarios to be possible (units of seconds and absolute time)? Same question with autorest.DetailedError
vs not ? Are there any SDK contracts for those error responses / headers that we can follow? Seems strange that we have to handle multiple possibilities, it's like we're guessing what the reponse might look like instead of expecting a specific format/unit/type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we expecting both scenarios to be possible (units of seconds and absolute time)
As far as I can tell from research, the spec is overloaded to expect both value type flavors, so I think it's the best practice to deal with both types wherever we parse Retry-After
HTTP header data.
Same question with autorest.DetailedError vs not
The autorest.DetailedError
error implementation is one that we definitely know about from the usage above in the async flow, based on the specific SDK API we're currently re-using. In that sense you could say that this helper function is sort of tightly coupled to the particular implementation of capz at this point in time (initially this foo was inline in the CreateResource
func but I split it out for code maintenance reasons). So tl;dr we only have access to Retry-After
data in this particular case via the err
response from the underlying service Get
implementations, and we only know how to get at it (as of right now) if the err is of "type" autorest.DetailedError
. Hope that makes sense!
} | ||
// If we didn't find Retry-After HTTP header data but the response type is 429, | ||
// we'll have to come up with our sane default. | ||
} else if detailedError.Response.StatusCode == http.StatusTooManyRequests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, is that a real scenario or are we doing that just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely a real scenario, as a autorest.DetailedError
error has a strongly typed HTTP response object in it, which gives us access to the HTTP response status code.
The larger idea here is that not all API responses are well behaving, and if we get an HTTP 429 without Retry-After
data we can at least trust that the HTTP 429 indicates we're sending too many requests, which is (IMO) sufficient justification for telling the controller to requeue the next request a little later in time than it would otherwise.
df4a3b4
to
8fa5c9d
Compare
8fa5c9d
to
bcee198
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CecileRobertMichon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Yay here's a PR where UT coverage data is gathered and displayed as part of the single "-test" job (no extra "-coverage" job). cc @mboersma @CecileRobertMichon (thanks again @fabriziopandini) |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a single additional vector of
Retry-After
HTTP response detection during async resource reconciliation. When we enter into theCreateResource
async service method (which is in fact a "create or update resource" flow) then we initiate a GET against the Azure API to determine if the resource already exists. Because this GET may yield a HTTP 429 response (or any other response w/Retry-After
header data) we want to check that, to ensure that we inform the parent controller that we should requeue the request for later at a time after the specifiedRetry-After
value from Azure.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: