Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detect Retry-After during async “does resource exist?” flow #2688

Conversation

jackfrancis
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a single additional vector of Retry-After HTTP response detection during async resource reconciliation. When we enter into the CreateResource async service method (which is in fact a "create or update resource" flow) then we initiate a GET against the Azure API to determine if the resource already exists. Because this GET may yield a HTTP 429 response (or any other response w/ Retry-After header data) we want to check that, to ensure that we inform the parent controller that we should requeue the request for later at a time after the specified Retry-After value from Azure.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

detect Retry-After during async “does resource exist?” flow

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 30, 2022
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2022
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 30, 2022
@CecileRobertMichon
Copy link
Contributor

CreateResource async service method (which is in fact a "create or update resource" flow)

Feel free to rename it (here if you don't plan or backporting, or another PR if this one is meant to be backported), I honestly can't remember why it's called "CreateResource" (either ReconcileResource or CreateOrUpdateResource probably makes more sense)

@jackfrancis
Copy link
Contributor Author

CreateResource async service method (which is in fact a "create or update resource" flow)

Feel free to rename it (here if you don't plan or backporting, or another PR if this one is meant to be backported), I honestly can't remember why it's called "CreateResource" (either ReconcileResource or CreateOrUpdateResource probably makes more sense)

I tagged this as a feature, so I don't plan to backport it. I think I'd prefer to do a rename in a separate PR. That way we can spend some time w/ the community getting feedback on a better name as distinct from landing this change (which should be reviewed on its own merits).

ret = 1 * time.Minute
}
}
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea here would be that if we receive a strongly typed autorest.DetailedError error, then we should always classify that as a transient error w/ default requeue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense, since we don't want to hammer the Azure APIs if an unexpected error occurs, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved all of this into the getRetryAfterFromError func, and now the GET in the flow of CreateResource always returns a transient azure.ReconcileError.

Is that correct? That's a change from the present, where non-404 errors are always returned as vanilla errors (err != nil && !azure.ResourceNotFound(err). Is there a chance that changing this to transient azure.ReconcileError will result in a mis-classification of non-transient errors?

@jackfrancis jackfrancis force-pushed the retry-after-create-resource-get branch from 40c87eb to a9c9715 Compare September 30, 2022 17:30
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2022
@jackfrancis jackfrancis force-pushed the retry-after-create-resource-get branch 2 times, most recently from b33ee7b to d897e8b Compare September 30, 2022 18:13
@jackfrancis jackfrancis force-pushed the retry-after-create-resource-get branch 4 times, most recently from ea7b62f to df4a3b4 Compare September 30, 2022 21:01
@jackfrancis jackfrancis added this to the v1.6 milestone Oct 4, 2022
if detailedError.Response != nil {
// If we have Retry-After HTTP header data for any reason, prefer it
if retryAfter := detailedError.Response.Header.Get("Retry-After"); retryAfter != "" {
// This handles the case where Retry-After data is in the form of units of seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we expecting both scenarios to be possible (units of seconds and absolute time)? Same question with autorest.DetailedError vs not ? Are there any SDK contracts for those error responses / headers that we can follow? Seems strange that we have to handle multiple possibilities, it's like we're guessing what the reponse might look like instead of expecting a specific format/unit/type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we expecting both scenarios to be possible (units of seconds and absolute time)

As far as I can tell from research, the spec is overloaded to expect both value type flavors, so I think it's the best practice to deal with both types wherever we parse Retry-After HTTP header data.

Same question with autorest.DetailedError vs not

The autorest.DetailedError error implementation is one that we definitely know about from the usage above in the async flow, based on the specific SDK API we're currently re-using. In that sense you could say that this helper function is sort of tightly coupled to the particular implementation of capz at this point in time (initially this foo was inline in the CreateResource func but I split it out for code maintenance reasons). So tl;dr we only have access to Retry-After data in this particular case via the err response from the underlying service Get implementations, and we only know how to get at it (as of right now) if the err is of "type" autorest.DetailedError. Hope that makes sense!

}
// If we didn't find Retry-After HTTP header data but the response type is 429,
// we'll have to come up with our sane default.
} else if detailedError.Response.StatusCode == http.StatusTooManyRequests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, is that a real scenario or are we doing that just in case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a real scenario, as a autorest.DetailedError error has a strongly typed HTTP response object in it, which gives us access to the HTTP response status code.

The larger idea here is that not all API responses are well behaving, and if we get an HTTP 429 without Retry-After data we can at least trust that the HTTP 429 indicates we're sending too many requests, which is (IMO) sufficient justification for telling the controller to requeue the next request a little later in time than it would otherwise.

@jackfrancis jackfrancis force-pushed the retry-after-create-resource-get branch from df4a3b4 to 8fa5c9d Compare October 4, 2022 21:59
@jackfrancis jackfrancis force-pushed the retry-after-create-resource-get branch from 8fa5c9d to bcee198 Compare October 5, 2022 22:10
@CecileRobertMichon
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2022
@jackfrancis
Copy link
Contributor Author

Yay here's a PR where UT coverage data is gathered and displayed as part of the single "-test" job (no extra "-coverage" job).

cc @mboersma @CecileRobertMichon (thanks again @fabriziopandini)

@k8s-ci-robot k8s-ci-robot merged commit 0da9473 into kubernetes-sigs:main Oct 5, 2022
@jackfrancis jackfrancis deleted the retry-after-create-resource-get branch October 5, 2022 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants