🐛 clusterctl: retry github i/o operations #6430
Conversation
retryableOperationInterval = 3 * time.Second
retryableOperationTimeout  = 5 * time.Minute
Given that GitHub is rate-limited, I suggest having a longer interval and a shorter timeout.
Otherwise lgtm
Interval increased to 10 seconds, timeout reduced to 1 minute.
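For reference, a rough sketch of the revised values, assuming (as the unit-test discussion later in the thread suggests) that they stay package-level variables rather than constants so tests can shrink them; the exact declaration in the final diff may differ:

var (
    // Sketch only: names taken from the snippet above, declaration form assumed.
    retryableOperationInterval = 10 * time.Second
    retryableOperationTimeout  = 1 * time.Minute
)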
/test pull-cluster-api-e2e-full-main
/retest
If I understood this correctly, the intention of this PR (and the linked test flake) is to make the GitHub repository client resilient to network errors. In that case the approach in this PR may not work, because wait.PollImmediate
will return immediately (and will not retry) if the condition function returns an error. For example, this would mean that if the client.Repositories.GetReleaseByTag
call fails because of a network error and returns an error, the client will return the error immediately and fail.
Since we are basically looking to retry when we hit network errors, we might want to do something like this:
var err error
_ = wait.PollImmediate(retryableOperationInterval, retryableOperationTimeout, func() (bool, error) {
    var getReleasesErr error
    release, _, getReleasesErr = client.Repositories.GetReleaseByTag(context.TODO(), g.owner, g.repository, tag)
    if getReleasesErr != nil {
        // Capture the last error so we can return it to the layer above; we
        // probably don't care about returning the poll timeout error.
        err = getReleasesErr
        // TODO: Check here whether getReleasesErr is a rate-limit error and return
        // it immediately. There is no point in retrying if we are being rate limited.
        return false, nil
    }
    if release == nil {
        return false, nil
    }
    return true, nil
})
if err != nil {
    return nil, err
}
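The rate-limit TODO above could look roughly like the helper below. This is a sketch, assuming the repository client is built on google/go-github (the call shape of client.Repositories.GetReleaseByTag suggests it is); the module version in the import path is an assumption:

import (
    "errors"

    "github.com/google/go-github/v40/github" // module version is an assumption
)

// isRateLimitErr reports whether err is a GitHub primary or secondary (abuse)
// rate-limit error. Retrying inside the poll window cannot succeed in that case,
// so the condition function should return the error and stop polling.
func isRateLimitErr(err error) bool {
    var rateLimit *github.RateLimitError
    var abuseLimit *github.AbuseRateLimitError
    return errors.As(err, &rateLimit) || errors.As(err, &abuseLimit)
}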
@ykakarap that's not my understanding of how https://github.com/kubernetes/apimachinery/blob/v0.23.5/pkg/util/wait/wait.go#L501 behaves. The advantage of using PollImmediate over Poll is that it runs the condition function immediately instead of waiting for the first interval to elapse.
That is correct. PollImmediate will run the condition function immediately. I was talking about how we want to handle the case where the call inside the condition function returns an error. My understanding: the intention of this PR is to retry the GitHub calls if they fail because of network flakes.
@ykakarap you're correct, thank you! https://go.dev/play/p/Z8ynq-10nhS Hang on for a correct implementation. @fabriziopandini sorry, I'll submit a follow-up to #6431 with the correct implementation.
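A minimal, self-contained sketch of the behavior discussed above (what the Go Playground link demonstrates): PollImmediate stops as soon as the condition function returns a non-nil error, and only a (false, nil) result triggers another attempt.

package main

import (
    "errors"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    attempts := 0
    err := wait.PollImmediate(10*time.Millisecond, 100*time.Millisecond, func() (bool, error) {
        attempts++
        // Returning a non-nil error aborts polling immediately.
        return false, errors.New("simulated network error")
    })
    // Prints "attempts: 1", i.e. the failing call was never retried.
    fmt.Printf("attempts: %d, err: %v\n", attempts, err)
}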
Force-pushed from e8f5d75 to 1fef519.
@ykakarap updated, PTAL
Force-pushed from 1fef519 to d541ce2.
Overall LGTM for the approach pending the comment fix and the lint fixes.
Force-pushed from b2618d1 to 62bf324.
/test pull-cluster-api-e2e-full-main
/retest
Force-pushed from 62bf324 to b75bc78.
/lgtm
/test pull-cluster-api-e2e-full-main
Looks good overall.
Given that it seems hard to add unit tests for this, did we do a bit of manual testing to verify we're hitting the cases correctly? (I'm aware we can't test all of them)
@sbueringer these changes actually influence existing UT, so there is coverage there (this is why I manually override the timeout in UT so they don't take 1 min for each case :))
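For illustration, a hypothetical shape of that override in a unit test; the variable names are assumed from the snippet at the top of this thread, and the real tests may differ:

func TestGetReleaseByTagWithRetries(t *testing.T) {
    // Hypothetical: shrink the package-level retry knobs so a failing case
    // doesn't block the test for the full one-minute production timeout.
    retryableOperationInterval = 200 * time.Millisecond
    retryableOperationTimeout = 1 * time.Second

    // ... exercise the GitHub repository client against a fake server here ...
}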
Force-pushed from b75bc78 to 2589f82.
Force-pushed from 2589f82 to 2e319ea.
/lgtm
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fabriziopandini. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-1.1
@jackfrancis: #6430 failed to apply on top of branch "release-1.1":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
In response to a test flake, this PR introduces retry tolerance to GitHub API-dependent operations in clusterctl.
The passing (unchanged) UTs should prove functional equivalence between this change and the existing (no-retry) behavior.
Reference test flake:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-19-1-20-main/1516508031641194496
Which issue(s) this PR fixes (optional, in fixes #<issue_number> format; will close the issue(s) when the PR gets merged):
Fixes #