🐛 clusterctl: retry github i/o operations #6430
Conversation
retryableOperationInterval = 3 * time.Second
retryableOperationTimeout  = 5 * time.Minute
Given that GitHub is rate-limited, I suggest having a longer interval and a shorter timeout.
Otherwise lgtm
Interval increased to 10 seconds, timeout reduced to 1 minute.
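For reference, a rough sketch of the revised values, assuming (as the unit-test discussion later in the thread suggests) that they stay package-level variables rather than constants so tests can shrink them; the exact declaration in the final diff may differ:

var (
    // Sketch only: names taken from the snippet above, declaration form assumed.
    retryableOperationInterval = 10 * time.Second
    retryableOperationTimeout  = 1 * time.Minute
)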
/test pull-cluster-api-e2e-full-main
/retest
If I understood this correctly, the intention of this PR (and the linked test flake) is to make the GitHub repository client resilient to network errors. In that case the approach in this PR may not work, because wait.PollImmediate
will return immediately (and will not retry) if the condition function returns an error. For example, this would mean that if the client.Repositories.GetReleaseByTag
call fails because of a network error and returns an error, the client will return the error immediately and fail.
Since we are basically looking to retry when we hit network errors, we might want to do something like this:
var err error
_ = wait.PollImmediate(retryableOperationInterval, retryableOperationTimeout, func() (bool, error) {
    var getReleasesErr error
    release, _, getReleasesErr = client.Repositories.GetReleaseByTag(context.TODO(), g.owner, g.repository, tag)
    if getReleasesErr != nil {
        // Capture the last error so we can return it to the layer above; we
        // probably don't care about returning the poll timeout error.
        err = getReleasesErr
        // TODO: Check here whether getReleasesErr is a rate-limit error and return
        // it immediately. There is no point in retrying if we are being rate limited.
        return false, nil
    }
    if release == nil {
        return false, nil
    }
    return true, nil
})
if err != nil {
    return nil, err
}
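The rate-limit TODO above could look roughly like the helper below. This is a sketch, assuming the repository client is built on google/go-github (the call shape of client.Repositories.GetReleaseByTag suggests it is); the module version in the import path is an assumption:

import (
    "errors"

    "github.com/google/go-github/v40/github" // module version is an assumption
)

// isRateLimitErr reports whether err is a GitHub primary or secondary (abuse)
// rate-limit error. Retrying inside the poll window cannot succeed in that case,
// so the condition function should return the error and stop polling.
func isRateLimitErr(err error) bool {
    var rateLimit *github.RateLimitError
    var abuseLimit *github.AbuseRateLimitError
    return errors.As(err, &rateLimit) || errors.As(err, &abuseLimit)
}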
@ykakarap that's not my understanding of how https://github.com/kubernetes/apimachinery/blob/v0.23.5/pkg/util/wait/wait.go#L501 behaves. The advantage of using PollImmediate over Poll is that it runs the condition function immediately instead of waiting for the first interval to elapse.
That is correct. PollImmediate will run the condition function immediately. I was talking about how we want to handle the case where the call inside the condition function returns an error. My understanding: the intention of this PR is to retry the GitHub calls if they fail because of network flakes.
@ykakarap you're correct, thank you! https://go.dev/play/p/Z8ynq-10nhS Hang on for a correct implementation. @fabriziopandini sorry, I'll submit a follow-up to #6431 with the correct implementation.
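A minimal, self-contained sketch of the behavior discussed above (what the Go Playground link demonstrates): PollImmediate stops as soon as the condition function returns a non-nil error, and only a (false, nil) result triggers another attempt.

package main

import (
    "errors"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    attempts := 0
    err := wait.PollImmediate(10*time.Millisecond, 100*time.Millisecond, func() (bool, error) {
        attempts++
        // Returning a non-nil error aborts polling immediately.
        return false, errors.New("simulated network error")
    })
    // Prints "attempts: 1", i.e. the failing call was never retried.
    fmt.Printf("attempts: %d, err: %v\n", attempts, err)
}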
Force-pushed from e8f5d75 to 1fef519.
@ykakarap updated, PTAL
Force-pushed from 1fef519 to d541ce2.
Overall LGTM for the approach pending the comment fix and the lint fixes.
Force-pushed from b2618d1 to 62bf324.
/test pull-cluster-api-e2e-full-main
/retest
Force-pushed from 62bf324 to b75bc78.
/lgtm
/test pull-cluster-api-e2e-full-main
Looks good overall.
Given that it seems hard to add unit tests for this, did we do a bit of manual testing to verify we're hitting the cases correctly? (I'm aware we can't test all of them)
@sbueringer these changes actually influence existing UT, so there is coverage there (this is why I manually override the timeout in UT so they don't take 1 min for each case :))
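For illustration, a hypothetical shape of that override in a unit test; the variable names are assumed from the snippet at the top of this thread, and the real tests may differ:

func TestGetReleaseByTagWithRetries(t *testing.T) {
    // Hypothetical: shrink the package-level retry knobs so a failing case
    // doesn't block the test for the full one-minute production timeout.
    retryableOperationInterval = 200 * time.Millisecond
    retryableOperationTimeout = 1 * time.Second

    // ... exercise the GitHub repository client against a fake server here ...
}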
Force-pushed from b75bc78 to 2589f82.
Force-pushed from 2589f82 to 2e319ea.
/lgtm
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fabriziopandini. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-1.1
@jackfrancis: #6430 failed to apply on top of branch "release-1.1":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
In response to a test flake, this PR introduces retry tolerance to GitHub API-dependent operations in clusterctl.
The passing (unchanged) UTs should prove functional equivalence between this change and the existing (no-retry) behavior.
Reference test flake:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-19-1-20-main/1516508031641194496
Which issue(s) this PR fixes (optional, in fixes #<issue_number> format; will close the issue(s) when the PR gets merged):
Fixes #