Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: retry on unauthorized error when retrieving resources by gvk #449

Merged
merged 2 commits into from
Aug 19, 2022

Conversation

leoluz
Copy link
Contributor

@leoluz leoluz commented Aug 19, 2022

Signed-off-by: Leonardo Luz Almeida [email protected]

@leoluz leoluz requested review from jannfis and crenshaw-dev August 19, 2022 02:43
serverRes = val
err = nil
} else {
retry.OnError(retry.DefaultRetry, isRetriable, func() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need an abort condition here, like a maximum retry count. Otherwise it'll retry unless the function succeeds (and in case of a hard authorization error, probably this would be forever).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agree with @jannfis. You can use the .Cap field in wait.Backoff{} to cap the duration of retry.

  2. Why aren't we checking the error return value of retry.OnError()?

Copy link
Contributor Author

@leoluz leoluz Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jannfis @jessesuen

The DefaultRetry is already providing the max retry count:

var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

The unit tests are validating that the retry will do max 5 attempts and if it fails it will return the last error.
The retry is invoking additional attempts only if the returned error is Unauthorized as implemented in the isRetryable function:

	isRetryable := func(err error) bool {
		if apierr.IsUnauthorized(err) {
			return true
		}
		return false
	}

Why aren't we checking the error return value of retry.OnError()?

@jessesuen The error is handled by the code outside of the retry call. I kept the same logic there to preserve the current behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jannfis @jessesuen Added one more test-case to validate that retry is just invoked if returned error is Unauthorized

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right! The default retry has a default limit. Thanks for clarifying that.

Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on this change, you are only retrying api discovery. But couldn't the problem happen at any point in time?

I'm curious if your error scenario only happen during controller startup, or if it happens even when a controller has ben running for some time. KIAM is notorious for having auth errors during initial startup of a pod. If errors happen only at that time, if you would benefit from a different approach (e.g. auth check retry at startup instead of every reconciliation).

pkg/sync/sync_context.go Outdated Show resolved Hide resolved
serverRes = val
err = nil
} else {
retry.OnError(retry.DefaultRetry, isRetriable, func() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agree with @jannfis. You can use the .Cap field in wait.Backoff{} to cap the duration of retry.

  2. Why aren't we checking the error return value of retry.OnError()?

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@leoluz
Copy link
Contributor Author

leoluz commented Aug 19, 2022

@jessesuen

Based on this change, you are only retrying api discovery. But couldn't the problem happen at any point in time?

This is a really good question that was intriguing me while chasing this bug. In fact for all our internal syncs (you know that we have quite a few a day), this is the only point where ArgoCD gets an Unauthorized error. @crenshaw-dev inspected the audit logs and realized that we are calling this API ~1M times a day. This is by far the most invoked request that we are sending to kube-api. I implemented a very basic cache in this PR just to avoid the same gvk to be called multiple times. We have applications with more than 500 RoleBindings in it. Previously it would call the API 500 times per sync and now it is cached. However I believe that we could be a lot more aggressive and cache this response for the entire controller life cycle. I would be super interested in hearing opinions about this. (cc @alexmt)

I'm curious if your error scenario only happen during controller startup, or if it happens even when a controller has ben running for some time. KIAM is notorious for having auth errors during initial startup of a pod. If errors happen only at that time, if you would benefit from a different approach (e.g. auth check retry at startup instead of every reconciliation).

This is not happening during controller bootstrap. It happens very sporadically actually. However ArgoCD 2.4.6 for some reason increased the occurrence of this type of error. Before we had ~3 Unauthorized a day and now we are having ~20. FYI, we already deployed a patched version with this retry in our servers and the errors are gone.

Copy link
Member

@jannfis jannfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

},
}
for _, tc := range testCases {
tc := tc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a stupid question, is that some special pattern I don't know of or just a mistake?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It copies the test case so things don't get weird in the parallelized tests.

Copy link
Contributor Author

@leoluz leoluz Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a mistake :) This time at least :P

This required when doing table driven tests with t.Parallel().
Basically t.Parallel() will run the test in a goroutine and not wait for it to return. If we don't do this we are basically just executing the first test case and skipping all others.
This is the full explanation:
https://gist.github.com/posener/92a55c4cd441fc5e5e85f27bca008721

@jessesuen
Copy link
Member

and realized that we are calling this API ~1M times a day

Woah! that is definitely not supposed to happen. Discovery was always intended to be cached. I'm pretty sure we used to be doing this at a package level, or we might have assumed/expected this to happen in underlying k8s client-go. If this is not happening we absolutely should be doing it.

However I believe that we could be a lot more aggressive and cache this response for the entire controller life cycle.

I believe it must refresh as CRDs are introduced and removed. Also if control plane is upgraded from underneath the controller.

Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I bet a big part of the resolution was just by reducing the sheer number of API calls we were making as part of sync. I'll file separate issue on caching API resource discovery.

@leoluz leoluz merged commit de08ec2 into argoproj:release-0.7 Aug 19, 2022
leoluz added a commit to leoluz/gitops-engine that referenced this pull request Aug 19, 2022
…goproj#449)

* fix: retry on unauthorized when retrieving resources by gvk

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* add test case to validate retry is just invoked if error is Unauthorized

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>
leoluz added a commit that referenced this pull request Aug 19, 2022
* fix: retry on unauthorized error when retrieving resources by gvk (#449)

* fix: retry on unauthorized when retrieving resources by gvk

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* add test case to validate retry is just invoked if error is Unauthorized

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix merge conflict

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix lint

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix lint

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>
leoluz added a commit to leoluz/gitops-engine that referenced this pull request Aug 19, 2022
…goproj#452)

* fix: retry on unauthorized error when retrieving resources by gvk (argoproj#449)

* fix: retry on unauthorized when retrieving resources by gvk

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* add test case to validate retry is just invoked if error is Unauthorized

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix merge conflict

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix lint

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* Fix lint

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>
leoluz added a commit that referenced this pull request Aug 19, 2022
* fix: retry on unauthorized when retrieving resources by gvk

Signed-off-by: Leonardo Luz Almeida <[email protected]>

* add test case to validate retry is just invoked if error is Unauthorized

Signed-off-by: Leonardo Luz Almeida <[email protected]>

Signed-off-by: Leonardo Luz Almeida <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants