-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: retry on unauthorized error when retrieving resources by gvk #449
Conversation
Signed-off-by: Leonardo Luz Almeida <[email protected]>
pkg/sync/sync_context.go
Outdated
serverRes = val | ||
err = nil | ||
} else { | ||
retry.OnError(retry.DefaultRetry, isRetriable, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need an abort condition here, like a maximum retry count. Otherwise it'll retry unless the function succeeds (and in case of a hard authorization error, probably this would be forever).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Agree with @jannfis. You can use the .Cap field in wait.Backoff{} to cap the duration of retry.
-
Why aren't we checking the error return value of retry.OnError()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DefaultRetry is already providing the max retry count:
var DefaultRetry = wait.Backoff{
Steps: 5,
Duration: 10 * time.Millisecond,
Factor: 1.0,
Jitter: 0.1,
}
The unit tests are validating that the retry will do max 5 attempts and if it fails it will return the last error.
The retry is invoking additional attempts only if the returned error is Unauthorized
as implemented in the isRetryable
function:
isRetryable := func(err error) bool {
if apierr.IsUnauthorized(err) {
return true
}
return false
}
Why aren't we checking the error return value of retry.OnError()?
@jessesuen The error is handled by the code outside of the retry call. I kept the same logic there to preserve the current behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jannfis @jessesuen Added one more test-case to validate that retry is just invoked if returned error is Unauthorized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you're right! The default retry has a default limit. Thanks for clarifying that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this change, you are only retrying api discovery. But couldn't the problem happen at any point in time?
I'm curious if your error scenario only happen during controller startup, or if it happens even when a controller has ben running for some time. KIAM is notorious for having auth errors during initial startup of a pod. If errors happen only at that time, if you would benefit from a different approach (e.g. auth check retry at startup instead of every reconciliation).
pkg/sync/sync_context.go
Outdated
serverRes = val | ||
err = nil | ||
} else { | ||
retry.OnError(retry.DefaultRetry, isRetriable, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Agree with @jannfis. You can use the .Cap field in wait.Backoff{} to cap the duration of retry.
-
Why aren't we checking the error return value of retry.OnError()?
Signed-off-by: Leonardo Luz Almeida <[email protected]>
Kudos, SonarCloud Quality Gate passed! |
This is a really good question that was intriguing me while chasing this bug. In fact for all our internal syncs (you know that we have quite a few a day), this is the only point where ArgoCD gets an
This is not happening during controller bootstrap. It happens very sporadically actually. However ArgoCD 2.4.6 for some reason increased the occurrence of this type of error. Before we had ~3 Unauthorized a day and now we are having ~20. FYI, we already deployed a patched version with this retry in our servers and the errors are gone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
}, | ||
} | ||
for _, tc := range testCases { | ||
tc := tc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a stupid question, is that some special pattern I don't know of or just a mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It copies the test case so things don't get weird in the parallelized tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a mistake :) This time at least :P
This required when doing table driven tests with t.Parallel()
.
Basically t.Parallel()
will run the test in a goroutine and not wait for it to return. If we don't do this we are basically just executing the first test case and skipping all others.
This is the full explanation:
https://gist.github.com/posener/92a55c4cd441fc5e5e85f27bca008721
Woah! that is definitely not supposed to happen. Discovery was always intended to be cached. I'm pretty sure we used to be doing this at a package level, or we might have assumed/expected this to happen in underlying k8s client-go. If this is not happening we absolutely should be doing it.
I believe it must refresh as CRDs are introduced and removed. Also if control plane is upgraded from underneath the controller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I bet a big part of the resolution was just by reducing the sheer number of API calls we were making as part of sync. I'll file separate issue on caching API resource discovery.
…goproj#449) * fix: retry on unauthorized when retrieving resources by gvk Signed-off-by: Leonardo Luz Almeida <[email protected]> * add test case to validate retry is just invoked if error is Unauthorized Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]>
* fix: retry on unauthorized error when retrieving resources by gvk (#449) * fix: retry on unauthorized when retrieving resources by gvk Signed-off-by: Leonardo Luz Almeida <[email protected]> * add test case to validate retry is just invoked if error is Unauthorized Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix merge conflict Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix lint Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix lint Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]>
…goproj#452) * fix: retry on unauthorized error when retrieving resources by gvk (argoproj#449) * fix: retry on unauthorized when retrieving resources by gvk Signed-off-by: Leonardo Luz Almeida <[email protected]> * add test case to validate retry is just invoked if error is Unauthorized Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix merge conflict Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix lint Signed-off-by: Leonardo Luz Almeida <[email protected]> * Fix lint Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]>
* fix: retry on unauthorized when retrieving resources by gvk Signed-off-by: Leonardo Luz Almeida <[email protected]> * add test case to validate retry is just invoked if error is Unauthorized Signed-off-by: Leonardo Luz Almeida <[email protected]> Signed-off-by: Leonardo Luz Almeida <[email protected]>
Signed-off-by: Leonardo Luz Almeida [email protected]