-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry loop for client.get of replicaset as that sometimes fails #1072
Conversation
Signed-off-by: Kevin Earls <[email protected]>
} | ||
|
||
// use a retry loop to get the Deployment. A single call to client.get fails occasionally | ||
err := retry.OnError(backOff, checkError, getReplicaSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the same retry approach for other objects as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could. Do you have specific places you'd want to do this? So far this is the only place I've seen failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right this is the only place that uses k8s client in this function. I don't know any other place.
pkg/instrumentation/sdk.go
Outdated
Name: owner.Name, | ||
}, &rs) | ||
nsn := types.NamespacedName{Namespace: ns.Name, Name: owner.Name} | ||
backOff := wait.Backoff{Duration: 10 * time.Millisecond, Factor: 1.5, Jitter: 0.1, Steps: 20, Cap: 30 * time.Second} // TODO decide which of these we need and what they should be set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the pod webhook timeout? cap of 30 seconds seems too much TBH.
Could we extract the retry functionality to a separate function? It will be useful if we want to reuse it for other objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pod webhook timeout is 10 seconds according to https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max wait 10 seconds is too much. Could we use e.g. 2-3 seconds?
pkg/instrumentation/sdk.go
Outdated
|
||
checkError := func(err error) bool { | ||
// if the error looks like 'ReplicaSet.apps "my-deployment-with-sidecar-f46b479f" not found' ignore it | ||
if strings.HasPrefix(err.Error(), "ReplicaSet.apps") && strings.HasSuffix(err.Error(), "not found") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a better way to compare the error? e.g. apierrors.IsNotFound(err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try that. It may take a while to make sure I get an error...
724e284
to
6280ca4
Compare
Signed-off-by: Kevin Earls <[email protected]>
Signed-off-by: Kevin Earls <[email protected]>
Signed-off-by: Kevin Earls <[email protected]>
@pavolloffay if you're still around... |
@kevinearls the PR looks good to me! :) but the cap of 10s seems too high. The timeout of the webhook is 10s, retrying for 10s and blocking the pod creation does not feel right. Could we start with e.g. 2-3s and see if that would work? |
Signed-off-by: Kevin Earls <[email protected]>
…pen-telemetry#1072) * Add retry loop for client.get of replicaset as that sometimes fails Signed-off-by: Kevin Earls <[email protected]> * Reduce total timeout, remove TODO Signed-off-by: Kevin Earls <[email protected]> * Use apierrors to check error Signed-off-by: Kevin Earls <[email protected]> * Appease the linter Signed-off-by: Kevin Earls <[email protected]> * Lower meximum wait time to 2 seconds Signed-off-by: Kevin Earls <[email protected]> Signed-off-by: Kevin Earls <[email protected]>
Signed-off-by: Kevin Earls [email protected]
This fixes some of the test failures we occasionally see in CI as described in #959
There are others which I believe are related to acquiring a leader lease taking more than 2.5 minutes. I will add more information about that to #959
Resolves #959