Don't deregister checks on stopped pods #384
Conversation
That's interesting that it doesn't wait. I'm curious whether you saw that behavior in your testing, because the kube docs say that kubelet will wait for the preStop hook to complete, up to the grace period seconds, before sending SIGTERM to all containers. Docs for
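For reference, here's a minimal sketch (using recent k8s.io/api types; the image and command are hypothetical, not this project's actual manifest) of a pod spec where kubelet runs the preStop hook and waits for it to finish, up to the grace period, before sending SIGTERM:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// examplePodSpec is illustrative only: kubelet runs the PreStop hook and waits
// for it to complete, up to TerminationGracePeriodSeconds, before sending
// SIGTERM to the containers.
func examplePodSpec() corev1.PodSpec {
	grace := int64(30)
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // hypothetical image
			Lifecycle: &corev1.Lifecycle{
				PreStop: &corev1.LifecycleHandler{
					Exec: &corev1.ExecAction{
						// hypothetical command; the point is that the service is
						// deregistered from the local Consul agent during shutdown
						Command: []string{"/bin/sh", "-c", "consul services deregister /consul/service.hcl"},
					},
				},
			},
		}},
	}
}
```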
🔥
Needs a little merge love yet, but looks good, great work
@ishustava you're totally right, I misread the docs!
3aa61c9 to 5dfcc30
There are cases where a pod shutting down causes us to log errors about not being able to register a health check. These should be warnings instead, because it's not a situation we can recover from.
Looks great!
@@ -212,6 +220,11 @@ func (h *HealthCheckResource) registerConsulHealthCheck(client *api.Client, cons
		},
	})
	if err != nil {
		// Full error looks like:
		// Unexpected response code: 500 (ServiceID "consulnamespace/svc-id" does not exist)
		if strings.Contains(err.Error(), fmt.Sprintf("%s\" does not exist", serviceID)) {
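For context, here's a self-contained sketch of the idea in that branch (hypothetical function name and logger, not necessarily the PR's exact code): when the agent reports that the service no longer exists, the pod is shutting down, so log a warning and stop retrying instead of surfacing an error.

```go
package healthcheck

import (
	"fmt"
	"log"
	"strings"
)

// handleRegisterError is a sketch of the error handling shown in the diff above.
// A "does not exist" response means the service was already deregistered (the
// pod is terminating), which is expected and not recoverable, so it is logged
// as a warning rather than returned as an error.
func handleRegisterError(err error, serviceID string) error {
	if err == nil {
		return nil
	}
	// Full error looks like:
	// Unexpected response code: 500 (ServiceID "consulnamespace/svc-id" does not exist)
	if strings.Contains(err.Error(), fmt.Sprintf("%s\" does not exist", serviceID)) {
		log.Printf("[WARN] service %q no longer exists, skipping health check registration", serviceID)
		return nil
	}
	return fmt.Errorf("registering health check for service %q: %w", serviceID, err)
}
```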
todo: what if the service was deregistered due to agent restart? Look into this.
@kschoche so in this case it will miss the event and re-register after the reconcile period. I don't think this change makes it worse, though, because the retries (that happened before this change) happen so fast that they also wouldn't catch the service re-registering.
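To illustrate the fallback being relied on here, a minimal sketch (hypothetical names, not this project's controller) of a periodic reconcile pass that repairs a missed event, such as a service deregistered during an agent restart, on its next run:

```go
package main

import (
	"context"
	"log"
	"time"
)

// reconcileLoop periodically re-runs the reconcile function so that state
// missed by an event (e.g. a deregistration during an agent restart) is
// repaired on the next pass rather than being lost forever.
func reconcileLoop(ctx context.Context, period time.Duration, reconcile func() error) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := reconcile(); err != nil {
				log.Printf("[WARN] reconcile failed: %s", err)
			}
		}
	}
}
```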
Co-authored-by: Iryna Shustava <[email protected]>
The ordering for stopping pods is:
In step 4, previously we would then try to update the status of the health check to failing because the pod is no longer ready. This would result in an error because the service was deregistered in step 1.
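For reference, a sketch of the kind of Consul agent check-registration call where that error surfaces (identifiers and the TTL are illustrative, not this project's actual values): once the service was deregistered in step 1, the agent rejects the registration with "Unexpected response code: 500 (ServiceID \"...\" does not exist)".

```go
package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// registerCheck registers a TTL check against an existing service instance on
// the local agent. If the service has already been deregistered, the agent
// returns a 500 "ServiceID ... does not exist" error.
func registerCheck(client *api.Client, serviceID, checkID string) error {
	err := client.Agent().CheckRegister(&api.AgentCheckRegistration{
		ID:        checkID,
		Name:      "Kubernetes Health Check",
		ServiceID: serviceID,
		AgentServiceCheck: api.AgentServiceCheck{
			TTL:    "100000h",
			Status: api.HealthPassing,
		},
	})
	if err != nil {
		return fmt.Errorf("registering check for service %q: %w", serviceID, err)
	}
	return nil
}
```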
How I've tested this PR:
How I expect reviewers to test this PR:
ghcr.io/lkysow/consul-k8s-dev:nov06-hc-term
Checklist:
Status after termination