Don't deregister checks on stopped pods #384

lkysow · 2020-11-06T18:49:35Z

The ordering for stopping pods is:

1. Kube invokes the preStop hook
2. Kube sends SIGTERM to all containers
3. Kube waits for gracePeriod, then sends SIGKILL to any containers that have not stopped
4. Kube sends an event to the reconciler in which all containers have the state "terminated"

In step 4, we would previously try to update the status of the health check to failing because the pod is no longer ready. This resulted in an error because the service had already been deregistered in step 1.

How I've tested this PR:

I investigated the state of the pods throughout the lifecycle and confirmed the state when the pod has been terminated. See JSON below.

How I expect reviewers to test this PR:

You can use image ghcr.io/lkysow/consul-k8s-dev:nov06-hc-term
Start and stop a connect pod
Observe no errors in the logs

Checklist:

Tests added

Status after termination

  "status": {
    "phase": "Running",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:42:37Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:43:32Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [static-client consul-connect-envoy-sidecar consul-connect-lifecycle-sidecar]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:43:32Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [static-client consul-connect-envoy-sidecar consul-connect-lifecycle-sidecar]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:42:25Z"
      }
    ],
    "hostIP": "172.18.0.2",
    "podIP": "10.244.0.19",
    "podIPs": [
      {
        "ip": "10.244.0.19"
      }
    ],
    "startTime": "2020-11-05T23:42:25Z",
    "initContainerStatuses": [
      {
        "name": "consul-connect-inject-init",
        "state": {
          "terminated": {
            "exitCode": 0,
            "reason": "Completed",
            "startedAt": "2020-11-05T23:42:26Z",
            "finishedAt": "2020-11-05T23:42:36Z",
            "containerID": "containerd://c27720e056a123f621b289ae3dd03b97223fc2bef9f9cd858140f6449f025d93"
          }
        },
        "lastState": {},
        "ready": true,
        "restartCount": 0,
        "image": "docker.io/library/consul:1.8.4",
        "imageID": "docker.io/library/consul@sha256:4cc02f91a918f08655b39c4369b65929013525cc020f01dadee6b0ec4cd972f5",
        "containerID": "containerd://c27720e056a123f621b289ae3dd03b97223fc2bef9f9cd858140f6449f025d93"
      }
    ],
    "containerStatuses": [
      {
        "name": "consul-connect-envoy-sidecar",
        "state": {
          "terminated": {
            "exitCode": 0,
            "reason": "Completed",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:01Z",
            "containerID": "containerd://8c549db9a180a73b117db0b16bdc4bcc100b58f443af0be36708c48a7d4b4bf7"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "docker.io/envoyproxy/envoy-alpine:v1.14.4",
        "imageID": "docker.io/envoyproxy/envoy-alpine@sha256:9442ccf7b335045ab5ef596e26bec978b8ef46f8bce1dfeb19382c3f40208eb5",
        "containerID": "containerd://8c549db9a180a73b117db0b16bdc4bcc100b58f443af0be36708c48a7d4b4bf7",
        "started": false
      },
      {
        "name": "consul-connect-lifecycle-sidecar",
        "state": {
          "terminated": {
            "exitCode": 2,
            "reason": "Error",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:01Z",
            "containerID": "containerd://6250bcc99ef9949e673900ad4ea542e4c86f45f897d9d644fcd4e4c769a088d2"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "ghcr.io/lkysow/consul-k8s-dev:nov04-early-reg3",
        "imageID": "sha256:228a53bab5c281bf1fd964a0dbed124be88b8f03340eae6b55a67838ab55464a",
        "containerID": "containerd://6250bcc99ef9949e673900ad4ea542e4c86f45f897d9d644fcd4e4c769a088d2",
        "started": false
      },
      {
        "name": "static-client",
        "state": {
          "terminated": {
            "exitCode": 137,
            "reason": "Error",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:31Z",
            "containerID": "containerd://2efa28c88c73db86a3876a6804fe37bed2131b1a115840d9b23ea9fee268bdce"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "docker.io/tutum/curl:latest",
        "imageID": "sha256:1d133bc81b5f22bfcec1409c3154a53b8b15c89f78221117ca5422c77e080cf8",
        "containerID": "containerd://2efa28c88c73db86a3876a6804fe37bed2131b1a115840d9b23ea9fee268bdce",
        "started": false
      }
    ],
    "qosClass": "Burstable"
  }
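
Note that the phase above is still "Running" even though every entry in containerStatuses is terminated. As a minimal sketch of how a reconciler could recognize that state, the Go below decodes a status document and checks the container states. The types `podStatus` and `containerStatus` and the helpers `parsePodStatus` and `allContainersTerminated` are hypothetical names made up for illustration, and the sketch assumes the fragment is wrapped in an outer `{ ... }` object. Real code would use the `k8s.io/api/core/v1` types, and this is not necessarily the check the PR itself performs.

```go
package main

import "encoding/json"

// containerStatus is a hypothetical, trimmed-down mirror of the fields shown
// in the JSON above. Terminated is nil when the container has no
// "terminated" state.
type containerStatus struct {
	Name  string `json:"name"`
	State struct {
		Terminated *struct {
			ExitCode int    `json:"exitCode"`
			Reason   string `json:"reason"`
		} `json:"terminated"`
	} `json:"state"`
}

// podStatus is a hypothetical subset of the pod status object.
type podStatus struct {
	Phase             string            `json:"phase"`
	ContainerStatuses []containerStatus `json:"containerStatuses"`
}

// parsePodStatus decodes a document of the form {"status": {...}}.
func parsePodStatus(raw []byte) (podStatus, error) {
	var wrapper struct {
		Status podStatus `json:"status"`
	}
	err := json.Unmarshal(raw, &wrapper)
	return wrapper.Status, err
}

// allContainersTerminated reports whether every regular (non-init) container
// has a "terminated" state. An empty list returns false so that a pod whose
// statuses are not populated yet is not mistaken for a stopped pod.
func allContainersTerminated(s podStatus) bool {
	if len(s.ContainerStatuses) == 0 {
		return false
	}
	for _, c := range s.ContainerStatuses {
		if c.State.Terminated == nil {
			return false
		}
	}
	return true
}
```

A caller could skip the health check update when `allContainersTerminated` returns true, rather than relying on the phase field, which the output above shows can lag behind.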

ishustava · 2020-11-07T03:01:47Z

Kube sends SIGTERM's to all containers (doesn't wait for 1)

It's interesting that it doesn't wait. Did you see that behavior in your testing? I ask because the Kubernetes docs say that the kubelet waits for preStop to complete, up to the grace period, before sending SIGTERM to all containers.

The docs for the preStop hook also mention that the call is synchronous:

It [the call to the preStop hook] is blocking, meaning it is synchronous, so it must complete before the signal to stop the container can be sent.

kschoche

🔥
Needs a little merge love yet, but looks good, great work

lkysow · 2020-11-09T18:43:56Z

@ishustava you're totally right, I misread the docs!

There are cases where a pod shutting down will cause us to log errors about not being able to register a health check. Instead these should be warnings because it's not a situation we can recover from.

ishustava

Looks great!

lkysow · 2020-11-10T18:36:48Z

connect-inject/health_check_resource.go

@@ -212,6 +220,11 @@ func (h *HealthCheckResource) registerConsulHealthCheck(client *api.Client, cons
 		},
 	})
 	if err != nil {
+		// Full error looks like:
+		// Unexpected response code: 500 (ServiceID "consulnamespace/svc-id" does not exist)
+		if strings.Contains(err.Error(), fmt.Sprintf("%s\" does not exist", serviceID)) {


todo: what if the service was deregistered due to agent restart? Look into this.

@kschoche in this case it will miss the event and re-register after the reconcile period. I don't think this change makes it worse, though, because the retries that happened before this change ran so fast that they also wouldn't catch the service re-registering.

Co-authored-by: Iryna Shustava <[email protected]>

lkysow requested review from kschoche, a team and ishustava and removed request for a team November 6, 2020 18:54

Base automatically changed from health-checks-early-registration to master November 6, 2020 19:55

kschoche approved these changes Nov 9, 2020

Don't deregister checks on stopped pods

5dfcc30

lkysow force-pushed the healthchecks-dereg branch from 3aa61c9 to 5dfcc30 Compare November 9, 2020 19:18

Warn instead of error when svc not registered

275a447

ishustava approved these changes Nov 10, 2020

lkysow commented Nov 10, 2020

Update connect-inject/health_check_resource.go

c35e677

Co-authored-by: Iryna Shustava <[email protected]>

lkysow merged commit 2b86fdd into master Nov 10, 2020

lkysow deleted the healthchecks-dereg branch November 10, 2020 21:45
