Don't deregister checks on stopped pods #384

Merged (3 commits) on Nov 10, 2020

@lkysow (Member) commented Nov 6, 2020

The ordering for stopping pods is:

  1. Kube invokes the preStop hook
  2. Kube sends SIGTERM to all containers
  3. Kube waits for the grace period, then sends SIGKILL to any containers that have not stopped
  4. Kube sends an event to the reconciler in which all containers have the state "terminated"

In step 4, we previously tried to update the status of the health check to failing because the pod was no longer ready. That update resulted in an error because the service had already been deregistered in step 1.

How I've tested this PR:

  • I investigated the state of the pods throughout the lifecycle and confirmed the state when the pod has been terminated. See JSON below.

How I expect reviewers to test this PR:

  • You can use image ghcr.io/lkysow/consul-k8s-dev:nov06-hc-term
  • Start and stop a connect pod
  • Observe no errors in the logs

Checklist:

  • Tests added

Status after termination:
 "status": {
    "phase": "Running",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:42:37Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:43:32Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [static-client consul-connect-envoy-sidecar consul-connect-lifecycle-sidecar]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:43:32Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [static-client consul-connect-envoy-sidecar consul-connect-lifecycle-sidecar]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-05T23:42:25Z"
      }
    ],
    "hostIP": "172.18.0.2",
    "podIP": "10.244.0.19",
    "podIPs": [
      {
        "ip": "10.244.0.19"
      }
    ],
    "startTime": "2020-11-05T23:42:25Z",
    "initContainerStatuses": [
      {
        "name": "consul-connect-inject-init",
        "state": {
          "terminated": {
            "exitCode": 0,
            "reason": "Completed",
            "startedAt": "2020-11-05T23:42:26Z",
            "finishedAt": "2020-11-05T23:42:36Z",
            "containerID": "containerd://c27720e056a123f621b289ae3dd03b97223fc2bef9f9cd858140f6449f025d93"
          }
        },
        "lastState": {},
        "ready": true,
        "restartCount": 0,
        "image": "docker.io/library/consul:1.8.4",
        "imageID": "docker.io/library/consul@sha256:4cc02f91a918f08655b39c4369b65929013525cc020f01dadee6b0ec4cd972f5",
        "containerID": "containerd://c27720e056a123f621b289ae3dd03b97223fc2bef9f9cd858140f6449f025d93"
      }
    ],
    "containerStatuses": [
      {
        "name": "consul-connect-envoy-sidecar",
        "state": {
          "terminated": {
            "exitCode": 0,
            "reason": "Completed",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:01Z",
            "containerID": "containerd://8c549db9a180a73b117db0b16bdc4bcc100b58f443af0be36708c48a7d4b4bf7"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "docker.io/envoyproxy/envoy-alpine:v1.14.4",
        "imageID": "docker.io/envoyproxy/envoy-alpine@sha256:9442ccf7b335045ab5ef596e26bec978b8ef46f8bce1dfeb19382c3f40208eb5",
        "containerID": "containerd://8c549db9a180a73b117db0b16bdc4bcc100b58f443af0be36708c48a7d4b4bf7",
        "started": false
      },
      {
        "name": "consul-connect-lifecycle-sidecar",
        "state": {
          "terminated": {
            "exitCode": 2,
            "reason": "Error",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:01Z",
            "containerID": "containerd://6250bcc99ef9949e673900ad4ea542e4c86f45f897d9d644fcd4e4c769a088d2"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "ghcr.io/lkysow/consul-k8s-dev:nov04-early-reg3",
        "imageID": "sha256:228a53bab5c281bf1fd964a0dbed124be88b8f03340eae6b55a67838ab55464a",
        "containerID": "containerd://6250bcc99ef9949e673900ad4ea542e4c86f45f897d9d644fcd4e4c769a088d2",
        "started": false
      },
      {
        "name": "static-client",
        "state": {
          "terminated": {
            "exitCode": 137,
            "reason": "Error",
            "startedAt": "2020-11-05T23:42:40Z",
            "finishedAt": "2020-11-05T23:43:31Z",
            "containerID": "containerd://2efa28c88c73db86a3876a6804fe37bed2131b1a115840d9b23ea9fee268bdce"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "docker.io/tutum/curl:latest",
        "imageID": "sha256:1d133bc81b5f22bfcec1409c3154a53b8b15c89f78221117ca5422c77e080cf8",
        "containerID": "containerd://2efa28c88c73db86a3876a6804fe37bed2131b1a115840d9b23ea9fee268bdce",
        "started": false
      }
    ],
    "qosClass": "Burstable"
  }
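
The status dump above reports "phase": "Running" even though every entry in containerStatuses has a "terminated" state, so the phase alone does not identify a stopped pod. The following is a minimal sketch of detecting that condition from the JSON. It is not taken from the PR: the types `podStatus` and `containerStatus` and the functions `parsePodStatus` and `allContainersTerminated` are hypothetical stand-ins that model only the fields visible in the dump, and the real code presumably works with the Kubernetes API types instead.

```go
package main

import "encoding/json"

// containerStatus is a hypothetical, trimmed-down model of one entry in
// "containerStatuses". Only the fields shown in the dump are included.
type containerStatus struct {
	Name  string `json:"name"`
	State struct {
		// Terminated is nil unless the container has a "terminated" state.
		Terminated *struct {
			ExitCode int    `json:"exitCode"`
			Reason   string `json:"reason"`
		} `json:"terminated"`
	} `json:"state"`
}

// podStatus is a hypothetical model of the "status" object.
type podStatus struct {
	Phase             string            `json:"phase"`
	ContainerStatuses []containerStatus `json:"containerStatuses"`
}

// parsePodStatus decodes the contents of a "status" object.
func parsePodStatus(raw []byte) (podStatus, error) {
	var s podStatus
	err := json.Unmarshal(raw, &s)
	return s, err
}

// allContainersTerminated reports whether every regular container has a
// "terminated" state. An empty list returns false so that a pod with no
// reported statuses is not mistaken for a stopped one.
func allContainersTerminated(s podStatus) bool {
	if len(s.ContainerStatuses) == 0 {
		return false
	}
	for _, c := range s.ContainerStatuses {
		if c.State.Terminated == nil {
			return false
		}
	}
	return true
}
```

Under these assumptions, a reconciler could skip the health check update when `allContainersTerminated` returns true, instead of relying on the pod phase.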

@lkysow lkysow requested review from kschoche, a team and ishustava and removed request for a team November 6, 2020 18:54
Base automatically changed from health-checks-early-registration to master November 6, 2020 19:55
@ishustava (Contributor)

Kube sends SIGTERM's to all containers (doesn't wait for 1)

It's interesting that it doesn't wait. I'm curious whether you saw that behavior in your testing, because the kube docs say that the kubelet waits for preStop to complete, up to the grace period, before sending SIGTERM to all containers.

The docs for the preStop hook also mention that the call is synchronous:

It [the call to the preStop hook] is blocking, meaning it is synchronous, so it must complete before the signal to stop the container can be sent.

@kschoche (Contributor) left a comment

🔥
Needs a little merge love yet, but looks good, great work

@lkysow (Member, Author) commented Nov 9, 2020

@ishustava you're totally right, I misread the docs!

@lkysow lkysow force-pushed the healthchecks-dereg branch from 3aa61c9 to 5dfcc30 Compare November 9, 2020 19:18
There are cases where a pod shutting down will cause us to log errors
about not being able to register a health check. Instead these should be
warnings because it's not a situation we can recover from.
@ishustava (Contributor) left a comment

Looks great!

connect-inject/health_check_resource.go
@@ -212,6 +220,11 @@ func (h *HealthCheckResource) registerConsulHealthCheck(client *api.Client, cons
        },
    })
    if err != nil {
        // Full error looks like:
        // Unexpected response code: 500 (ServiceID "consulnamespace/svc-id" does not exist)
        if strings.Contains(err.Error(), fmt.Sprintf("%s\" does not exist", serviceID)) {
@lkysow (Member, Author):

todo: what if the service was deregistered due to agent restart? Look into this.

@lkysow (Member, Author):

@kschoche so in this case it will miss the event and re-register after the reconcile period. I don't think this change makes it worse, though, because the retries (which happened before this change) happen so fast that they also wouldn't catch the service re-registering.
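
For readers following the thread, the substring check from the diff hunk can be isolated into a small helper. This is a sketch only: `isServiceMissingErr`, `logLevelFor`, and `sampleErr` are hypothetical names that do not appear in the PR, and the mapping to a "warn" level is an assumption based on the commit message about downgrading these errors to warnings. The error text is copied from the comment in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sampleErr reproduces the error text quoted in the diff comment.
var sampleErr = errors.New(`Unexpected response code: 500 (ServiceID "consulnamespace/svc-id" does not exist)`)

// isServiceMissingErr is a hypothetical helper (not a function from the PR).
// It reports whether err looks like the "ServiceID ... does not exist"
// response for serviceID, using the same substring match as the diff.
func isServiceMissingErr(err error, serviceID string) bool {
	if err == nil {
		return false
	}
	return strings.Contains(err.Error(), fmt.Sprintf("%s\" does not exist", serviceID))
}

// logLevelFor is also hypothetical. It picks a log level for a failed
// health check registration: "warn" when the service is already gone,
// "error" for any other failure, and "none" when there is no error.
func logLevelFor(err error, serviceID string) string {
	switch {
	case err == nil:
		return "none"
	case isServiceMissingErr(err, serviceID):
		return "warn"
	default:
		return "error"
	}
}
```

With the sample error, `logLevelFor(sampleErr, "svc-id")` yields "warn". Because the match is on a suffix of the quoted ID, any service ID that is a suffix of the one in the message (for example "id") would also match, which may be worth keeping in mind when relying on string matching of this kind.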

@lkysow lkysow merged commit 2b86fdd into master Nov 10, 2020
@lkysow lkysow deleted the healthchecks-dereg branch November 10, 2020 21:45