
Cleanly draining envoy with DaemonSet/NLB deployment #145

Closed
cmaloney opened this issue Jan 13, 2018 · 18 comments · Fixed by #2227
Labels
area/deployment Issues or PRs related to deployment tooling or infrastructure. blocked/needs-product Categorizes the issue or PR as blocked because it needs a decision from product management. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@cmaloney

Sometimes it's necessary to re-deploy newer versions of contour / envoy into the cluster, which involves killing off the running containers and replacing them with new ones.

If I were running with an AWS ALB or classic load balancer, I could probably add HTTP health checks against the envoy health check endpoint, plus pod lifecycle hooks to make envoy start failing its health check, so it can be cleanly removed from the load balancer without losing traffic.

To get HTTP/2 and minimal hops, my deployment uses the DaemonSet / NLB deployment of contour and envoy. The NLB does TCP forwarding to contour, and it currently health checks the same port it forwards traffic to (8080 or 8443). When I want or need to replace envoy, the pod is killed and envoy stops listening, but not far enough in advance for the AWS NLB to fail its health checks and redirect traffic to the other hosts, so some traffic is lost.

I'd like to help close this gap so that envoy and contour can be upgraded safely and cleanly without loss of traffic.

Two thoughts on how to do this:

  1. Figure out how to replace the pods using envoy's "hot restart" machinery (some form of https://www.envoyproxy.io/docs/envoy/latest/operations/hot_restarter) without dropping any traffic
  2. Teach contour to start an additional listener on a configurable port which acts as the "healthcheck" listener. All AWS NLBs get configured to use that port to decide whether a given instance should receive traffic. When contour receives a SIGTERM as part of its lifecycle hooks (https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods), it shuts down this extra listener, which should cause health checks to fail and the instance to be removed from new traffic through the AWS NLB. After a configurable number of seconds from receiving SIGTERM, contour should put envoy into its "drain down" state, where it stops accepting incoming connections and drains every existing connection. Once the pod's grace period finally runs out, or envoy is fully drained, the pod is killed and any remaining connections are cut off non-gracefully.

I think the second option is easier to implement and get working correctly in all cases than the first. With the first, I see potential issues with the envoy<->contour communication conflicting between "new" and old pods, as well as with how to cleanly roll out a replacement of an existing DaemonSet.
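To make option 2 concrete, here is a rough sketch of how the Kubernetes side might be wired up. The --healthcheck-listener-port flag below is purely hypothetical (contour has no such flag); everything else is standard DaemonSet pod spec configuration:

# Hypothetical sketch of option 2: contour exposes a healthcheck-only listener that
# the AWS NLB target group health check points at; contour would close this listener
# on SIGTERM so the node drops out of rotation before envoy starts draining.
spec:
  terminationGracePeriodSeconds: 120          # upper bound on the whole drain
  containers:
  - name: contour
    args:
    - serve
    - --healthcheck-listener-port=8002        # hypothetical flag, illustration only
    ports:
    - name: healthcheck
      containerPort: 8002
      hostPort: 8002                          # NLB health checks target this port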

@davecheney
Contributor

Thanks for raising this issue. There's a PR in flight for adding a healthz endpoint to envoy and contour, #135.

@davecheney davecheney added kind/feature Categorizes issue or PR as related to a new feature. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jan 13, 2018
@cmaloney
Author

Healthz alone wouldn't solve it with an NLB, though, since only TCP health checks are available for a TCP NLB, but healthz on an extra port could definitely work.

@davecheney davecheney added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. blocked/needs-product Categorizes the issue or PR as blocked because it needs a decision from product management. and removed good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 12, 2019
@davecheney
Contributor

This issue has been p1 without a milestone for over a year. I need to get product input before we can prioritise it.

@davecheney davecheney added this to the 1.0.0.beta.1 milestone Jun 18, 2019
@stevesloka
Member

We do now have the preStop hooks enabled. When an Envoy pod is starting to shut down, the hook should fire telling Envoy to drain all connections: https://github.com/heptio/contour/blob/master/examples/ds-hostnet-split/03-envoy.yaml#L60-L63

@rochacon
Contributor

rochacon commented Jul 18, 2019

@stevesloka I've been testing with this recently and it seems like only telling Envoy to stop accepting connections is not enough, since Kubernetes will send the SIGTERM immediately after running the preStop hook.

I've somewhat found a stable solution with this configuration changes:

  • Configured terminationGracePeriodSeconds: 60 for the Envoy pod. This is the drain limit I want to support in my cluster.
  • Added --drain-time-s 60 to the Envoy container (the default is 600).
  • In the Envoy container, configured readinessProbe.periodSeconds: 3, resulting in more frequent readiness checks.
  • Added sleep 20 before and sleep 40 after the Envoy preStop command [1]. The initial sleep is to give time between rejecting connections and the propagation of the pod's Terminating status. Then we tell Envoy to start rejecting connections by failing health checks. The final sleep holds SIGTERM until the pod is safely out of the Service Endpoints and the external LB health checks.
  • Configured Service.spec.externalTrafficPolicy: Local. This configures a kubelet-managed port (service.spec.healthCheckNodePort) that returns HTTP 200 if the Envoy pod is running on that node. @cmaloney this config has the behavior you described in Option 2; it works for both LoadBalancer and NodePort services, and it is an easy way to avoid the extra network hop and keep the original connection IP.
  • Split the Contour and Envoy pods.
  • Reduced the AWS LB health check intervals to the minimum. IIRC, I'm using 2 unhealthy checks as the threshold with a 5s period. The AWS LB target group draining settings must be less than or equal to your draining timeout (in my case, 60s).

With this I managed to get zero-downtime Envoy rollouts on top of an AWS NLB in a few simple tests (running hey against the NLB; the service behind Envoy is a pod that sleeps for 1s before returning the HTTP response).

[1]

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20 ; wget -qO- --post-data='' http://127.0.0.1:9001/healthcheck/fail; sleep 40"]

@stevesloka
Member

Fantastic notes @rochacon!! Would you like to contribute this back as a doc or update our examples? If not, I'd like to add this to the repo to help out others.

@cmaloney do you see this as maybe solving your original issue?

@rochacon
Contributor

@stevesloka sure, I'm up for it. I didn't do it before because I think it is still a bit ugly and I want to do more tests.

@cmaloney
Author

I'm not using contour in my current environments (I'm now at a different company than where I was using it), so I don't really have the pieces to test this.

@rochacon
Contributor

I've been testing the mentioned configurations in an attempt to clean up and confirm the minimum required steps; I still need a few more checks before submitting the patch.

During the tests, I found out that NLBs provisioned through Kubernetes Service objects have fixed health check settings (30s interval, 3-unhealthy threshold, tested on v1.12.10); the current annotations for customizing this behavior are ignored, which I believe is due to the alpha/beta state of the NLB integration. I'm not sure whether this has improved in more recent Kubernetes versions or when using the external cloud-provider. Configuring the Envoy service as a NodePort and managing the NLB externally is one way to customize these settings (I use this setup because I change Envoy pools and prefer to avoid replacing the NLB, swapping DNS, waiting for propagation, etc.).

Right now, I've achieved good results with the following settings on top of current examples:

  • Set terminationGracePeriodSeconds: 120 on the Envoy pod
  • Add sleep 118 after the Envoy /healthcheck/fail notification [1]
  • Add --drain-time-s 60 to the Envoy container (my desired request time limit is 60s, regardless of Envoy rollouts)

This gives Kubernetes components and the NLB enough time to notice the unready status and remove the node/pod from the target group, and it delays the SIGTERM that will shut down Envoy. It also follows Lyft's rollout practices as described in envoyproxy/envoy#1990 (comment).

[1]

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "wget -qO- --post-data='' http://127.0.0.1:9001/healthcheck/fail; sleep 118"]

@alexbrand
Contributor

I have been experimenting with this as well, and I have a proposal for properly draining Envoy that builds on @rochacon's suggestion.

Currently in master, we have a preStop hook in the Envoy container that POSTs to the /healthcheck/fail endpoint. This tells Envoy that it needs to start draining connections. In addition, the failing healthcheck triggers the readiness probe to fail, which in turn tells the system to stop sending traffic to the Envoy that is shutting down.

The problem with the current preStop hook is that it does not wait until Envoy has drained all the connections. To address this, I think we could leverage the stats endpoint to create a while loop that sleeps until all the connections have been drained. Once the connections have been drained, the preStop hook returns and lets Kubernetes continue the pod shutdown process.

Curious what everyone thinks of this idea. I have a PoC that I'd be happy to contribute.

@davecheney
Contributor

davecheney commented Aug 22, 2019 via email

@youngnick
Member

I really like the stats watch idea; it's much better than guessing at timeouts. I'd be very interested to see the PoC.

@alexbrand
Contributor

Yeah, I totally agree.

What I ended up doing was adding a sidecar to handle this. The sidecar sends the POST request to Envoy, and then starts watching the stats endpoint. Once the drain is complete, it creates a file in the already-existing config volume which signals Envoy that the drain is complete.

      - name: graceful-terminator
        image: busybox
        command:
          - ash
          - -c
          - while [[ ! -f /config/drain-complete ]]; do sleep 1; done; echo "Drain Complete"
        lifecycle:
          preStop:
            exec:
              command:
                - ash
                - -c
                - >
                  wget -q -O- --post-data "" localhost:9001/healthcheck/fail;
                  while [[ $(wget -q -O- localhost:9001/stats | grep http.ingress_http.downstream_cx_active | awk '{print $2}') != 0 ]];
                  do echo "waiting until active connections are drained"; sleep 1; done;
                  touch /config/drain-complete;
        volumeMounts:
          - mountPath: /config
            name: contour-config

The envoy preStop hook waits until the file appears:

        lifecycle:
          preStop:
            exec:
              command:
              - bash
              - -c
              - while [[ ! -f /config/drain-complete ]]; do echo "waiting for drain confirmation"; sleep 1; done;

The end result is:

  1. Envoy remains running until the preStop hook exits, letting requests/connections drain gracefully
  2. Envoy's preStop hook exits when it receives the signal from the termination sidecar
  3. The termination sidecar sends the signal when the number of active connections is zero
  4. No new connections are established because the readinessProbe is failing

All of this is governed by the terminationGracePeriodSeconds parameter in the pod spec.

Would love to upstream this if we think it is a reasonable approach.

@stevesloka
Member

Would a small Go program make this implementation simpler? I like the idea, but I worry about all the code inline in the YAML definition. We could add tests, etc. to the Go code, and maybe open it up for other types of hooks that we might need in the future to manage Envoy's state.

@alexbrand
Contributor

That is certainly another option!

@youngnick
Member

I like the use of the config volume for signal-passing; that's neat.

I agree that a small Go program that handles SIGTERM and then watches the Envoy stats until it's drained would be very clean, but doing it in bash is definitely a good start.

@davecheney
Contributor

davecheney commented Aug 28, 2019 via email

@davecheney davecheney modified the milestones: 1.0.0-beta.1, 1.0.0-rc.1 Sep 5, 2019
@davecheney davecheney removed this from the 1.0.0-rc.1 milestone Sep 29, 2019
@davecheney davecheney added this to the 1.0.0-rc.2 milestone Sep 29, 2019
@youngnick
Member

Next steps for this issue:

  • Investigate using a Go program (or a Contour subcommand) for draining Envoy.
  • Investigate manipulating LDS tables for draining too.

On to the backlog for now.
