Graceful HTTP Connection Draining During Shutdown? #7841

Open · naftulikay opened this issue Aug 6, 2019 · 14 comments
Labels: enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)

@naftulikay

Description:

At my organization, we are preparing a large-scale Envoy deployment to serve all network traffic at the edge, on ingress to our network. We use static configuration files on disk and run the Python hot-restarter script to execute hot reloads whenever the on-disk configuration changes.

Sometimes, we need to stop and start the Envoy process completely.

I was hoping to see clean shutdown functionality similar to how NGINX handles process shutdown. After a shutdown signal, NGINX stops accepting new TCP connections and attempts to complete all in-flight HTTP requests, sending Connection: close in each response so that connections are shut down gracefully before the process stops. This means connections terminate cleanly, without an abrupt TCP close, unless a timeout is exceeded.

In a nutshell, NGINX does this:

  • Receive terminate signal.
  • Stop accepting new TCP connections.
  • Wait for up to $TIMEOUT seconds for HTTP connections to terminate before terminating their TCP sockets.
  • For each existing request, serve Connection: close back in each response to cleanly close each connection.
  • Exit the process(es).

When I restart Envoy, either with SIGTERM or via the admin interface (/quitquitquit), I see TCP connection resets rather than Connection: close. The hot-reload process closes connections properly, but shutdown does not appear to.

Repro steps:

We are using Envoy 1.11.0 on Ubuntu 16.04 in AWS.

I have open-sourced our load-testing tool, pinned to the exact Python version, requests version, etc., so that the issue can be reliably reproduced.

Our systemd unit for running Envoy:

envoy.service

[Unit]
Description=Envoy Proxy
Requires=network-online.target
After=network-online.target

[Service]
Type=simple
Environment="ENVOY_CONFIG_FILE=/etc/envoy/envoy.yaml"
Environment="ENVOY_START_OPTS=--use-libevent-buffers 0 --parent-shutdown-time-s 60"
ExecStart=/usr/local/bin/envoy-restarter.py /usr/local/bin/start-envoy.sh
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -TERM $MAINPID
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

We are using the exact Python restarter tool that is currently in master.

The script that we have the Python restarter tool run is this:

#!/bin/bash

# Invoked by the hot-restarter, which sets RESTART_EPOCH for each new child.
# $ENVOY_START_OPTS is intentionally left unquoted so it word-splits into flags.
exec /usr/local/bin/envoy -c "$ENVOY_CONFIG_FILE" $ENVOY_START_OPTS --restart-epoch "$RESTART_EPOCH"

Config:

---
static_resources:
  listeners:
  - name: listener_http
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80

    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: route_config
            virtual_hosts:
            - name: abc.mycompany.com
              domains: ["abc.mycompany.com", "*.abc.mycompany.com"]
              routes:
                - match:
                    prefix: "/"
                  route:
                    cluster: abc

          http_filters:
          - name: envoy.router

  clusters:
  - name: abc
    connect_timeout: 1.0s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: abc.internal.mycompany.com
        port_value: 80
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 100000000
        max_pending_requests: 1000000000
        max_requests: 100000000
        max_retries: 1000000000

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

Logs:

Aug 06 21:16:27 hostname systemd[1]: Stopping Envoy Proxy...
Aug 06 21:16:27 hostname envoy-restarter.py[4406]: [2019-08-06 21:16:27.955][4408][warning][main] [source/server/server.cc:463] caught SIGTERM
Aug 06 21:16:27 hostname envoy-restarter.py[4406]: [2019-08-06 21:16:27.955][4408][info][main] [source/server/server.cc:567] shutting down server instance
Aug 06 21:16:27 hostname envoy-restarter.py[4406]: [2019-08-06 21:16:27.955][4408][info][main] [source/server/server.cc:521] main dispatch loop exited
Aug 06 21:16:27 hostname envoy-restarter.py[4406]: [2019-08-06 21:16:27.958][4408][info][main] [source/server/server.cc:560] exiting
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: starting hot-restarter with target: /usr/local/bin/start-envoy.sh
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: forking and execing new child process at epoch 0
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: forked new child process with PID=4408
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: got SIGTERM
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: sending TERM to PID=4408
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: got SIGTERM
Aug 06 21:16:28 hostname envoy-restarter.py[4406]: all children exited cleanly
Aug 06 21:16:28 hostname systemd[1]: Stopped Envoy Proxy.

mattklein123 added the enhancement label on Aug 6, 2019
@mattklein123
Member

We don't have this functionality today, but you can script around it using /healthcheck/fail and then wait for connections to drain.
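
A minimal sketch of that kind of wrapper (not from this thread), assuming the admin interface from the config above is reachable at 127.0.0.1:9901 and that the unit is installed as envoy.service:

#!/bin/bash
# Fail the Envoy health check so load balancers stop sending new traffic,
# wait a fixed drain window for in-flight requests to finish, then stop Envoy.
set -euo pipefail

ADMIN="127.0.0.1:9901"   # admin listener from envoy.yaml
DRAIN_SECONDS=60         # arbitrary drain window

curl -s -X POST "http://${ADMIN}/healthcheck/fail" > /dev/null
sleep "${DRAIN_SECONDS}"
systemctl stop envoy.service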

@naftulikay
Author

@mattklein123 thank you! Will stay subscribed for potential updates.

@derekargueta
Member

Scripting around /healthcheck/fail and sleeping is also what we do for draining. After failing the health check, we watch the listener's downstream_rq_active metric until it drops to an acceptable range before terminating.
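
A sketch of that polling step, assuming the admin interface at 127.0.0.1:9901, the ingress_http stat prefix from the config above, and an arbitrary threshold and deadline:

#!/bin/bash
# After failing the health check, poll the active-request gauge from /stats
# until it drops below a threshold or a deadline passes, then stop Envoy.
set -euo pipefail

ADMIN="127.0.0.1:9901"
STAT="http.ingress_http.downstream_rq_active"
THRESHOLD=5
DEADLINE=$((SECONDS + 120))

curl -s -X POST "http://${ADMIN}/healthcheck/fail" > /dev/null

while [ "${SECONDS}" -lt "${DEADLINE}" ]; do
  # /stats prints one "name: value" pair per line.
  active="$(curl -s "http://${ADMIN}/stats" | awk -F': ' -v s="${STAT}" '$1 == s { print $2 }')"
  if [ "${active:-0}" -le "${THRESHOLD}" ]; then
    break
  fi
  sleep 1
done

systemctl stop envoy.service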

@naftulikay
Author

I'm considering building a more robust wrapper for doing hot reloads and graceful shutdowns 🤔

@naftulikay
Author

naftulikay commented Aug 7, 2019

FYI @derekargueta when I run my Python load-testing tool against Envoy, then POST to /healthcheck/fail, I see that all responses have Connection: close, but new connections are still allowed, which won't properly evict Envoy from my load balancer.

Due to performance constraints, I have to use NLBs in my deployment, which forward TCP packets, rather than ALBs, which understand HTTP. It therefore isn't possible for me to use L7 HTTP health checks with my load balancer, only TCP opens to test connectivity.

Is there a way to tell Envoy to stop accepting new TCP connections and finish up and close active HTTP connections?

@mattklein123
Member

@naftulikay use the health-checking filter and have the NLB health check Envoy; it will then stop sending new connections.
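
For reference, a sketch of what adding the health-check filter to the http_filters chain in the config above could look like (the /healthz path is an assumption; pass_through_mode: false makes Envoy answer the check itself, and it starts returning 503 once /healthcheck/fail has been POSTed):

          http_filters:
          - name: envoy.health_check
            config:
              pass_through_mode: false
              headers:
              - name: ":path"
                exact_match: "/healthz"
          - name: envoy.router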

@naftulikay
Author

So when an AWS load balancer is configured as a network load balancer (i.e. TCP forwarding), you can't use HTTP health checks against the instance. Rather, you can only enable or disable plain TCP-based health checks (i.e. "can I open a connection to the instance on the given port?").

See the AWS documentation for more details.

I'm currently digging through the HTTP health check filters to try to understand more about how they work.

@mattklein123
Member

We use NLBs at Lyft with HTTP health checks.

@naftulikay
Author

I'll ask our TAMs, but we have attempted to set up HTTP health checks on TCP target groups and the API rejects the creation. Our target groups are configured like so:

resource "aws_lb_target_group" "http" {
  name = "http"
  port = 80
  protocol = "TCP"
  vpc_id = "${var.vpc_id}"

  deregistration_delay = "${var.deregistration_delay}"

  health_check {
    enabled = true
    interval = 10
    port = "traffic-port"
    protocol = "TCP"
    healthy_threshold = 5
    unhealthy_threshold = 5
  }

  tags {
    Name = "http"
  }
}

@derekargueta
Member

derekargueta commented Aug 8, 2019

(Seconding NLBs + HTTP health checks, which is what we use as well.)

@naftulikay the protocol in health_check should be HTTP. The documentation you linked only states that the default is TCP for NLBs, but you can set it to HTTP and set the path field to the endpoint to health check (see the HealthCheckProtocol section).
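
If so, the adjusted health_check block would look something like this (the /healthz path is an assumption and should match whatever endpoint Envoy's health-check filter serves):

  health_check {
    enabled = true
    interval = 10
    port = "traffic-port"
    protocol = "HTTP"
    path = "/healthz"
    healthy_threshold = 5
    unhealthy_threshold = 5
  }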

@derekargueta
Member

Looks like the issue you're experiencing may be related to this thread: hashicorp/terraform-provider-aws#2708 (comment)

@stale

stale bot commented Sep 7, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale label on Sep 7, 2019
@naftulikay
Author

Can someone tag this as help-wanted? This is still desirable functionality, and I think the community would benefit from it even if it isn't implemented in the short term.

stale bot removed the stale label on Sep 9, 2019
mattklein123 added the help wanted label on Sep 9, 2019
@wrowe
Contributor

wrowe commented Oct 10, 2019

FWIW this is a duplicate issue, see also #2920 (and other previous tickets which were marked solved.)

rgs1 pushed a commit to rgs1/envoy that referenced this issue Mar 7, 2022
There's no standard way in Thrift to notify a downstream
that a server is about to go away, so this change proposes
a configurable header name that will be used to signal
an ongoing drain operation.

Once we generally agree on the approach, I'll add:

* tests
* make hard-coded stuff configurable
* possibly add an integration test

Somewhat similar to envoyproxy#7841.

Signed-off-by: Raul Gutierrez Segales <[email protected]>