Need a setting to delay start of shutdown sequence #4156

Closed
millermatt opened this issue Sep 4, 2024 · 14 comments
millermatt commented Sep 4, 2024

Description:


Current

Envoy Proxy container stops accepting new connections as soon as the Envoy Proxy pod starts terminating. For reasons described below, this means that zero downtime Envoy Proxy deployment rollouts are not possible when using AWS+NLB.

Desired

Envoy Proxy container should continue to accept new connections for a configurable amount of time when the Envoy Proxy pod starts terminating. This will allow zero downtime Envoy Proxy deployment rollouts.


Key points

Timing details

Here's a timeline where the pod starts terminating immediately after a successful health check from the NLB:

  • time 0s: the pod starts terminating
  • time 0s: the envoy proxy container stops taking new requests, but waits for existing requests to finish
  • time 0s-5s: the NLB receives new requests and sends them to the terminating pod, because the pod has not yet failed 2 consecutive health checks. Those requests fail.
  • time 5s: the NLB sends a health check request; it fails (1 of 2)
  • time 5s-10s: the NLB receives new requests and still sends them to the terminating pod, because the pod has not yet failed 2 consecutive health checks. Those requests fail.
  • time 10s: the NLB sends a health check request; it fails (2 of 2)
  • time 10s+: the NLB no longer sends requests to the terminating pod

Possible solution

  1. Configure the NLB target health checks to hit a new proxy route (e.g. /lbhealthz) that returns a 200 normally. This is important: a) It is not the same endpoint used for pod liveness/readiness probes and b) ShutdownManager knows about it and can toggle it to return non-200 status codes.
  2. Add a gracefulShutdownDelay (default: 0 for backwards compatibility) or similar property to ShutdownConfig that controls the timing between when the pod goes into a terminating state and when shutdown_manager.go calls postEnvoyAdminAPI("healthcheck/fail").
    • the shutdown container's preStop hook calls shutdown_manager.go:ShutDown() as it currently does
    • shutdown_manager.go:ShutDown() would call a new /lbhealthcheck/fail endpoint on the proxy container immediately, causing /lbhealthz to no longer return a 200 status code. NLB health checks start failing, but new and existing requests continue to get processed.
    • after gracefulShutdownDelay seconds, shutdown_manager.go would call /healthcheck/fail and /drain_listeners?graceful&skip_exit.

AWS NLB users could set gracefulShutdownDelay to match their NLB's UnhealthyThresholdCount * HealthCheckIntervalSeconds to allow new requests while the NLB health checks are failing.
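
To make this concrete, here is a rough sketch of how the proposed field might look in the EnvoyProxy shutdown config. gracefulShutdownDelay is hypothetical and does not exist today; the 10s value assumes an NLB with UnhealthyThresholdCount 2 and HealthCheckIntervalSeconds 5:

# Hypothetical sketch only: gracefulShutdownDelay is the proposed field, not an existing one.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
spec:
  shutdown:
    # keep /lbhealthz failing but continue accepting new connections for this long
    # (2 failed checks x 5s interval = 10s) before calling /healthcheck/fail and draining
    gracefulShutdownDelay: 10s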


czczycz commented Sep 5, 2024

Same issue.
My proposed solution is:
Implement a controller to watch the status of envoyproxy Pods. When a Pod is found to be in the "Terminating" state, call the NLB (Network Load Balancer) API to remove traffic from this Pod.


arkodg commented Sep 5, 2024

hey @millermatt, the issue you are hitting is probably very specific to your setup and configuration. I'd recommend also adding your EnvoyProxy config here so it's easier to compare and reproduce.


millermatt commented Sep 6, 2024

For sure, thank you. I'm attaching files here for a bare minimum config that demonstrates the problem.
envoy-gateway-config-01-millermatt.zip

Here is my test process:

  1. Deploy the attached config and wait for the NLB to be available with healthy targets in the target group.
  2. Start curl in a loop against the load balancer
  3. Do a rollout-restart on the Envoy Proxy Deployment created by the gateway controller
  4. Observe that some of the curl requests are lost as soon as an Envoy Proxy Pod starts terminating

Given what I've described in this issue's description, we are guaranteed downtime, because

  1. the terminating pods immediately stop accepting new requests
  2. the load balancer target must experience at least 2 health check request failures before it stops sending new non-health-check requests to the terminating pod

Here is the curl loop command I use:

while true ; do echo "$(date) $(curl -sv https://<load-balancer-host> 2>/dev/stdout | grep '< HTTP')" ; done

This is what the curl loop output looks like as soon as a proxy pod starts terminating:

Fri Sep  6 10:35:29 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:30 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:30 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:30 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:31 CDT 2024
Fri Sep  6 10:35:36 CDT 2024
Fri Sep  6 10:35:42 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:42 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:42 CDT 2024 < HTTP/1.1 404 Not Found
Fri Sep  6 10:35:43 CDT 2024 < HTTP/1.1 404 Not Found

@ross-at-coactive

Just starting with envoy gateway, and +1 here

I've used a custom-configured Envoy in EKS with the ALB controller under a five-nines uptime goal. Unfortunately we didn't use xDS, so we used plain rollouts for (frequent-ish) configuration changes, including traffic splitting. That meant we had to have no-downtime rollouts, and we followed this EKS guide for load balancing pretty closely. But (AFAIK) xDS doesn't solve for "normal" proxy pod churn anyway.

In my experience, what is described here was 100% a requirement to avoid (in our case) 503s during every rollout or otherwise-triggered pod shutdown during normal operations, like scale-in evictions after a period of high traffic. Our customized shutdown logic had exactly this kind of configurable delay in a preStop hook before initiating the graceful draining process, with parameters synced to the ALB health checks pointing at the admin server via helm templating voodoo. An AWS ALB/NLB can't get the endpoint/endpoint-slice updates "fast", so it has to rely on "slow" health checks pointed at Envoy's admin readiness endpoint, which we can fail on the internal termination signal.

I bet it's junk, but we used this hand-rolled shutdown handler sidecar modeled after what I saw in Contour and discussed in a variety of envoy/istio and related GitHub issues. Here's where this proposed config is used: it lets external, target-specific probe failures be observed, and then watches the draining process before admitting process termination.
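
In outline, the delay itself is the pattern below (a generic illustration only, not the actual sidecar linked above; the 20s value is an assumed placeholder to be synced with the load balancer's UnhealthyThreshold x HealthCheckInterval):

# Generic illustration of the "delay before draining" pattern on the proxy container;
# the real sidecar also failed the health check endpoint first and then watched the
# drain before allowing termination.
lifecycle:
  preStop:
    exec:
      # keep serving while the ALB/NLB observes failed health checks, then exit the hook
      # so the normal graceful drain and SIGTERM handling can begin
      command: ["sh", "-c", "sleep 20"]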


arkodg commented Sep 10, 2024

@ross-at-coactive we do something similar in Envoy Gateway where the preStop hook calls an elaborate shutdown routine, more in

func Shutdown(drainTimeout time.Duration, minDrainDuration time.Duration, exitAtConnections int) error {

@millermatt looks like you are using explicit NLB health checks

          # configure a network load balancer
          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: ssl
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: https
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-success-codes: "200-499"
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "2"
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"

my suggestion is to define a ClientTrafficPolicy https://gateway.envoyproxy.io/docs/tasks/traffic/client-traffic-policy/ with a healthCheck section, something like

healthCheck:
  path: "/healthz"

and also add another annotation for path

service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz

as well as for interval

service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "2"

Reference: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.3/guide/service/annotations/#health-check


millermatt commented Sep 10, 2024

@arkodg

we do something similar in Envoy Gateway where the preStop hook calls an elaborate shutdown routine, more in

@ross-at-coactive can surely clarify more, but the routines are not similar. The Envoy Gateway routine you refer to causes all new requests to be refused as soon as the pod starts terminating. The routine @ross-at-coactive refers to says in its docs:

set a delay between health check failing and graceful connection draining, to account for some period during which new connections might still be arriving

This is the key part missing in the current Envoy Gateway shutdown implementation. There is no way to set a delay between the time when health checks start failing and graceful termination begins.


looks like you are using explicit NLB health checks

Yes, in an effort to cause them to fail ASAP.

Here is my current config after applying the ClientTrafficPolicy below and adding the healthcheck-interval annotation with the lowest allowed value (5, not 2):

metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: ssl
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: https
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-success-codes: "200"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: <redacted>
    service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.timeout_seconds=300
    service.beta.kubernetes.io/aws-load-balancer-type: external

Here is the resulting config in the AWS console:
(screenshot of the resulting NLB health check configuration in the AWS console omitted)


At your suggestion, I've added this ClientTrafficPolicy:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: mygateway
  namespace: proxy-namespace
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: mygateway
  healthCheck:
    path: "/healthz"

All other config is from the zip file I uploaded above.

There is still downtime as soon as any Envoy Proxy pod starts terminating.

Tue Sep 10 09:52:23 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:23 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:24 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:24 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:24 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:25 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:25 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:26 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:26 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:27 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:27 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:28 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:28 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:29 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:29 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:30 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:30 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:31 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:31 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:31 CDT 2024 < HTTP/1.1 503 Service Unavailable
Tue Sep 10 09:52:32 CDT 2024
Tue Sep 10 09:52:37 CDT 2024
Tue Sep 10 09:52:43 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:43 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:44 CDT 2024 < HTTP/1.1 200 OK
Tue Sep 10 09:52:44 CDT 2024 < HTTP/1.1 200 OK

I get the impression that your understanding is that the NLB will stop sending new non-health-check requests to the Envoy Proxy pod as soon as the pod starts failing health checks. This is not true. New requests will continue to arrive while the proxy container is draining and not accepting new requests.

Or perhaps you believe the pod will handle new requests while the pod is terminating? This is also not true. The preStop hook on the shutdown manager container immediately tells the proxy container to stop accepting new requests and to finish processing existing ones.


astjossem commented Sep 10, 2024

I share this concern, as zero downtime upgrades are critical for us and we've not yet been able to achieve them without some non-standard additions.

@arkodg , I’m unsure how your proposed solution would fully address the issue because there is still a delay in the NLB health check. Even with faster health checks, the NLB will continue sending traffic to the pod for a few seconds after it starts terminating. If Envoy stops accepting new connections immediately upon termination, these requests will fail, leading to downtime.

The core issue remains: Envoy stops accepting new connections too quickly, and there is no configurable option to delay this behavior to allow the NLB time to stop routing traffic to the pod.


arkodg commented Sep 10, 2024

When the preStop hook kicks in, we fail the health check (server) and kick off graceful draining immediately. This means the envoy proxy will continue to accept new connections (to account for the delay in the fronting LB) and discourage new requests (but still accept them). This is done until there are no connections left or the drain timeout has been reached (default is 600s).
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining#draining

What could be happening here is that, due to a low number of connections, the connection count may drop to 0, and since the default minDrainDuration is 5s the proxy may be terminating after only 5s. This minDrainDuration was added by @davidalger as another way to delay termination until the fronting LB has updated its pool.
Can you try setting minDrainDuration to something higher in the EnvoyProxy config, like 25s?

shutdown:
  minDrainDuration: 25s

https://gateway.envoyproxy.io/docs/api/extension_types/#shutdownconfig
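
For reference, a fuller sketch of where that stanza lives in an EnvoyProxy resource (metadata names here are assumed placeholders; the field itself is documented in the link above):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
  namespace: proxy-namespace
spec:
  shutdown:
    # give the fronting NLB time to observe failed health checks and drain the target
    minDrainDuration: 25s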


millermatt commented Sep 10, 2024

(I've removed a response I added that had posted results of a bad test. I had changed the loop of curls to hit /healthz to verify the ClientTrafficPolicy early and failed to change it back from that.)

@ross-at-coactive

@arkodg Thanks for all the info. I have been thinking about this and missing exactly what you describe above -- the health check that actually gets failed at the start of shutdown that the NLB (previously I used an ALB) uses for eventual eviction. Is that what you referenced above, /healthz (on the traffic port)? I was used to :9901/ready in plain envoy.

@millermatt

Things are looking great now! Thank you!

My test is showing no lost requests across multiple rollout restarts.

we fail the health check (server) and kickstart graceful draining immediately, this means the envoy proxy will continue to accept newer connections (to account for the delay in the fronting LB) and discourage new requests (but still accept them)

For whatever reason (likely bad tests on my part), the proxy never seemed to keep accepting new requests once graceful shutdown started.

With a minimal config and your suggestions here it definitely is acting that way.


ross-at-coactive commented Sep 10, 2024

I'll still be checking this out, but thank you both @millermatt and @arkodg; this is an important question for me that I haven't really gotten to yet. I'll be following this logic this week and verifying.


arkodg commented Sep 16, 2024

@arkodg Thanks for all the info. I have been thinking about this and missing exactly what you describe above -- the health check that actually gets failed at the start of shutdown that the NLB (previously I used an ALB) uses for eventual eviction. Is that what you referenced above, /healthz (on the traffic port)? I was used to :9901/ready in plain envoy.

@ross-at-coactive explicit listener-level / traffic-port health checks can be configured using the ClientTrafficPolicy, or you can reuse the pod-level one (/ready on port 19001).
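
For example, reusing the pod-level endpoint would mean pointing the NLB health check annotations at it, roughly like this (a sketch; the port and path come from the comment above, the other values are assumed):

# Sketch: point the NLB health check at the proxy's pod-level readiness endpoint
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "19001"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ready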

It looks like the minDrainDuration of 5s may be too low, and we may want to bump the default up to 10s.


arkodg commented Sep 16, 2024

Also, my hope was that explicit health checks wouldn't be needed on the NLB, and that the corresponding k8s controllers (associated with the NLB) would peek into the API server to see the status of the proxy endpoints and reconfigure the endpoint pool.
For example, in GKE this is retrieved using health check probes to kube-proxy, which in turn retrieves it from the API server.
