Need a setting to delay start of shutdown sequence #4156
Comments
Same issue.
Hey @millermatt, the issue you are hitting is probably very specific to your setup and configuration. I'd recommend also adding your config files.
For sure, thank you. I'm attaching files here for a bare minimum config that demonstrates the problem. Here is my test process:
Given what I've described in this issue's description, we are guaranteed downtime, because
Here is the curl loop command I use:
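A minimal sketch of such a loop (the hostname, path, and timing below are placeholders, not the exact command used in this test):

```bash
# Hammer the gateway and log only failures; a quiet terminal during a rollout
# means zero downtime. Hostname/path and sleep interval are placeholders.
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 2 "https://mygateway.example.com/") || code="curl_error"
  if [ "$code" != "200" ]; then
    echo "$(date '+%H:%M:%S') request failed: $code"
  fi
  sleep 0.2
done
```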
This is what the curl loop output looks like as soon as a proxy pod starts terminating:
Just starting with Envoy Gateway, and +1 here. I've used a custom-configured Envoy in EKS with the ALB controller under a five-nines uptime goal. Unfortunately we didn't use xDS, so we used plain rollouts for (frequent-ish) configuration changes, including traffic splitting. That meant we had to have no-downtime rollouts, and we followed this EKS guide for load balancing pretty closely. But (AFAIK) xDS doesn't solve for "normal" proxy pod churn anyway.
In my experience, what is described here was 100% a requirement to avoid (in our case) failed requests. I bet it's junk, but we used a hand-rolled shutdown handler sidecar modeled after what I saw in Contour and discussed in a variety of Envoy/Istio and related GitHub issues. That is where this kind of proposed config is used: external, target-specific probe failures get observed first, and the draining process is then watched before process termination is admitted.
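For what it's worth, a much-simplified sketch of that general pattern (fail the LB-facing health check first, wait for the LB to react, only then let draining and termination proceed), expressed as a plain preStop hook rather than a full sidecar. The admin port, endpoint, and sleep value below are assumptions, not the actual handler described above.

```yaml
# Illustrative only: a pared-down preStop variant of the "fail the LB probe,
# wait, then terminate" pattern. Port, endpoint, and sleep are assumptions.
containers:
  - name: envoy
    # image, ports, probes, etc. omitted
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - |
              # Start failing Envoy's health check so the fronting LB marks
              # this target unhealthy on its next checks.
              curl -s -X POST http://localhost:19000/healthcheck/fail
              # Keep accepting traffic while the LB observes the failures and
              # stops routing new connections here, then allow SIGTERM.
              sleep 30
```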
@ross-at-coactive we do something similar in Envoy Gateway where the
@millermatt looks like you are using explicit NLB health checks.
My suggestion is to define a ClientTrafficPolicy https://gateway.envoyproxy.io/docs/tasks/traffic/client-traffic-policy/ with a healthCheck path, and to also add the corresponding NLB annotation for the health check path, as well as for the interval.
@ross-at-coactive can surely clarify more, but the routines are not similar. The Envoy Gateway routine you refer to causes all new requests to be refused as soon as the pod starts terminating. The routine @ross-at-coactive refers to says in its docs:
This is the key part missing in the current Envoy Gateway shutdown implementation. There is no way to set a delay between the time when health checks start failing and graceful termination begins.
Yes, in an effort to cause them to fail ASAP. Here is my current config after applying the ClientTrafficPolicy below and adding the healthcheck-interval annotation with the lowest allowed value (5):

```yaml
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: ssl
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: https
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-success-codes: "200"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: <redacted>
    service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.timeout_seconds=300
    service.beta.kubernetes.io/aws-load-balancer-type: external
```

Here is the resulting config in the AWS console:

At your suggestion, I've added this ClientTrafficPolicy:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: mygateway
  namespace: proxy-namespace
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: mygateway
  healthCheck:
    path: "/healthz"
```

All other config is from the zip file I uploaded above. There is still downtime as soon as any Envoy Proxy pod starts terminating.
I get the impression that your understanding is that the NLB will stop sending new non-health-check requests to the Envoy Proxy pod as soon as the pod starts failing health checks. This is not true. New requests will continue to arrive while the proxy container is draining and not accepting new requests. Or perhaps you believe the pod will handle new requests while the pod is terminating? This is also not true.
I share this concern, as zero-downtime upgrades are critical for us and we have not yet been able to achieve them without some non-standard additions. @arkodg, I'm unsure how your proposed solution would fully address the issue because there is still a delay in the NLB health check. Even with faster health checks, the NLB will continue sending traffic to the pod for a few seconds after it starts terminating. If Envoy stops accepting new connections immediately upon termination, these requests will fail, leading to downtime. The core issue remains: Envoy stops accepting new connections too quickly, and there is no configurable option to delay this behavior to allow the NLB time to stop routing traffic to the pod.
When the preStop hook kicks in, we fail the health check (server) and kickstart graceful draining immediately. This means the Envoy proxy will continue to accept new connections (to account for the delay in the fronting LB) and discourage new requests (but still accept them). This is done until there are no connections left or the drain timeout has been reached (the default is 600s). What could be happening here is that, due to the low number of connections, the connections may be
https://gateway.envoyproxy.io/docs/api/extension_types/#shutdownconfig
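For reference, these drain settings live in the ShutdownConfig of the EnvoyProxy resource; a sketch with illustrative values (the resource still has to be wired to the GatewayClass or Gateway, e.g. via parametersRef, to take effect):

```yaml
# Sketch: shutdown/drain tuning on the EnvoyProxy resource (see the
# ShutdownConfig link above). The duration values are illustrative.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
  namespace: proxy-namespace
spec:
  shutdown:
    # Maximum time to wait for graceful draining before the proxy is allowed
    # to exit (600s is the default mentioned above).
    drainTimeout: 600s
    # Minimum time reserved for endpoint deregistration (e.g. a fronting LB
    # noticing failing health checks) before shutdown proceeds.
    minDrainDuration: 15s
```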
(I've removed a response I added that posted results of a bad test. I had changed the loop of curls to hit
@arkodg Thanks for all the info. I have been thinking about this and was missing exactly what you describe above -- the health check that actually gets failed at the start of shutdown, which the NLB (previously I used an ALB) uses for eventual eviction. Is that what you referenced above,
Things are looking great now! Thank you! My test is showing no lost requests across multiple rollout restarts.
For whatever reason (likely bad tests on my part), the behavior of still accepting new requests once graceful shutdown started never seemed to be the case for me. With a minimal config and your suggestions here, it definitely is acting that way.
I'll still be checking this out, but thank you both @millermatt and @arkodg; this is an important question for me that I haven't really gotten to yet. I'll be following this logic this week and verifying it too.
@ross-at-coactive explicit listener-level / traffic-port health checks can be configured using the ClientTrafficPolicy, or you can reuse the pod-level one (/ready on port 19001). It looks like the minDrainDuration of
Also, my hope was that explicit health checks wouldn't be needed on the NLB, and that the corresponding k8s controllers (associated with the NLB) would peek into the API Server to see the status of the proxy endpoints and reconfigure the endpoint pool.
Description:
Current
The Envoy Proxy container stops accepting new connections as soon as the Envoy Proxy pod starts terminating. For reasons described below, this means that zero-downtime Envoy Proxy deployment rollouts are not possible when using an AWS NLB.
Desired
The Envoy Proxy container should continue to accept new connections for a configurable amount of time after the Envoy Proxy pod starts terminating. This would allow zero-downtime Envoy Proxy deployment rollouts.
Key points
Timing details
Here's a timeline where the pod starts terminating immediately following a successful health check from the NLB:
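As an illustration of the gap (assuming, for example, a 5-second health check interval and an unhealthy threshold of 2, the values used later in this thread):

- t = 0s: the pod enters Terminating and, today, the proxy immediately stops accepting new connections.
- t ≈ 5s: the NLB's first health check against the pod fails.
- t ≈ 10s: the second check fails and the target is marked unhealthy; only then does the NLB stop routing new connections to it.
- Result: for roughly 10 seconds (plus the NLB's own reaction time), new requests keep arriving at a proxy that refuses them, so some downtime is guaranteed.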
Possible solution
- Add a new health check endpoint to the proxy container (e.g. /lbhealthz) that returns a 200 normally. This is important: a) it is not the same endpoint used for pod liveness/readiness probes, and b) ShutdownManager knows about it and can toggle it to return non-200 status codes.
- Add a gracefulShutdownDelay (default: 0 for backwards compatibility) or similar property to ShutdownConfig that controls the timing between when the pod goes into a terminating state and when shutdown_manager.go calls postEnvoyAdminAPI("healthcheck/fail").
- When the pod starts terminating, shutdown_manager.go:ShutDown() gets invoked as it currently does.
- shutdown_manager.go:ShutDown() would call a new /lbhealthcheck/fail endpoint on the proxy container immediately, causing /lbhealthz to no longer return a 200 status code. NLB health checks start failing, but new and existing requests continue to get processed.
- After gracefulShutdownDelay seconds, shutdown_manager.go would call /healthcheck/fail and /drain_listeners?graceful&skip_exit.
- AWS NLB users could set gracefulShutdownDelay to match their NLB's UnhealthyThresholdCount * HealthCheckIntervalSeconds to allow new requests while the NLB health checks are failing. (A hypothetical sketch of this follows below.)