Support adding a delay in node termination to honor the ELB connection draining interval #4673
Comments
Which ELB are you referencing? How much of a delay would you need? Karpenter currently waits until the node is fully drained, terminates the instance, then removes the node. There's a period of time where the instance can still be shutting down when the node is deleted. Would it be sufficient to wait for the instance to be fully terminated before deleting the node?
I think this should be solved at the pod level using terminationGracePeriodSeconds and a lifecycle preStop hook. For us it works perfectly; here is an example: https://github.com/arvatoaws-labs/wordpress-bedrock-helm-chart/blob/master/templates/deployment-worker.yaml
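A minimal sketch of that pod-level approach (not taken from the linked chart; the deployment name, image, and the 300s drain window are illustrative assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web              # hypothetical name, not from the linked chart
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      # must cover the preStop sleep plus the app's own shutdown time
      terminationGracePeriodSeconds: 330
      containers:
        - name: web
          image: nginx:1.25      # placeholder image
          lifecycle:
            preStop:
              exec:
                # keep the pod running while the load balancer stops sending it traffic
                command: ["sleep", "300"]
```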
If implemented, kubernetes/enhancements#4212 (an API for declarative node maintenance / drains) might help with this.
Adding a delay in pod shutdown solves the problem for the node where the pod is running, but not in a use case where...
We've been thinking about driving termination of nodes through adding a NoExecute taint on the node so that all DS pods and other non-drainable pods can clean up before the node is deleted. Would this be sufficient? Rather than add a per-node delay, I'd rather drive this by when the pods themselves say everything has been cleaned up.
@njtran I'm not sure why this issue was closed in favor of the other issue, as the other issue does not address this specific use case. When using AWS NLB in instance mode with externalTrafficPolicy=cluster, a node will still receive traffic from the NLB during deregistration. This is normally between 180s-200s. Since the traffic policy is cluster, the node doesn't need to host the target pod, as kube-proxy will route traffic to another node where the pod exists. The issue here is that, by the time the node reaches a state at which Karpenter determines it can destroy it, the deregistration delay on the NLB hasn't elapsed and the node is still likely handling connections and forwarding them to other nodes. Karpenter terminating the node breaks this and results in scores of 5xx errors. This is a well-known issue and was resolved with cluster-autoscaler by adding a termination-wait lifecycle policy to the ASG, so that a termination signal to the ASG started a timer before the node was destroyed. Being able to tell Karpenter to wait 'n' minutes after it knows it can delete a node before actually doing so would resolve this issue. This is preferable to handling it with terminationGracePeriodSeconds and preStop, as that significantly impacts rollout timing on large deployments. Telling pods to wait 200s makes in-cluster rolling restarts terrible. With the Karpenter approach your deployment can roll fairly quickly, maintaining all connections, leaving just an empty node waiting to be removed. Can we consider reopening this? EDIT: This appears to have existed in versions prior to 0.32.x: https://github.com/aws/karpenter-provider-aws/blob/release-v0.31.x/pkg/apis/crds/karpenter.sh_provisioners.yaml#L294-L300 I'm not sure why it was removed, but if my assumption about its behavior is correct, that seems to be exactly what would resolve this issue.
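For comparison, the ASG-side mechanism referenced above is an EC2 Auto Scaling lifecycle hook on the termination transition. A minimal CloudFormation sketch, where the ASG name parameter and the 300s heartbeat are assumptions chosen to roughly match an NLB deregistration delay:

```yaml
# Sketch of the ASG lifecycle hook commonly used alongside cluster-autoscaler.
# The ASG name parameter and the 300s timeout are illustrative assumptions.
Parameters:
  NodeGroupASGName:
    Type: String                # name of the existing node-group Auto Scaling group
Resources:
  DrainDelayHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref NodeGroupASGName
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
      HeartbeatTimeout: 300     # keep the instance alive for the ELB connection-draining window
      DefaultResult: CONTINUE   # let termination proceed once the timeout expires
```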
Well explained, @awiesner4. As per our testing...
Thank you, @infa-ddeore. That actually makes sense based on the replacement functionality in NodePools. So the basic problem here is (and I still need to test this to prove it out) that we can theoretically set...
Hey @awiesner4, sure, I can re-open. This was closed in favor of kubernetes-sigs/karpenter#621, as that controls and ensures that all pods on a node are cleaned up prior to executing node termination. The disruption settings you're seeing happen well before this process. I'd recommend reading more about this in the docs, as the emptiness you're referring to in v0.31.1 is not the same thing here. Is there a signal that Karpenter can consume to know when it's waited long enough? Does seeing 5xx errors impact performance on your other NLBs?
Yeah, I think that's where the disconnect is as it relates to this issue. Pod cleanup (i.e., connection draining) from a node is not equal to connection cleanup for that node when using AWS NLB in instance mode with externalTrafficPolicy=cluster. To illustrate this, let's say you have a 2-node cluster with NodeA and NodeB. Both NodeA and NodeB have pods running on them that the AWS NLB is sending connections to. A connection from an NLB in this setup can be:

- NLB -> NodeA -> pod on NodeA
- NLB -> NodeA -> (kube-proxy) -> pod on NodeB
- NLB -> NodeB -> pod on NodeB
- NLB -> NodeB -> (kube-proxy) -> pod on NodeA
In this scenario, Karpenter decides the NodeA EC2 instance is less cost effective than another option, so it decides to consolidate and starts draining the pod on NodeA (and taints the node appropriately). When it does this, the aws-load-balancer-controller picks up the taint and begins deregistering NodeA from the NLB target group. When the pod on NodeA finishes draining, that does not mean that all connections to that node have completed and been drained, as the NLB is on a separate loop, since it can use any node in the node group to reach any pod. Therefore, NodeA still likely has connections open from the NLB that go through NodeA to another pod on NodeB that is not draining. The problem here is Karpenter assumes that the pod removal on NodeA means it's good to terminate NodeA. Except that's not true, because the NLB hasn't finished deregistering NodeA yet and still has open connections that go through NodeA to the pod(s) on NodeB. So when Karpenter kills NodeA before the NLB deregistration completes, this results in client-facing 5xx errors.
A perfect solution would be for Karpenter to query the NLB to determine that the node is no longer registered and then, once the deregistration is complete, terminate the node. However, a perfectly acceptable solution -- and comparable to the current solution with cluster-autoscaler -- is simply to have a timer for Karpenter that can be set to any value and says "wait 'x' seconds after the node meets termination conditions before actually sending the termination signal to AWS." Essentially just a dumb sleep. This way, we could set that equal to the deregistration delay value of the NLB, and the behavior would be something like:

- Karpenter drains the pods on the node and the node meets its termination conditions.
- Karpenter waits the configured delay, set equal to the NLB's deregistration delay.
- By the time the delay elapses, the NLB has finished deregistering the node, and Karpenter terminates the instance without breaking in-flight connections.
As a side question, @njtran ... one way I'm looking to hack around this limitation right now is as follows; I'd be interested to know if this would work based on your understanding: run an extra DaemonSet on the ingress nodes whose only job is to sleep in its preStop hook for roughly the NLB deregistration delay, so that Karpenter has to wait for it before terminating the node.
This would allow our actual Deployments to restart in a normal fashion and move around quickly, but the extra DaemonSet wouldn't terminate for a pre-determined time that should delay Karpenter's termination decision until after the NLB deregistration completes. I figure this should work, as I believe Karpenter does respect DaemonSets that were running, but ignores them if they restart. So a prolonged DaemonSet could help me slow Karpenter down. EDIT: I think this answers that question: kubernetes-sigs/karpenter#621 (comment) i.e., as long as I don't tolerate those taints on my DS, that DS will be respected like any Deployment, and should slow Karpenter down while not impacting my rollout restarts on the Deployments I actually care about. EDIT 2: I actually think the above link is wrong, since the DaemonSet controller in K8s automatically adds those tolerations to all DaemonSets: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations ... which means I think we're back to no real acceptable workaround on this.
The comment you've referred to is still correct, since Karpenter adds its own taint, not the unschedulable taint, to nodes that are being deleted. I think that DaemonSet workaround might work. You'd need to model the terminationGracePeriodSeconds as the sum of the pod's terminationGracePeriodSeconds + the deregistration delay.
I don't think a sleep is the right way to solve this. If Karpenter picks a value, that means that much longer before instances are terminated. There is rarely a valid use case for including a hacky hard-coded sleep, as this is essentially tech debt that we'd need to go back and re-evaluate any time Karpenter/EC2 has performance improvements that make the sleep unnecessary. Although there is a point that this might be better encapsulated within kubernetes-sigs/karpenter#740.
I understand and don't necessarily disagree completely on this. Just a point of clarification, though: I'm not asking Karpenter to have some hardcoded value. I'm asking that a parameter be exposed, with a default that preserves the current behavior, which users can override when they need it. The only other way I really see to solve this is to have Karpenter inspect NLB membership information for nodes that are targeted for termination, watch the NLB for when the node no longer exists as registered/draining (i.e., the node is not found in the target group), and then send the termination on that. Unfortunately, the DaemonSet solution didn't work, as the DaemonSet controller automatically adds the tolerations to applied DaemonSets, so there's no real workaround to this.
@awiesner4 is it correct to say that the target group for the LB waits the full deregistration delay? I'm trying to understand how much of an issue this will be for us; we are just about to roll out Karpenter. I think that often we will be waiting for at least the default deregistration delay during pod draining anyway.
@martinsmatthews That is correct. It's always a bit difficult to get the timing exactly right. In general, I believe your understanding is correct and that the deregistration clock basically starts when the aws-load-balancer-controller removes the node from the TargetGroupBinding, which then sends the deregistration signal to the ELB... not when the last pod is removed.
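For reference, the deregistration delay discussed throughout this thread is a target group attribute; with the aws-load-balancer-controller it is commonly set via a Service annotation. A minimal sketch for the instance-mode setup described above (the service name, selector, ports, and the 300s value are illustrative assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nlb                # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
    # target group attribute controlling how long the NLB drains a deregistering node
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=300
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: ingress-gateway           # hypothetical selector
  ports:
    - port: 443
      targetPort: 8443
```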
This is interesting. We're basically talking about something very similar to what we're discussing here: #2917 (comment). Basically, we need a minNodeDrainTime to make this work out well. For spot interruption this has utility because we want to make sure that the node stays alive as long as it can before draining, so that we maximize pod lifetime and reduce application downtime. I think for spot interruption we would want to set a combination of minNodeDrainTime and maxNodeDrainTime, but here it really seems like we only need minNodeDrainTime.
To be clear, this solution should still work for the reasons that @njtran called out. Karpenter uses its own taint (not the standard unschedulable taint that DaemonSets tolerate by default). More info here: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
One other thought that I had (this is probably not the best option, but you could do it): you could create another controller that adds a finalizer to the node and doesn't remove that finalizer until the node is 300s past its deletionTimestamp. This should get the behavior you want until/if this feature is directly built into Karpenter.
@jonathan-innis agreed, ...
Someone mentioned a workaround at aws/aws-node-termination-handler#316 (comment).
Thanks, @jonathan-innis. I hadn't come back to this in a while. I went back and retested and, you are correct, it does look as though creating a simple DaemonSet works. I was able to coordinate the timing and, while I haven't been able to run our k6s tests through it to see if we drop any connections, the timing I'm seeing between node removal by Karpenter and its deregistration in the NLB suggests that the DaemonSet approach should work in the interim until this functionality is added to Karpenter.
Is there an update on this? We recently migrated to Karpenter, and after quite some debugging we noticed the NLB issues arise because Karpenter terminates nodes before they can be deregistered from the NLB. It's usually not an issue when draining a node takes a bit of time, as the AWS load-balancer-controller will pick up the taint of the nodes being drained and start the deregistration process in the NLB. For nodes that are drained quickly, however, the AWS load-balancer-controller will not recognize this in time and the node will be terminated by Karpenter before it can be deregistered from the NLB. Adding a simple optional grace period before terminating nodes seems reasonable, as this is causing issues for everyone using Karpenter with NLBs in instance mode. Modifying the userdata and deploying a DaemonSet as mentioned in the workaround seems very hacky.
Is there an update on this? We are also stuck with nodes being terminated before they are drained from the load balancer.
Separately, as I've been looking more into this: does IP mode completely solve this problem? Since the connections are routed directly to the pods, doesn't that completely remove other nodes from the picture? Assuming you have no instance-mode-based connections being made on your cluster, draining a node that doesn't contain a pod that is handling a connection shouldn't matter -- you should just be able to handle your connection draining at the pod level at that point. cc: @awiesner4 @infa-ddeore We've been thinking a little bit more about how to handle this. I'm not in love with the time-based solution, since I'd rather get a signal from the ALB/NLB that it's safe to drain the node. For instance, this has become relevant with PVC volume attachments (see #6336), where we are proposing waiting for VolumeAttachment objects to be completely removed from the node before we consider the node fully drained and ready to be removed. Ideally, we could have some Kubernetes API-based representation of this draining completion from the ALB/NLB that we could trust. One option here is a finalizer -- though I'm not quite sure how this would fare in instance mode, when basically every node would have to receive this finalizer. If we were going to consider a drain delay, I'd think that there'd have to be more than a single use case to create a generalized solution for this problem -- else, we'd probably be better off just building something more tailored to this specific problem.
Hey @jonathan-innis
Yes and no. IP mode does resolve this specific issue, but the problem with IP mode is that NLB registration takes anywhere from 3-5 minutes, and that is expected behavior. This means that, with PodReadinessGates configured (so that pods don't go Ready until they're registered), we end up with extremely slow rollouts for our ingress deployments that have hundreds of pods, as each pod takes 3-5 minutes to become ready, since readiness is purely gated on when the NLB completes registration. I generally agree with your mental model of how to approach this. The robust solution is to have Karpenter determine the NLB that owns the node and confirm when deregistration completes; once it does, then remove the node. This would scale better if times fluctuated, etc. My only reason for mentioning the time-based solution is two-fold: ...
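For context, the IP-mode setup being weighed here typically pairs an ip target type with readiness-gate injection on the namespace. A minimal sketch, where names and ports are illustrative assumptions:

```yaml
# IP-mode sketch: the NLB registers pod IPs directly, and injected readiness gates keep a
# pod from going Ready until its target is registered. Names and ports are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: ingress
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled   # aws-load-balancer-controller injects PodReadinessGates
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nlb-ip               # hypothetical name
  namespace: ingress
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip   # register pod IPs, not instances
spec:
  type: LoadBalancer
  selector:
    app: ingress-gateway
  ports:
    - port: 443
      targetPort: 8443
```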
Just as an update for the rest of the users on this thread, we are moving forward with the DaemonSet approach. There are no user-data changes necessary. We're simply creating a DaemonSet that runs on all of our ingress nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: karpenter-termination-waiter
  template:
    metadata:
      labels:
        app: karpenter-termination-waiter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: service-group
                    operator: In
                    values:
                      - istio-igw
                      - istio-igw-pub-adserver
      containers:
        - command:
            - sleep
            - infinity
          image: alpine:latest
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '300'
          name: alpine
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: system-node-critical
      terminationGracePeriodSeconds: 300
      tolerations:
        - effect: NoSchedule
          key: service-group
          operator: Equal
          value: istio-igw
        - effect: NoSchedule
          key: service-group
          operator: Equal
          value: istio-igw-pub-adserver
        - effect: NoSchedule
          key: karpenter.sh/nodepool
          operator: Equal
          value: plat-infra-cpu-istio-igw
        - effect: NoSchedule
          key: karpenter.sh/nodepool
          operator: Equal
          value: plat-infra-cpu-istio-igw-pub-adserver
```

I've run this through lots of k6s tests where I'm sending lots of traffic through our ingress gateways while adding and removing pods/nodes, and I'm not dropping any connections. I can see the DaemonSet pod get terminated and begin its 5m wait (and watch the NLB finish deregistration before the node is removed by Karpenter). It's not the ideal solution, but it does appear to work.
@jonathan-innis just an update on the workaround suggested in this comment: the controller looked to be the cleanest/simplest option for us, but it doesn't appear to work. The node remains in k8s until our controller removes its finalizer, but Karpenter still terminates the EC2 instance as soon as it has drained it. So we're going to pivot to the DaemonSet approach from above. This is when using the controller/finalizer on the node:
With the DaemonSet approach:
And, as @awiesner4 says, no user-data change is required.
Description
Observed Behavior:
Instance is terminated immediately after draining.
Expected Behavior:
Add a configurable delay between drain and terminate to honor the ELB's connection draining interval.
It's disruptive behaviour if a node is deleted while still serving existing traffic when the ELB connection draining setting is enabled.
Reproduction Steps (Please include YAML):
When Karpenter deletes the node, it gets removed from the ELB and then terminated immediately.
Versions:
Kubernetes Version (kubectl version): 1.26