EC2 instance graceful restart #358
Is there a recommended way to initiate a graceful restart of EC2 instances from kured? For my particular use case, I'd like to de-register instances from ELBs before rebooting the OS, but I can't see a way to do this with the kured command line options.
Comments
Technically, the next release might help you (the main branch code can do it, but it's not yet released). What you're trying to achieve has never been publicly added to the project (and you might want to document it here, if you feel it's worth sharing). If someone has done it, they didn't share the recipe. Here is what I would do:
There may be better ways, but I don't know them, so I can't help more than that.
Thanks for your quick reply! Can you point me to the feature in the next release that might help? As a variant on your suggestion, I could also create a shell script and make it accessible to kured as an attached volume in the kubernetes manifest.
While it's probably possible, it's not the easiest. We currently wrap the rebootCommand to escape the container namespace, so by default the command is effectively your reboot command wrapped with `nsenter` so it runs on the host. Did you think of an init container, which would drop the necessary script on your hosts instead? That seems the easiest: it would be able to create the script easily, and kured could still trigger it easily.
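For illustration only, here is a rough sketch of the kind of script an init container could drop onto the host via a hostPath mount. The path, load balancer name, and drain wait are assumptions for this example, not anything kured ships:

```bash
#!/bin/sh
# Hypothetical /usr/local/bin/graceful-reboot.sh, copied onto the node by an
# init container. It de-registers this EC2 instance from a classic ELB, waits
# for connections to drain, then reboots the host.
set -eu

# Discover this instance's ID from the EC2 instance metadata service
# (IMDSv1 shown for brevity).
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# De-register from the load balancer (name is illustrative) and let it drain.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name my-elb \
  --instances "$INSTANCE_ID"
sleep 60

systemctl reboot
```

kured could then be pointed at that script (for example via its `--reboot-command` flag, if your version has it), with the caveat about host-side AWS credentials raised in the next comment.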
I hadn't thought of an init container, but that would work in terms of dropping the script on the host. One disadvantage I can think of is AWS permissions: if the script were to execute inside the container namespace, you could use a tool like kube2iam to set the AWS IAM role for execution, but I don't think that's an option on the host. Would you be open to a command line option to enable reboot-command execution in the container namespace (e.g. via a dedicated flag)?
I don't mind myself (though I would probably name that argument differently). We need to figure out whether it makes sense to also have the sentinel (check) in the same namespace or not (and therefore add another argument). Alternatively, we could simply expose the fact that we are wrapping the command with `nsenter`. I would love to discuss this in a next community meeting, tbh. In the meantime, feel free to propose a PR, and we can discuss it there :)
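For context, the namespace-escape wrapping being referred to looks roughly like the following; this is a sketch, and the exact invocation kured uses may differ:

```bash
# Run the configured reboot command inside the host's mount namespace from the
# privileged kured container (illustrative; kured's real wrapper may differ).
nsenter -m/proc/1/ns/mnt -- /bin/systemctl reboot
```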
Exposing the use of `nsenter` sounds reasonable to me. In addition, it would be useful to have the ability to run a script after the instance restarts (e.g. to add the instance back to its Auto Scaling group). Does it make sense to add this capability to kured, or does it make more sense to hack it into the init container script?
@amorey Exposing `nsenter` has its own drawbacks, though. As for a script after instance restart, that's effectively the role of your node's init system and/or kubernetes. I wouldn't do that in kured, myself.
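As a sketch of what such an init-system hook could look like outside of kured (the ASG name is illustrative, and it assumes the AWS CLI and suitable credentials are available on the host):

```bash
#!/bin/sh
# Hypothetical boot-time hook, run by the node's init system after a reboot,
# to attach the instance back to its Auto Scaling group.
set -eu
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws autoscaling attach-instances \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name my-asg
```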
All good questions. Let me know what you end up deciding. In the meantime, there should be workarounds for the execution context issues.
Makes sense. Thanks for the advice.
I just came across a kubernetes blog post from today about graceful node shutdown. If graceful node shutdown works as advertised then, in the future, kured should be able to offload node draining to kubernetes, and it might also take care of some of the issues I'm dealing with now. In the meantime, I can trigger ELB de-registration from within kubernetes by adding the "node.kubernetes.io/exclude-from-external-load-balancers" label to the node. Can you recommend a way to do this with kured? In addition, I need to set a grace period to wait until the ELB closes its connections. Is there an existing kured option I could use for that?
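For reference, doing that by hand looks roughly like this (the node name is illustrative):

```bash
# Exclude the node from external load balancers before the reboot...
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# ...and drop the label once the node is back, so it gets re-registered.
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers-
```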
@evrardjp I noticed that kured sets a node lock annotation which is removed after restart. If it were possible to set a node lock label that behaved in the same way then it would be possible to remove nodes from external load balancers gracefully before restart without having to use a custom script (using the "node.kubernetes.io/exclude-from-external-load-balancers" label). Would you be open to including this feature in kured?
Yes, I read this recently. Once kured supports only 1.21 and above, we can probably remove a bunch of code.
I saw your PR. It's awesome. I like it so far :) Let's discuss it there!
Previously, kured issued the system reboot command without first removing the node from any connected external load balancers (ELBs). This behavior caused downtime on restart because ELBs send traffic to kube-proxy pods running on nodes until the ELB health checks fail or the node is de-registered explicitly. This patch solves the problem by adding a "node.kubernetes.io/exclude-from-external-load-balancers" label to the node before restart, which tells the Kubernetes control plane to de-register the node from any connected ELBs. The node label is removed after restart, which causes the control plane to re-register the node with the ELBs. Close kubereboot#358
…ceful removal/addition from external load balancers

Previously, kured issued the system reboot command without first removing nodes from any connected external load balancers (ELBs). This behavior caused downtime on restart because ELBs send traffic to kube-proxy pods running on nodes until the ELB health checks fail or the node is de-registered explicitly. This patch solves the problem by adding a command line argument (`label-with-exclude-from-external-lbs`) that, when enabled, adds a "node.kubernetes.io/exclude-from-external-load-balancers" label to nodes undergoing kured reboot. This label tells the Kubernetes control plane to de-register the affected node from any connected ELBs. The node label is removed after restart, which causes the control plane to re-register the node with the ELBs. Close kubereboot#358
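Based on the PR description above, enabling the behaviour would look roughly like this; the new flag name is taken from the PR and may have been renamed (or not merged) in released versions:

```bash
# Hypothetical kured arguments; only --label-with-exclude-from-external-lbs is
# new here, the other flags are standard kured options.
kured \
  --period=1h \
  --reboot-sentinel=/var/run/reboot-required \
  --label-with-exclude-from-external-lbs
```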
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
/remove stale