
Add a way to force reboot a node #110

Open
SerialVelocity opened this issue Feb 14, 2020 · 17 comments
Labels
enhancement · flexible-reboot-command · keep (This won't be closed by the stale bot.)

Comments

@SerialVelocity

Sometimes reboots need to be forced on a node. It would be nice if there were a way to force the node to restart outside of start-time/end-time, blocking-pod-selector, etc.

Maybe if /var/run/reboot-required contains the text force?
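A minimal sketch of what checking such a convention could look like, in Python for illustration. The `force` keyword and the three-way result are just this proposal, not an existing kured feature; only the sentinel path /var/run/reboot-required is kured's usual convention.

```python
from pathlib import Path

def reboot_request(sentinel: str = "/var/run/reboot-required") -> str:
    """Hypothetical sentinel check following the proposal above:
    a plain sentinel file requests a normal reboot, while one
    containing the word "force" would bypass the usual restrictions.
    Returns "none", "normal", or "forced"."""
    path = Path(sentinel)
    if not path.is_file():
        return "none"          # no reboot requested
    if "force" in path.read_text():
        return "forced"        # proposed: bypass windows/blockers
    return "normal"            # ordinary kured reboot request
```

For example, `touch /var/run/reboot-required` would yield `"normal"`, while writing `force` into the file would yield `"forced"`.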

@SerialVelocity changed the title from "Add a force-reboot endpoint" to "Add a way to force reboot a node" on Feb 14, 2020
@github-actions

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@SerialVelocity
Author

/reopen

This is still a valid feature request

@github-actions

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@evrardjp
Collaborator

I suppose this could be done with a more flexible reboot command.

@evrardjp
Collaborator

This is implemented now, and will be released with the next kured image (in the meantime, you can build your own container from our main branch if you prefer).

@SerialVelocity
Author

@evrardjp Do you mind pointing me to the PR that fixes this issue? I see PRs for changing the command, but this issue is about forcing a reboot even when we are outside the configured time range, there are k8s warnings, etc.

@evrardjp
Collaborator

I am sorry, I missed the part where it was also outside the maintenance window. If it's inside the maintenance window, we did indeed implement a forced reboot, regardless of any drain/cordon failures.

What you are looking for is not implemented, sorry.

Could you clarify why you would want to force a reboot outside the maintenance window? Maybe there is an alternative: you set up your maintenance window to be always open, but set up a blocker for the times when you don't want to reboot?

Alternatively, if you want to force the reboot regardless of k8s success/failure, and you needed to write to /var/run/reboot-required, I suppose you would connect to the host anyway. Why not trigger the reboot there directly, then?

@SerialVelocity
Author

One reason is mentioned in the PR description of #21

Personally, I would use it mainly in the case where my host stops being able to schedule new pods. That happens if the kubelet starts OOM'ing because of misconfiguration, or if the host gets temporarily blacklisted from Ceph so all the pods need a full restart (hard to do without rebooting the host, as you can't unmount the existing volumes because they hang), etc.

@evrardjp
Collaborator

So, in those cases, you don't even need to drain/cordon, right? (While you can do something via the API, if kubelet is dead, it's kind of pointless.) We don't have code for that (yet?). If you are still okay with trying to drain/cordon first, I think the forceReboot we implemented is good enough: if the drain fails, it ignores the failure and reboots anyway. A node that is rebooting isn't scheduling new pods.

@evrardjp
Collaborator

If you don't think it's right, don't hesitate to reopen this.

@SerialVelocity
Author

If the failures are happening because the kubelet is OOM'ing, it might not be able to drain. For us, it would fail when kubelet tried to read all of the disk stats simultaneously, which pushed us over the limit, so things like draining still worked for a while after each OOM.

I don't think force reboot fixes this, right? The issue is that you can't force kured to start the reboot process of draining, etc. I think part of the confusion comes from this being called "force reboot" (this issue was titled before there was a feature called "force reboot", which is not the same thing).

The feature request is to add a way to bypass the checks you do for whether the node can be rebooted, i.e. if the host is having severe problems, you want to reboot even if it is outside the maintenance window or there are Prometheus alerts firing.

Also, I can't reopen this, I don't have permission to.

@evrardjp reopened this Apr 21, 2021
@evrardjp
Collaborator

evrardjp commented Apr 21, 2021

forceReboot ignores drain errors, so yeah, that would (maybe) work. I am saying maybe because the OOM killer might also want to kill kured, and there is nothing we can do if both kubelet and kured are killed. If it doesn't kill kured, then I don't see why it wouldn't work: the drain wouldn't stop on error, and the reboot would continue on its way. But all of that only applies during the maintenance window, which is why I suggested having a large maintenance window (= always happy to reboot) plus a way to block via Prometheus when you aren't ready to reboot.

You might also be interested in some new design here #359

@SerialVelocity
Author

But all of that only applies during the maintenance window, which is why I suggested having a large maintenance window (= always happy to reboot) plus a way to block via Prometheus when you aren't ready to reboot.

Yes, this is what this issue is about. Adding a way to bypass the maintenance window when certain criteria are met.

It seems like the rewrite for #359 could also include slightly more complex rule-evaluation logic (or at least be structured in a way that allows it to be implemented later), e.g.:

type: OR
operands:
- type: file-contents
  file: /var/run/reboot-required
  contents: force
- type: AND
  operands:
  - type: file-exists
    file: /var/run/reboot-required
  - type: prometheus
    metric: my-metric
    value: 0
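A rough sketch of how a rule tree like the above could be evaluated, in Python for illustration, assuming the node types from the YAML (OR/AND, file-exists, file-contents, prometheus). The Prometheus lookup is stubbed out as a callback, since querying a real server is beyond the sketch; none of this is existing kured code.

```python
import os

def evaluate(rule: dict, query_metric=None) -> bool:
    """Recursively evaluate a hypothetical reboot-rule tree like the
    YAML above. `query_metric` is a stand-in for a real Prometheus
    query: it maps a metric name to its current value."""
    kind = rule["type"]
    if kind == "OR":
        return any(evaluate(r, query_metric) for r in rule["operands"])
    if kind == "AND":
        return all(evaluate(r, query_metric) for r in rule["operands"])
    if kind == "file-exists":
        return os.path.isfile(rule["file"])
    if kind == "file-contents":
        try:
            with open(rule["file"]) as f:
                return rule["contents"] in f.read()
        except OSError:
            return False  # missing/unreadable file: condition not met
    if kind == "prometheus":
        return query_metric(rule["metric"]) == rule["value"]
    raise ValueError(f"unknown rule type: {kind}")
```

With the YAML above loaded into a dict, a sentinel file containing `force` would satisfy the first OR branch on its own, while a plain sentinel file would additionally need the stubbed `my-metric` to equal 0.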

@rptaylor

rptaylor commented Jul 27, 2021

Probably https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ is a better way to do this.
If you need to reboot a node due to some condition on the node, unrelated to routine updates and outside of the kured maintenance window, IMHO that seems outside the scope of kured, and you should just go ahead and reboot it yourself. If you configure the graceful node shutdown timeout, which is now built into kubelet, it will take care of draining and gracefully terminating pods, so perhaps that can solve the problem using native k8s functionality instead of kured.

@SerialVelocity
Author

To do that, the script would have to implement locking like kured's, to make sure that two nodes aren't shut down at the same time.
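To illustrate the point: kured serializes reboots through a cluster-wide lock, and a standalone script would need equivalent semantics. Below is a toy in-memory model of that acquire/release behaviour; the dict stands in for a Kubernetes API object, the annotation key is made up, and a real implementation would also need an atomic, conflict-detecting update (e.g. a resourceVersion-guarded write) rather than a plain assignment.

```python
class RebootLock:
    """Toy model of the locking a reboot script would need: at most
    one node may hold the lock at a time. In kured the lock lives in
    an annotation on its own DaemonSet; here a plain dict stands in
    for the Kubernetes object, so the compare-and-set is not atomic
    the way a real apiserver update would be."""

    KEY = "example.com/reboot-lock"  # hypothetical annotation key

    def __init__(self):
        self.annotations = {}

    def acquire(self, node: str) -> bool:
        holder = self.annotations.get(self.KEY)
        if holder is not None and holder != node:
            return False  # another node is already rebooting
        self.annotations[self.KEY] = node  # idempotent re-acquire
        return True

    def release(self, node: str) -> None:
        # Only the current holder may release the lock.
        if self.annotations.get(self.KEY) == node:
            del self.annotations[self.KEY]
```

A node would acquire the lock, reboot, and release it on the next boot; any other node seeing `acquire` fail simply waits and retries later.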

@evrardjp
Collaborator

evrardjp commented Aug 2, 2021

I think this paves the way for a new "kured" :)

@github-actions

github-actions bot commented Oct 2, 2021

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@dholbach added the keep label (This won't be closed by the stale bot.) and removed the no-issue-activity label on Oct 2, 2021