
Add a way to force reboot a node #110

Open
SerialVelocity opened this issue Feb 14, 2020 · 17 comments
Labels
enhancement · flexible-reboot-command · keep (This won't be closed by the stale bot.)

Comments

@SerialVelocity

Sometimes reboots need to be forced on a node. It would be nice if there were a way to force the node to restart outside of start-time/end-time, blocking-pod-selector, etc.

Maybe if /var/run/reboot-required contains the text force?
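A minimal sketch of what checking such a convention could look like, in Python for illustration. The `force` keyword and the three-way result are just this proposal, not an existing kured feature; only the sentinel path /var/run/reboot-required is kured's usual convention.

```python
from pathlib import Path

def reboot_request(sentinel: str = "/var/run/reboot-required") -> str:
    """Hypothetical sentinel check following the proposal above:
    a plain sentinel file requests a normal reboot, while one
    containing the word "force" would bypass the usual restrictions.
    Returns "none", "normal", or "forced"."""
    path = Path(sentinel)
    if not path.is_file():
        return "none"          # no reboot requested
    if "force" in path.read_text():
        return "forced"        # proposed: bypass windows/blockers
    return "normal"            # ordinary kured reboot request
```

For example, `touch /var/run/reboot-required` would yield `"normal"`, while writing `force` into the file would yield `"forced"`.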

@SerialVelocity changed the title from "Add a force-reboot endpoint" to "Add a way to force reboot a node" on Feb 14, 2020
@github-actions

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@SerialVelocity
Author

/reopen

This is still a valid feature request

@github-actions

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@evrardjp
Collaborator

I suppose this could be done with a more flexible reboot command.

@evrardjp
Collaborator

This is implemented now, and will be released with the next kured image (in the meantime, you can build your own container from our main branch if you prefer).

@SerialVelocity
Author

@evrardjp Do you mind pointing me to the PR that fixes this issue? I see PRs for changing the command, but this issue is about forcing a reboot even when we are outside the configured time range, there are k8s warnings, etc.

@evrardjp
Collaborator

I am sorry, I missed the part where it was also outside the maintenance window. If it's inside the maintenance window, we did indeed implement a forced reboot, regardless of any drain/cordon failures.

What you are looking for is not implemented, sorry.

Could you clarify why you would want to force a reboot outside the maintenance window? Maybe there is an alternative: you set up your maintenance window to be always open, but set up a blocker for the times when you don't want to reboot?

Alternatively, if you want to force the reboot regardless of k8s success/failure, and you needed to write to /var/run/reboot-required, I suppose you would connect to the host anyway. Why not trigger the reboot there directly, then?

@SerialVelocity
Author

One reason is mentioned in the PR description of #21

Personally, I would use it mainly in the case where my host stops being able to schedule new pods. That happens if the kubelet starts OOM'ing because of misconfiguration, or if the host gets temporarily blacklisted from Ceph so all the pods need a full restart (hard to do without rebooting the host, as you can't unmount the existing volumes because they hang), etc.

@evrardjp
Collaborator

So, in those cases, you don't even need to drain/cordon, right? (While you can do something via the API, if kubelet is dead, it's kind of pointless.) We don't have code for that (yet?). If you are still okay with trying to drain/cordon first, I think the forceReboot we implemented is good enough: if the drain fails, it ignores the failure and reboots anyway. A node that is rebooting isn't scheduling new pods.

@evrardjp
Collaborator

If you don't think it's right, don't hesitate to reopen this.

@SerialVelocity
Author

If the failures are happening because the kubelet is OOM'ing, it might not be able to drain. For us, it would fail when kubelet tried to read all of the disk stats simultaneously, which pushed us over the limit, so things like draining still worked for a while after each OOM.

I don't think force reboot fixes this, right? The issue is that you can't force kured to start the reboot process of draining, etc. I think part of the confusion comes from this being called "force reboot" (this issue was titled before there was a feature called "force reboot", which is not the same thing).

The feature request is to add a way to bypass the checks you do for whether the node can be rebooted, i.e. if the host is having severe problems, you want to reboot even if it is outside the maintenance window or there are Prometheus alerts firing.

Also, I can't reopen this, I don't have permission to.

@evrardjp reopened this Apr 21, 2021
@evrardjp
Collaborator

evrardjp commented Apr 21, 2021

forceReboot ignores drain errors, so yeah, that would (maybe) work. I am saying maybe because the OOM killer might also want to kill kured, and there is nothing we can do if both kubelet and kured are killed. If it doesn't kill kured, then I don't see why it wouldn't work: the drain wouldn't stop on error, and the reboot would continue on its way. But all of that only applies during the maintenance window, which is why I suggested having a large maintenance window (= always happy to reboot) plus a way to block via Prometheus when you aren't ready to reboot.

You might also be interested in some new design here #359

@SerialVelocity
Author

But all of that only applies during the maintenance window, which is why I suggested having a large maintenance window (= always happy to reboot) plus a way to block via Prometheus when you aren't ready to reboot.

Yes, this is what this issue is about. Adding a way to bypass the maintenance window when certain criteria are met.

It seems like the rewrite for #359 could also include slightly more complex rule-evaluation logic (or at least be structured in a way that allows it to be implemented later), e.g.:

type: OR
operands:
- type: file-contents
  file: /var/run/reboot-required
  contents: force
- type: AND
  operands:
  - type: file-exists
    file: /var/run/reboot-required
  - type: prometheus
    metric: my-metric
    value: 0
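A rough sketch of how a rule tree like the above could be evaluated, in Python for illustration, assuming the node types from the YAML (OR/AND, file-exists, file-contents, prometheus). The Prometheus lookup is stubbed out as a callback, since querying a real server is beyond the sketch; none of this is existing kured code.

```python
import os

def evaluate(rule: dict, query_metric=None) -> bool:
    """Recursively evaluate a hypothetical reboot-rule tree like the
    YAML above. `query_metric` is a stand-in for a real Prometheus
    query: it maps a metric name to its current value."""
    kind = rule["type"]
    if kind == "OR":
        return any(evaluate(r, query_metric) for r in rule["operands"])
    if kind == "AND":
        return all(evaluate(r, query_metric) for r in rule["operands"])
    if kind == "file-exists":
        return os.path.isfile(rule["file"])
    if kind == "file-contents":
        try:
            with open(rule["file"]) as f:
                return rule["contents"] in f.read()
        except OSError:
            return False  # missing/unreadable file: condition not met
    if kind == "prometheus":
        return query_metric(rule["metric"]) == rule["value"]
    raise ValueError(f"unknown rule type: {kind}")
```

With the YAML above loaded into a dict, a sentinel file containing `force` would satisfy the first OR branch on its own, while a plain sentinel file would additionally need the stubbed `my-metric` to equal 0.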

@rptaylor

rptaylor commented Jul 27, 2021

Probably https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ is a better way to do this.
If you need to reboot a node due to some condition on the node, unrelated to routine updates and outside of the kured maintenance window, IMHO that seems outside the scope of kured, and you should just go ahead and reboot it yourself. If you configure the graceful node shutdown timeout, which is now built into kubelet, it will take care of draining and gracefully terminating pods, so perhaps that can solve the problem using native k8s functionality instead of kured.

@SerialVelocity
Author

To do that, the script would have to implement locking like kured's, to make sure that two nodes aren't shut down at the same time.
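To illustrate the point: kured serializes reboots through a cluster-wide lock, and a standalone script would need equivalent semantics. Below is a toy in-memory model of that acquire/release behaviour; the dict stands in for a Kubernetes API object, the annotation key is made up, and a real implementation would also need an atomic, conflict-detecting update (e.g. a resourceVersion-guarded write) rather than a plain assignment.

```python
class RebootLock:
    """Toy model of the locking a reboot script would need: at most
    one node may hold the lock at a time. In kured the lock lives in
    an annotation on its own DaemonSet; here a plain dict stands in
    for the Kubernetes object, so the compare-and-set is not atomic
    the way a real apiserver update would be."""

    KEY = "example.com/reboot-lock"  # hypothetical annotation key

    def __init__(self):
        self.annotations = {}

    def acquire(self, node: str) -> bool:
        holder = self.annotations.get(self.KEY)
        if holder is not None and holder != node:
            return False  # another node is already rebooting
        self.annotations[self.KEY] = node  # idempotent re-acquire
        return True

    def release(self, node: str) -> None:
        # Only the current holder may release the lock.
        if self.annotations.get(self.KEY) == node:
            del self.annotations[self.KEY]
```

A node would acquire the lock, reboot, and release it on the next boot; any other node seeing `acquire` fail simply waits and retries later.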

@evrardjp
Collaborator

evrardjp commented Aug 2, 2021

I think this paves the way for a new "kured" :)

@github-actions

github-actions bot commented Oct 2, 2021

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@dholbach added the keep label (This won't be closed by the stale bot.) and removed the no-issue-activity label on Oct 2, 2021