Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

970: Taint nodes even if reboot is currently blocked #971

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yaraskm
Copy link

@yaraskm yaraskm commented Sep 3, 2024

  • If using the --prefer-no-schedule-taint flag, apply the taint even if the reboot is currently blocked. This will make it less likely for Pods to get scheduled to nodes that are pending a reboot, once the blocking Pods / alerts are cleared.

Resolves #970

- If using the --prefer-no-schedule-taint flag, apply the taint even if
  the reboot is currently blocked. This will make it less likley for
Pods to get scheduled to nodes that are pending a reboot, once the
blocking Pods / alerts are cleared.

Signed-off-by: Matt Yaraskavitch <[email protected]>
@yaraskm yaraskm marked this pull request as ready for review September 6, 2024 14:00
@@ -761,16 +761,24 @@ func rebootAsRequired(nodeID string, booter reboot.Reboot, sentinelCommand []str
blockCheckers = append(blockCheckers, KubernetesBlockingChecker{client: client, nodename: nodeID, filter: podSelectors})
}

rebootBlocked := rebootBlocked(blockCheckers...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder is the easiest and most correct way to improve this is to move right after L32 above, immediately after we determine that the node will need to be deleted.

As far as I can see there isn't any harm in always adding this "prefer no schedule" taint. Right now we're only doing it if we aren't able to rapidly progress to the drain operation, but why not do it no matter what?

This would solve for your current problem as well.

Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, I'll make the change later today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jackfrancis , which line is it you're referring to? I don't think L32 is correct.

I've been able to successfully run with this PR in prod for the last few weeks now and it definitely seems to have lowered the average time for a node to be blocked. I'm definitely open to adding the taint all of the time, regardless of whether we're in an acceptable reboot window or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackfrancis do you truly believe that adding the prefer no schedule taint can be done in all cases?
Wouldn't it make it a bit moot to have all nodes with prefernoschedule after a security update (for which all nodes will get the taint)? When kubernetes will schedule (all the nodes having the prefernoschedule) there will be no distinction anymore... Am I missing something?

@evrardjp
Copy link
Collaborator

evrardjp commented Oct 14, 2024

I am not convinced this is a view everyone will have.

I believe some ppl will prefer adding the preferNoSchedule taint, while some others will consider that the taint shouldn't be liberally applied on all nodes including the ones that can't reboot.

I don't like the fact to change the rebootasrequired loop through a variable. It makes it very hard to understand the loop in terms of maintenance in the long run. On top of that, most of the reasons ppl would have adjusted the rebootasrequired over time was for personal preference: "In my case, I would like to do x before y". It all made sense for some ppl, while it didn't make sense for some others.

I truely belive your case makes sense. Yet, I am not certain that it's the case for everyone.

Two choices:

  • I will contact a few ppl, see what they think.
  • We start to work on a change that makes the rebootasrequired loop more flexible. (I am already working on that btw).

@yaraskm
Copy link
Author

yaraskm commented Oct 15, 2024

I truely belive your case makes sense. Yet, I am not certain that it's the case for everyone.

I definitely agree with you here, and making it a configurable behaviour change makes the most sense to me as well. I'll wait for your feedback on the discussions and am happy to chip in on implementation changes if needed.

Copy link

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@nikokirilenko
Copy link

I truely belive your case makes sense. Yet, I am not certain that it's the case for everyone.

We are exploring using kured in our k8s environment which is hosting a Jenkins infrastructure, meaning the Jenkins controller spins up ephemeral pods acting as Jenkins agent which run various CI pipelines.
We can't evict those Jenkins pods as killing off the agent pod basically kills the Jenkins build.
For our use-case we would like to:

  1. Taint the node to prevent any further Jenkins pods scheduling on it. (--prefer-no-schedule-taint=weave.works/kured-node-reboot)
  2. Wait for all Jenkins pods to finish their workload. (--blocking-pod-selector=jenkins=agent)
  3. Continue with reboot etc...

This PR will make this use-case feasible and will be much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow tainting Nodes even when blocker Pods exist
4 participants