Allow tainting Nodes even when blocker Pods exist #970

Closed
yaraskm opened this issue Aug 30, 2024 · 4 comments · May be fixed by #971

yaraskm commented Aug 30, 2024

I'm proposing an optimization that should help a cluster complete a full reboot cycle faster when blocking conditions are present.

Currently, when using both --blocking-pod-selector and --prefer-no-schedule-taint, the logic is as follows:

var rebootRequiredBlockCondition string
if rebootBlocked(blockCheckers...) {
	rebootRequiredBlockCondition = ", but blocked at this time"
	continue
}
log.Infof("Reboot required%s", rebootRequiredBlockCondition)

if !holding(lock, &nodeMeta, concurrency > 1) && !acquire(lock, &nodeMeta, TTL, concurrency) {
	// Prefer to not schedule pods onto this node to avoid draining the same pod multiple times.
	preferNoScheduleTaint.Enable()
	continue
}

By this point we know that the Node requires a reboot, but if a blocker exists (e.g. a Prometheus alert is firing, or Pods matching the blocking selector are running on the node), the main loop simply goes back to sleep and waits for the next tick without tainting the node. The problem is that more blocker Pods can be scheduled onto the node while we wait for the next tick, so it can take a very long time before the Node happens to be free of blocker Pods and can finally be rebooted.

My proposal is to add a flag so that Nodes are tainted with PreferNoSchedule as soon as they are detected as requiring a reboot, after which the block checkers continue as normal. That way there is a good chance that blocker Pods will be scheduled onto other nodes instead, as long as the scheduler can accommodate them elsewhere. Once all of the blocking conditions have cleared, the Node reboots as normal.
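
As a rough sketch (not the actual diff in #971), the reordering could look like the following; the taintWhenBlocked variable is only a hypothetical name standing in for the proposed flag:

if taintWhenBlocked {
	// Hypothetical: apply the PreferNoSchedule taint as soon as a reboot is
	// required, so new blocker Pods are steered towards other nodes while we wait.
	preferNoScheduleTaint.Enable()
}

var rebootRequiredBlockCondition string
if rebootBlocked(blockCheckers...) {
	rebootRequiredBlockCondition = ", but blocked at this time"
	continue
}
log.Infof("Reboot required%s", rebootRequiredBlockCondition)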

I'm happy to submit a PR to create this flag if it's agreed to!

ckotzbauer (Member) commented

Hi @yaraskm,
there's already a --prefer-no-schedule-taint flag in kured that does exactly what you describe.

yaraskm commented Sep 3, 2024

Hi @ckotzbauer ,

I don't think this should be closed. I'm already using the --prefer-no-schedule-taint flag, as you mentioned. The point I was trying to raise is that when --prefer-no-schedule-taint and --blocking-pod-selector are used together and some condition blocks the reboot, the nodes never get tainted.

What I'm proposing is either:

  • Change the default behaviour so that nodes always get tainted, regardless of whether reboot blockers exist
  • Add a new flag so the taint is applied regardless of blocking conditions (sketched below)
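
For the second option, here is a minimal sketch of the flag wiring, assuming kured's existing cobra-based flag registration; the flag name --taint-when-blocked and the taintWhenBlocked variable are hypothetical and not taken from #971:

rootCmd.PersistentFlags().BoolVar(&taintWhenBlocked, "taint-when-blocked", false,
	"apply the PreferNoSchedule taint even while reboot blockers are present")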

yaraskm commented Sep 3, 2024

I've created a draft PR for the behaviour change I'm suggesting: #971

yaraskm commented Sep 16, 2024

I've been running my PR in our clusters for the past week and it's working as planned. Now, nodes get tainted when they need a reboot, regardless of whether there are blocking Pods or not.
