Allow tainting Nodes even when blocker Pods exist #970

Closed
yaraskm opened this issue Aug 30, 2024 · 4 comments · May be fixed by #971

yaraskm commented Aug 30, 2024

I'm proposing an optimization that should help a cluster complete a full reboot cycle faster when blocking conditions are present.

Currently, when using both --blocking-pod-selector and --prefer-no-schedule-taint, the logic is as follows:

var rebootRequiredBlockCondition string
if rebootBlocked(blockCheckers...) {
	rebootRequiredBlockCondition = ", but blocked at this time"
	continue
}
log.Infof("Reboot required%s", rebootRequiredBlockCondition)

if !holding(lock, &nodeMeta, concurrency > 1) && !acquire(lock, &nodeMeta, TTL, concurrency) {
	// Prefer to not schedule pods onto this node to avoid draining the same pod multiple times.
	preferNoScheduleTaint.Enable()
	continue
}

By this point we know that the Node requires a reboot, but if a blocker exists (e.g. a Prometheus alert is firing, or Pods matching the blocking selector are running on the node), the main loop simply goes back to sleep and waits for the next tick without tainting the node. The problem is that more blocker Pods can be scheduled onto the node while we wait for the next tick, so it can take a very long time before the Node happens to be free of blocker Pods and can finally be rebooted.

My proposal is to add a flag so that Nodes are tainted with PreferNoSchedule as soon as they are detected as requiring a reboot, after which the block checkers continue as normal. That way there is a good chance that blocker Pods will be scheduled onto other nodes instead, as long as the scheduler can accommodate them elsewhere. Once all of the blocking conditions have cleared, the Node reboots as normal.
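
As a rough sketch (not the actual diff in #971), the reordering could look like the following; the taintWhenBlocked variable is only a hypothetical name standing in for the proposed flag:

if taintWhenBlocked {
	// Hypothetical: apply the PreferNoSchedule taint as soon as a reboot is
	// required, so new blocker Pods are steered towards other nodes while we wait.
	preferNoScheduleTaint.Enable()
}

var rebootRequiredBlockCondition string
if rebootBlocked(blockCheckers...) {
	rebootRequiredBlockCondition = ", but blocked at this time"
	continue
}
log.Infof("Reboot required%s", rebootRequiredBlockCondition)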

I'm happy to submit a PR to create this flag if it's agreed to!

ckotzbauer (Member) commented

Hi @yaraskm,
there's already a --prefer-no-schedule-taint flag in kured that does exactly what you describe.

yaraskm commented Sep 3, 2024

Hi @ckotzbauer ,

I don't think this should be closed. I'm already using the --prefer-no-schedule-taint flag, as you mentioned. The point I was trying to raise is that when --prefer-no-schedule-taint and --blocking-pod-selector are used together and some condition blocks the reboot, the nodes never get tainted.

What I'm proposing is either:

  • Change the default behaviour so that nodes always get tainted, regardless of whether reboot blockers exist
  • Add a new flag so the taint is applied regardless of blocking conditions (sketched below)
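
For the second option, here is a minimal sketch of the flag wiring, assuming kured's existing cobra-based flag registration; the flag name --taint-when-blocked and the taintWhenBlocked variable are hypothetical and not taken from #971:

rootCmd.PersistentFlags().BoolVar(&taintWhenBlocked, "taint-when-blocked", false,
	"apply the PreferNoSchedule taint even while reboot blockers are present")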

yaraskm commented Sep 3, 2024

I've created a draft PR for the behaviour change I'm suggesting: #971

yaraskm commented Sep 16, 2024

I've been running my PR in our clusters for the past week and it's working as planned. Now, nodes get tainted when they need a reboot, regardless of whether there are blocking Pods or not.
