
feature: cluster health init-container similar to how prepare is leveraged for k3s-upgrade #169

Open
dweomer opened this issue Nov 24, 2021 · 2 comments
Labels
enhancement New feature or request

Comments


dweomer commented Nov 24, 2021

Is your feature request related to a problem? Please describe.
Some upgrade use-cases require that the cluster "be healthy" before incurring the disruption of a node upgrade. It would be nice to configure a Plan so that the cluster has had time to settle before it continues with the next node. This could be achieved by some sort of health measurement, e.g. ensuring that all ReplicaSets and DaemonSets have a minimum number of pods running.

Describe the solution you'd like
A parameter or two on the Plan spec indicating that some health measurement should pass before commencing with node upgrade(s), and which pre-canned strategy to use for making that determination. The mere presence of a strategy choice other than "none" might be enough (so, one parameter).
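For illustration, a hypothetical Plan might look like the sketch below. The `healthStrategy` field does not exist in the system-upgrade-controller today; it is purely an assumed name for the proposed parameter, and the strategy value is invented for the example:

```yaml
# Sketch only: healthStrategy is NOT part of the current
# upgrade.cattle.io/v1 Plan spec; it illustrates the proposal.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/master, operator: Exists}
  # Proposed (hypothetical): require cluster health to settle before
  # each node upgrade. "workloads-ready" might mean all ReplicaSets
  # and DaemonSets meet minimum availability; "none" (the default)
  # would preserve current behavior.
  healthStrategy: workloads-ready
  upgrade:
    image: rancher/k3s-upgrade
```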

Describe alternatives you've considered
Relying on the eviction algorithm that respects pod disruption budgets (i.e. NOT specifying .spec.disableEviction) will likely not be adequate for all upgrade needs, because eviction can hang indefinitely in resource-constrained clusters. Because of this we must assume that some disruptions can and will happen when upgrade plans are applied. Is this enough to warrant new logic in the controller? 🤷

Additional context

@dweomer dweomer added the enhancement New feature or request label Nov 24, 2021

psy-q commented Mar 28, 2023

We could use a lightweight version of this where we can at least specify a delay between node upgrades so that we avoid having multiple nodes rebooting before a StatefulSet with 3 pods is ready again. Is there a delay option already that we missed, e.g. 30 minutes between node upgrades?

The issue we have is that this workload is tightly coupled to specific nodes, so if SUC just goes ahead and reboots one after the other, even if the pods could be rescheduled to another node to meet their PDB, they won't be because they need to be scheduled on the exact same node again.

As it takes 15-20 minutes for a pod to become ready and reconnect to its cluster friends, SUC has cheerfully rebooted all three nodes by that time, destroying the application's clustering mode. It can't deal with more than one cluster member being unavailable at any one time.


dweomer commented Apr 10, 2023

> We could use a lightweight version of this where we can at least specify a delay between node upgrades so that we avoid having multiple nodes rebooting before a StatefulSet with 3 pods is ready again. Is there a delay option already that we missed, e.g. 30 minutes between node upgrades?
>
> The issue we have is that this workload is tightly coupled to specific nodes, so if SUC just goes ahead and reboots one after the other, even if the pods could be rescheduled to another node to meet their PDB, they won't be because they need to be scheduled on the exact same node again.
>
> As it takes 15-20 minutes for a pod to become ready and reconnect to its cluster friends, SUC has cheerfully rebooted all three nodes by that time, destroying the application's clustering mode. It can't deal with more than one cluster member being unavailable at any one time.

IIRC, SUC will honor an existing PDB if one exists.
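For reference, such a budget is an ordinary Kubernetes PodDisruptionBudget; a minimal sketch (names and labels are placeholders) that keeps voluntary evictions, including those issued during a drain, from taking down more than one member of a three-replica StatefulSet at a time:

```yaml
# Minimal sketch: allows at most one pod matching the selector to be
# unavailable due to voluntary eviction. Names/labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-clustered-app
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-clustered-app
```

Note that, as described above, a PDB only helps if the evicted pods can actually become ready again; pods pinned to a specific node can still defeat it.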
