CAPA control plane machine health checks #1978
Comments
my 2 cents: contrary to my initial concerns about having
The only "critical" path I still see is the fact that if an
Kudos to Erkan for creating this great list of scenarios:
Example scenarios (tested in "Add MachineHealthCheck for all nodes in Openstack"):
- Setup: 1 Control Plane + 3 Infra
- Setup: 1 Control Plane + 3 Infra
- Setup: 3 Control Plane + 3 Infra
- Setup: 3 Control Plane + 3 Infra
- Setup: 3 Control Plane + 3 Infra
- Setup: 3 Control Plane + 3 Infra
Open issues for other providers:
To work around this issue, we disabled the machine health checks while fixing the cluster manually and re-enabled them afterwards.
See https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/remove-errors-from-capi-capo-crs/
The biggest issue we had in CAPO was a bug in the CAPO controller that added transient errors to the infra CRs, which were then propagated to the CAPI CRs too. Because of the maxUnhealthy limit, we had to delete those errors manually to fix the clusters. See https://github.com/giantswarm/giantswarm/issues/22443
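To illustrate why the maxUnhealthy limit forces manual cleanup, here is a minimal sketch of the budget check. The function name and arguments are hypothetical, and rounding percentages down is an assumption about how the limit is evaluated rather than something stated in this thread.

```python
def remediation_allowed(max_unhealthy, total_machines, unhealthy_machines):
    """Return True if the unhealthy count is within the maxUnhealthy budget.

    max_unhealthy is either an absolute number (e.g. 1) or a percentage
    string (e.g. "40%"), mirroring the two forms the field accepts.
    """
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        # Assumption: the percentage is scaled against the number of
        # selected machines and rounded down.
        budget = (percent * total_machines) // 100
    else:
        budget = int(max_unhealthy)
    return unhealthy_machines <= budget


if __name__ == "__main__":
    # 3 control plane machines with a 40% budget: floor(1.2) = 1 machine.
    print(remediation_allowed("40%", 3, 1))
    # A transient error on a second machine exceeds the budget, so
    # automatic remediation stops until the errors are cleared.
    print(remediation_allowed("40%", 3, 2))
```

Under these assumptions, a transient error that marks several machines unhealthy at once pushes the cluster over the budget, which is consistent with the manual cleanup described above.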
@calvix @erkanerol @bavarianbidi do you think having two different
It depends on whether you want to define different
For CAPVCD, I added for only worker nodes. I want to observe the experience for a while.
As control plane nodes are in many cases a bit "different", I would propose doing two different
So in the end, this is only applied to the control plane nodes, as stated in the issue title, because support for machine pools indeed does not exist. I also added a filter to avoid killing the bastion node.
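The actual filter is not shown in this thread. As a rough sketch of how such a filter could work, the snippet below evaluates a Kubernetes-style label selector against machine labels. The `cluster.x-k8s.io/control-plane` label is the standard CAPI control plane label; the `example.io/role` key and the `bastion` value are hypothetical placeholders.

```python
def matches(selector, labels):
    """Evaluate a label selector (matchLabels + matchExpressions) against labels."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            if key not in labels or labels[key] not in values:
                return False
        elif op == "NotIn":
            # NotIn also matches objects that do not carry the key at all.
            if key in labels and labels[key] in values:
                return False
        elif op == "Exists":
            if key not in labels:
                return False
        elif op == "DoesNotExist":
            if key in labels:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


# Hypothetical selector: control plane machines only, never the bastion.
SELECTOR = {
    "matchExpressions": [
        {"key": "cluster.x-k8s.io/control-plane", "operator": "Exists"},
        {"key": "example.io/role", "operator": "NotIn", "values": ["bastion"]},
    ]
}

if __name__ == "__main__":
    print(matches(SELECTOR, {"cluster.x-k8s.io/control-plane": ""}))
    print(matches(SELECTOR, {"example.io/role": "bastion"}))
```

Excluding the bastion explicitly is a belt-and-braces choice: it keeps the bastion out of remediation even if it ends up carrying a label that the selector would otherwise match.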
We'll update the following MCs:
done
Story
- As a cluster admin, I want the control plane nodes to be recreated if basic machine health checks fail in order to improve the stability of CAPA clusters.
Towards epic.
Background
CAPA clusters with control plane machines in invalid states fail to recover because there are no health checks configured.
CAPI supports machine health checks for control plane nodes and they need to be configured for CAPA clusters.
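As a rough illustration of what configuring such a check could look like, the sketch below builds a `MachineHealthCheck` manifest for control plane machines and prints it as JSON, which `kubectl apply -f -` accepts. The cluster name, namespace, timeouts, and `maxUnhealthy` value are illustrative assumptions and not the values used for the CAPA clusters discussed here; the `v1beta1` API version is also an assumption.

```python
import json


def build_control_plane_mhc(cluster_name, namespace, max_unhealthy="40%", timeout="300s"):
    """Build a MachineHealthCheck manifest targeting control plane machines."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",  # assumed API version
        "kind": "MachineHealthCheck",
        "metadata": {
            "name": f"{cluster_name}-control-plane",
            "namespace": namespace,
        },
        "spec": {
            "clusterName": cluster_name,
            # Stop remediating when more than this share of machines is unhealthy.
            "maxUnhealthy": max_unhealthy,
            # How long a new machine may take to register as a node.
            "nodeStartupTimeout": "10m",
            "selector": {
                "matchLabels": {
                    "cluster.x-k8s.io/cluster-name": cluster_name,
                    "cluster.x-k8s.io/control-plane": "",
                }
            },
            # A node is unhealthy if Ready stays Unknown or False past the timeout.
            "unhealthyConditions": [
                {"type": "Ready", "status": "Unknown", "timeout": timeout},
                {"type": "Ready", "status": "False", "timeout": timeout},
            ],
        },
    }


if __name__ == "__main__":
    # "example" and "org-example" are placeholder names.
    print(json.dumps(build_control_plane_mhc("example", "org-example"), indent=2))
```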
Links