MachineDeployment Status should take MachineHealthCheck into account when computing final status #7533
Comments
Thx for opening the issue. I think we might have an additional problem: in this situation, AFAIK we can't create a new Machine with the old version to remediate.
/triage accepted

I have some questions about the use case: the report shows the cluster status surfacing an issue, while the ask is to consider MHC, which surfaces at the machine level with an annotation.

According to the current API, a Machine is considered ready when the node has been created and is "Ready" (which does not include MHC failures at the current state). Changing this is probably an API change.
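For reference, a minimal MachineHealthCheck of the kind being discussed might look like the sketch below; the names, selector labels, and timeouts are assumptions for illustration, not taken from the affected cluster.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: md-unhealthy-5m          # illustrative name
  namespace: default
spec:
  clusterName: my-cluster        # illustrative cluster name
  maxUnhealthy: 40%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      nodepool: nodepool-0       # illustrative label; must match the MD's Machines
  unhealthyConditions:
    # A Machine is marked unhealthy when its Node's Ready condition is
    # Unknown or False for longer than the configured timeout.
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```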
@fabriziopandini I will repro the issue and attach the cluster and machine description.
Reproduced the issue.
Will try to explain the problem that I see; note that there could be different ways to fix it than what has been asked in the issue.
Attaching machine/cluster yaml file.
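Since the attached files aren't reproduced here, a rough, hypothetical MachineDeployment status illustrating the reported mismatch might look like the following, with the phase still `Running` even though the MHC considers some of the Machines unhealthy; all values are assumptions.

```yaml
# Hypothetical values for illustration only; not the actual attachment.
status:
  phase: Running            # what the issue reports as misleading
  replicas: 3
  updatedReplicas: 3
  readyReplicas: 3
  availableReplicas: 3
  unavailableReplicas: 0
  conditions:
    - type: Available
      status: "True"
```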
I guess one of the key things in this particular issue is that kubeadm uses configmaps rather than CRDs and there are no conversion webhooks. That means if kubeadm does an API revision bump and upconverts the cluster config to it in the same release, then the immediately preceding release doesn't understand the config anymore. There's no supportable version skew in this scenario.
Echoing from above (trying to avoid scope creep on the issue): I'm investigating the kubeadm issue; ASAP I will follow up with a separate issue and possibly some ideas.
Fair enough. I don't know if there's much we want to do with Phase; it's very tricky to convey a sensible meaning in a single status field. @nehagjain15, as much as possible I would avoid relying on Phase in favour of conditions. What might be more interesting, on the above basis, is to set a Condition. In this error state we have:
Maybe we want an explicit Condition indicating there is at least one bad replica.
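A sketch of what such a Condition could look like on the MachineDeployment; the condition type, reason, and message below are hypothetical and not part of the current API.

```yaml
status:
  conditions:
    # Hypothetical condition; the type and reason names are illustrative only.
    - type: MachineHealthCheckSucceeded
      status: "False"
      reason: UnhealthyReplicas
      message: "1 of 3 replicas is failing its MachineHealthCheck"
```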
@fabriziopandini I'm probably wrong, but I thought the MHC controller sets the condition.

Independent of that, to me it seems we have a mismatch between when the Cluster topology controller considers an MD as "rolling out" vs what you can see at the MD (specifically the phase).

Topology controller code: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/topology/cluster/scope/state.go#L96-L98

@nehagjain15 Can you please provide the MachineDeployment YAML at the time when the topology controller is stuck?

EDIT: I assume what Naadir posted is the status of the MD at the time when a Machine is remediated? (sorry, it's hard to guess :))

I think on a high level we have the following options:
I think 2. is not really an option, as our current idea is to surface things like this through conditions and not through phase.

Looking a bit closer into how remediation in the MS controller works: it looks like the "OwnerRemediated" condition will be there only for a few seconds before the Machine is deleted. So the better approach is probably to take the information we have about spec replicas, available replicas, etc. and surface a condition which expresses whether the entire MD is ready/running/... (not sure what the right wording is).
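For context, the conditions in question live on the individual Machine rather than on the MD; a sketch of what a Machine's status might show shortly before the MachineSet deletes it for remediation (the reasons are assumptions for illustration):

```yaml
# Machine status shortly before remediation deletes it.
status:
  conditions:
    - type: HealthCheckSucceeded     # set to False by the MHC controller
      status: "False"
      reason: UnhealthyNode          # illustrative reason
    - type: OwnerRemediated          # tells the owning MachineSet to remediate
      status: "False"
      reason: WaitingForRemediation  # illustrative reason
```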
@nehagjain15 Issue should be resolved now. Basically we don't block MD upgrades anymore if other MachineDeployments are rolling out for other reasons (e.g. MHC remediation). This will be released in v1.4.2.
What steps did you take and what happened:
The cluster was provisioned with version 1.22.9+.
The cluster was then upgraded to v1.23.8+. At the time when the cluster was upgraded Machine Deployment status was running but MachineHealthCheck was still reporting some nodes as unhealthy.
Post k8s version upgrade, control plane nodes were upgraded to the new version, but the MD was not updated with the new k8s version because of the following:
Seen in the Cluster Description:
## Snippet from Logs
Currently, MachineDeployment status reports `running` for nodepool `tkc-wc-wcm-nodepool-8f399767-n854h`, but MHC clearly shows that all expected machines are not ready.

What did you expect to happen:
MachineDeployment status should not be running in the above scenario.
If the upgrade for MachineDeployment is blocked based on MachineHealthCheck then the status of MachineDeployment should take that into account.
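To make the mismatch concrete, the MHC status already exposes counters that disagree with the MD's `Running` phase in this scenario; a hypothetical example (values are assumptions, not from the affected cluster):

```yaml
# MachineHealthCheck status with hypothetical values
status:
  expectedMachines: 3
  currentHealthy: 2      # fewer healthy Machines than expected,
                         # while the MD still reports phase: Running
```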
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]