Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSA: ClusterClass with MHC fails #6653

Closed
sbueringer opened this issue Jun 15, 2022 · 6 comments
Closed

SSA: ClusterClass with MHC fails #6653

sbueringer opened this issue Jun 15, 2022 · 6 comments
Assignees
Labels
area/clusterclass Issues or PRs related to clusterclass kind/bug Categorizes issue or PR as related to a bug. kind/release-blocking Issues or PRs that need to be closed before the next CAPI release

Comments

@sbueringer
Copy link
Member

I'm creating a ClusterClass with MHC and the reconciler fails with the following error

{"ts":1655283660205.302,"caller":"controller/controller.go:326","msg":"Reconciler error","controller":"machinehealthcheck","controllerGroup":"cluster.x-k8s.io","controllerKind":"MachineHealthCheck","cluster":{"name":"my-cluster","namespace":"default"},"namespace":"default","name":"my-cluster","reconcileID":"357fef1f-b6d4-49a6-a366-5289927df209","err":"error reconciling the Cluster topology: failed to patch MachineHealthCheck/my-cluster-g6vcl: MachineHealthCheck.cluster.x-k8s.io "my-cluster-g6vcl" is invalid: metadata.ownerReferences.uid: Invalid value: "": uid must not be empty","errVerbose":"MachineHealthCheck.cluster.x-k8s.io "my-cluster-g6vcl" is invalid: metadata.ownerReferences.uid: Invalid value: "": uid must not be empty\nfailed to patch MachineHealthCheck"}

Error message:

error reconciling the Cluster topology: failed to patch MachineHealthCheck/my-cluster-g6vcl: MachineHealthCheck.cluster.x-k8s.io "my-cluster-g6vcl" is invalid: metadata.ownerReferences.uid: Invalid value: "": uid must not be empty

Stacktrace (thx JSON logging!)

sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileMachineHealthCheck
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:328
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileControlPlane
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:261
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileState
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:69
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcile
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:241
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).Reconcile
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
runtime.goexit

The problem is the following:

  • In desired_state.go we are copying over the owner rev of the previous MHC:
    if current != nil {
    if ref := getOwnerReferenceFrom(current, healthCheckTarget); ref != nil {
    mhc.SetOwnerReferences([]metav1.OwnerReference{*ref})
    }
    }
    • At this point the owner ref to KubeadmControlPlane contains a UID
  • In reconcile_state.go we overwrite the ownerReferences:
    s.Desired.ControlPlane.MachineHealthCheck.SetOwnerReferences([]metav1.OwnerReference{
    *ownerReferenceTo(s.Desired.ControlPlane.Object),
    })
    • s.Desired.ControlPlane.Object does not contain a UID so the ref also doesn't contain it
  • In reconcileMachineHealthCheck we only call resolveOwnerReferenceIfIncomplete if the MHC is initially created, on updates we don't do it and thus try to patch a MHC with an empty UID, which leads to the error above

/kind bug
/area topology
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/topology labels Jun 15, 2022
@sbueringer sbueringer added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Jun 15, 2022
@chrischdi
Copy link
Member

/assign

Note: Seems to only happen for MHC defined for ControlPlanes. MHC for a MachineDeployment via ClusterClass seems to be fine.

@chrischdi
Copy link
Member

chrischdi commented Jun 15, 2022

When comparing MHC for MachineDeployments and ControlPlanes on ClusterClass:

I see two solutions:

  1. Only carry over the ownerReference at
    if s.Desired.ControlPlane.MachineHealthCheck != nil {
    , when the ControlPlane object did not exist yet (and thus got created and the UID at s.desired.ControlPlane.Object was set) by checking if s.Current.ControlPlane.Object == nil
  2. Get the current control plane object and create ownerReference from it instead of using s.desired.ControlPlane.Object (which has no uid set on updates).

I favor for 1, because this would make the behaviour consistent to what is done for MachineDeployments.

@killianmuldoon
Copy link
Contributor

I favor for 1, because this would make the behaviour consistent to what is done for MachineDeployments.
That seems like the correct behaviour to me too.

@sbueringer
Copy link
Member Author

sbueringer commented Jun 15, 2022

Note: Let's please extend our e2e test ClusterClasses to also include CP+MD MHC's. I think it should be a trivial change and this should allow us to at least detect if the MHC reconciliation breaks anything (I wouldn't go so far to verify if the MHC actually works, just adding it)

@sbueringer
Copy link
Member Author

Has been fixed in #6660

/close

@k8s-ci-robot
Copy link
Contributor

@sbueringer: Closing this issue.

In response to this:

Has been fixed in #6660

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clusterclass Issues or PRs related to clusterclass kind/bug Categorizes issue or PR as related to a bug. kind/release-blocking Issues or PRs that need to be closed before the next CAPI release
Projects
None yet
4 participants