SSA: ClusterClass with MHC fails #6653

sbueringer · 2022-06-15T09:21:15Z

I'm creating a ClusterClass with an MHC and the reconciler fails with the following error:

{"ts":1655283660205.302,"caller":"controller/controller.go:326","msg":"Reconciler error","controller":"machinehealthcheck","controllerGroup":"cluster.x-k8s.io","controllerKind":"MachineHealthCheck","cluster":{"name":"my-cluster","namespace":"default"},"namespace":"default","name":"my-cluster","reconcileID":"357fef1f-b6d4-49a6-a366-5289927df209","err":"error reconciling the Cluster topology: failed to patch MachineHealthCheck/my-cluster-g6vcl: MachineHealthCheck.cluster.x-k8s.io \"my-cluster-g6vcl\" is invalid: metadata.ownerReferences.uid: Invalid value: \"\": uid must not be empty","errVerbose":"MachineHealthCheck.cluster.x-k8s.io \"my-cluster-g6vcl\" is invalid: metadata.ownerReferences.uid: Invalid value: \"\": uid must not be empty\nfailed to patch MachineHealthCheck"}

Error message:

error reconciling the Cluster topology: failed to patch MachineHealthCheck/my-cluster-g6vcl: MachineHealthCheck.cluster.x-k8s.io "my-cluster-g6vcl" is invalid: metadata.ownerReferences.uid: Invalid value: "": uid must not be empty

Stacktrace (thx JSON logging!)

sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileMachineHealthCheck
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:328
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileControlPlane
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:261
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcileState
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/reconcile_state.go:69
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcile
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:241
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).Reconcile
   /home/sbuerin/code/src/sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
   /home/sbuerin/code/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
runtime.goexit

The problem is the following:

In desired_state.go we are copying over the owner reference of the previous MHC:

cluster-api/internal/controllers/topology/cluster/desired_state.go

Lines 828 to 832 in abb8f5f

    
    if current != nil {
        if ref := getOwnerReferenceFrom(current, healthCheckTarget); ref != nil {
            mhc.SetOwnerReferences([]metav1.OwnerReference{*ref})
        }
    }

At this point the owner reference to the KubeadmControlPlane contains a UID.
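For readers who don't have the code open, here is a minimal sketch of the carry-over step quoted above. It uses simplified stand-in types instead of `metav1.OwnerReference` and `client.Object`, and `findOwnerReference` is a hypothetical helper: matching by kind and name is an assumption, not a claim about how `getOwnerReferenceFrom` is actually implemented. The object names and the UID are made up as well.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
	UID  string
}

// Object is a simplified stand-in for a Kubernetes object.
type Object struct {
	Kind            string
	Name            string
	OwnerReferences []OwnerReference
}

// findOwnerReference returns a copy of the reference on obj that points at
// owner, or nil if there is none. Matching by kind and name is an assumption.
func findOwnerReference(obj, owner *Object) *OwnerReference {
	for _, ref := range obj.OwnerReferences {
		if ref.Kind == owner.Kind && ref.Name == owner.Name {
			found := ref
			return &found
		}
	}
	return nil
}

func main() {
	// Hypothetical health check target (the control plane object).
	target := &Object{Kind: "KubeadmControlPlane", Name: "my-cluster-cp"}

	// The MHC that already exists in the cluster; its reference has a UID.
	current := &Object{
		Kind: "MachineHealthCheck",
		Name: "my-cluster-g6vcl",
		OwnerReferences: []OwnerReference{
			{Kind: "KubeadmControlPlane", Name: "my-cluster-cp", UID: "example-uid"},
		},
	}

	// The freshly computed desired MHC starts without owner references.
	desired := &Object{Kind: "MachineHealthCheck", Name: "my-cluster-g6vcl"}

	// Carry over the existing reference, UID included.
	if ref := findOwnerReference(current, target); ref != nil {
		desired.OwnerReferences = []OwnerReference{*ref}
	}
	fmt.Printf("%+v\n", desired.OwnerReferences)
}
```

The point is only that the carried-over reference keeps the UID it already had on the existing MHC.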

In reconcile_state.go we overwrite the ownerReferences:

cluster-api/internal/controllers/topology/cluster/reconcile_state.go

Lines 255 to 257 in 2f6f77f

    
    s.Desired.ControlPlane.MachineHealthCheck.SetOwnerReferences([]metav1.OwnerReference{
        *ownerReferenceTo(s.Desired.ControlPlane.Object),
    })

s.Desired.ControlPlane.Object does not contain a UID, so the resulting owner reference doesn't contain one either.

In reconcileMachineHealthCheck we only call resolveOwnerReferenceIfIncomplete when the MHC is initially created. On updates we don't, so we try to patch an MHC whose owner reference has an empty UID, which leads to the error above.
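A minimal sketch of why the patch is rejected, under stated assumptions: the types are simplified stand-ins for the Kubernetes ones, `validateOwnerReference` is a hypothetical function that only imitates the API server check quoted in the error message, and the control plane name and UID are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
	UID  string
}

// Object is a simplified stand-in for a Kubernetes object.
type Object struct {
	Kind string
	Name string
	UID  string // empty on a computed object that was never read from the API server
}

// ownerReferenceTo builds a reference from whatever the object carries,
// including an empty UID.
func ownerReferenceTo(o Object) OwnerReference {
	return OwnerReference{Kind: o.Kind, Name: o.Name, UID: o.UID}
}

// validateOwnerReference imitates (hypothetically) the server-side check
// behind the error in this issue.
func validateOwnerReference(ref OwnerReference) error {
	if ref.UID == "" {
		return errors.New(`metadata.ownerReferences.uid: Invalid value: "": uid must not be empty`)
	}
	return nil
}

func main() {
	// Desired control plane as computed by the topology controller: no UID.
	desired := Object{Kind: "KubeadmControlPlane", Name: "my-cluster-cp"}
	// The same object as stored in the API server: UID is set.
	live := Object{Kind: "KubeadmControlPlane", Name: "my-cluster-cp", UID: "example-uid"}

	fmt.Println("from desired:", validateOwnerReference(ownerReferenceTo(desired)))
	fmt.Println("from live:   ", validateOwnerReference(ownerReferenceTo(live)))
}
```

On create this is harmless because the incomplete reference gets resolved first; on update nothing fills in the UID before the patch is sent.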

/kind bug
/area topology

chrischdi · 2022-06-15T11:05:53Z

/assign

Note: Seems to only happen for MHC defined for ControlPlanes. MHC for a MachineDeployment via ClusterClass seems to be fine.

chrischdi · 2022-06-15T12:09:41Z

When comparing MHC for MachineDeployments and ControlPlanes on ClusterClass:

In desired_state.go we carry over the ownerReference in both cases (MachineDeployments, ControlPlanes) if the source object already exists:

cluster-api/internal/controllers/topology/cluster/desired_state.go

Line 828 in 72b68bf

    if current != nil {

In reconcile_state.go:
- for MachineDeployments: We only set the owner reference:
  - when the MachineDeployment gets created (in createMachineDeployment):
    
    cluster-api/internal/controllers/topology/cluster/reconcile_state.go
    
    Line 456 in 72b68bf
    
    *ownerReferenceTo(md.Object),
  - not when it is getting updated (updateMachineDeployment).
- for ControlPlanes: We do it in both cases (create and update):
  
  cluster-api/internal/controllers/topology/cluster/reconcile_state.go
  
  Line 254 in 72b68bf
  
  if s.Desired.ControlPlane.MachineHealthCheck != nil {
  - For the update case: if s.Desired.ControlPlane.Object does not result in an update, it will not have the UID set. **This leads to setting the broken UID.**

I see two solutions:

1. Only set the ownerReference at

   cluster-api/internal/controllers/topology/cluster/reconcile_state.go

   Line 254 in 72b68bf

       if s.Desired.ControlPlane.MachineHealthCheck != nil {

   when the ControlPlane object did not exist yet (and thus got created, so the UID at s.Desired.ControlPlane.Object was set), by checking if s.Current.ControlPlane.Object == nil.
2. Get the current control plane object and create the ownerReference from it instead of using s.Desired.ControlPlane.Object (which has no UID set on updates).

I favor option 1, because this would make the behaviour consistent with what is done for MachineDeployments.
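To make the two options concrete, here is a rough sketch with simplified stand-in types. The function names `setOwnerOnCreateOnly` and `setOwnerFromCurrent` are hypothetical and are not taken from the repository or from the eventual fix; the sketch also assumes, as described above, that the desired control plane object carries a UID once it has just been created.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes types involved.
type OwnerReference struct{ Kind, Name, UID string }
type Object struct{ Kind, Name, UID string }
type MachineHealthCheck struct{ OwnerReferences []OwnerReference }

func ownerReferenceTo(o *Object) OwnerReference {
	return OwnerReference{Kind: o.Kind, Name: o.Name, UID: o.UID}
}

// Option 1 (hypothetical name): only set the reference when the control
// plane did not exist before. On updates the reference carried over from
// the existing MHC is left untouched.
func setOwnerOnCreateOnly(mhc *MachineHealthCheck, currentCP, desiredCP *Object) {
	if currentCP == nil {
		// Assumption: desiredCP has a UID here because it was just created.
		mhc.OwnerReferences = []OwnerReference{ownerReferenceTo(desiredCP)}
	}
}

// Option 2 (hypothetical name): always set the reference, but build it from
// the current (live) control plane whenever one exists.
func setOwnerFromCurrent(mhc *MachineHealthCheck, currentCP, desiredCP *Object) {
	source := desiredCP
	if currentCP != nil {
		source = currentCP
	}
	mhc.OwnerReferences = []OwnerReference{ownerReferenceTo(source)}
}

func main() {
	// Update case: the control plane exists, the desired object has no UID.
	current := &Object{Kind: "KubeadmControlPlane", Name: "my-cluster-cp", UID: "example-uid"}
	desired := &Object{Kind: "KubeadmControlPlane", Name: "my-cluster-cp"}
	carriedOver := []OwnerReference{ownerReferenceTo(current)}

	mhc1 := &MachineHealthCheck{OwnerReferences: carriedOver}
	setOwnerOnCreateOnly(mhc1, current, desired)
	fmt.Printf("option 1: %+v\n", mhc1.OwnerReferences)

	mhc2 := &MachineHealthCheck{OwnerReferences: carriedOver}
	setOwnerFromCurrent(mhc2, current, desired)
	fmt.Printf("option 2: %+v\n", mhc2.OwnerReferences)
}
```

Both variants keep a non-empty UID in the update case; option 1 does so by not touching the reference at all, which mirrors the MachineDeployment path.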

killianmuldoon · 2022-06-15T12:22:11Z

> I favor option 1, because this would make the behaviour consistent with what is done for MachineDeployments.

That seems like the correct behaviour to me too.

sbueringer · 2022-06-15T13:03:38Z

Note: Let's please extend our e2e test ClusterClasses to also include CP and MD MHCs. I think it should be a trivial change, and it should allow us to at least detect if the MHC reconciliation breaks anything (I wouldn't go so far as to verify that the MHC actually works, just add it).

sbueringer · 2022-06-21T12:30:05Z

Has been fixed in #6660

/close

k8s-ci-robot · 2022-06-21T12:30:17Z

@sbueringer: Closing this issue.


k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/topology labels Jun 15, 2022

sbueringer added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Jun 15, 2022

k8s-ci-robot assigned chrischdi Jun 15, 2022

chrischdi mentioned this issue Jun 15, 2022

🐛 Prevent breaking an existing owner reference for MachineHealthChecks of ControlPlanes in topology #6655

Closed

k8s-ci-robot closed this as completed Jun 21, 2022

killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023
