Topology controller should not take ownership of cluster.spec.topology #6736
Labels: area/clusterclass (Issues or PRs related to clusterclass), kind/bug (Categorizes issue or PR as related to a bug)

Comments
/area topology

I think a fix should get cherry-picked to v1.2.

#6738 implements option 1:

Cons of option 2 and the current implementation:

Edit: option 1 does not work out of the box, because

#6739 implements option 2, which simply extends the behaviour already done for
What steps did you take and what happened:
Checked the output of pull-cluster-api-e2e-full-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-e2e-full-main/1540324488158121984

Especially the Cluster object and the resulting managed fields; the interesting part: [0]
The interesting part here is the co-ownership of the path spec.topology and the fields below it. These are fields which are normally read as input for computing the desired state and are normally never written by the topology controller.
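For context, the co-ownership can be seen directly in metadata.managedFields. The following is a small diagnostic sketch written for this issue (it is not cluster-api code; the helper name and the usage in main are illustrative only) that lists which field managers claim ownership of spec.topology on a Cluster object:

```go
package main

import (
	"encoding/json"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// topologyOwners returns the names of all field managers that claim ownership
// of spec.topology (or fields below it) according to metadata.managedFields.
func topologyOwners(cluster *clusterv1.Cluster) []string {
	var owners []string
	for _, mf := range cluster.GetManagedFields() {
		if mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]interface{}
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			continue
		}
		spec, ok := fields["f:spec"].(map[string]interface{})
		if !ok {
			continue
		}
		if _, ok := spec["f:topology"]; ok {
			owners = append(owners, mf.Manager)
		}
	}
	return owners
}

func main() {
	// With a Cluster fetched from the API server, more than one manager showing
	// up here indicates the co-ownership described above.
	var cluster clusterv1.Cluster
	fmt.Println(topologyOwners(&cluster))
}
```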
The same may apply to the following paths, but this needs to be verified:
- spec.paused
- spec.clusterNetwork
The issue is: if the topology controller reconciles and at the same time another controller or user changes one of the co-owned fields, the topology controller may overwrite the changed data.
E.g.:
1. The topology controller reads the Cluster while Cluster.spec.topology.workers.machineDeployments[0].replicas == 3 and starts computing the desired state.
2. A user (or another controller) sets Cluster.spec.topology.workers.machineDeployments[0].replicas = 100 and patches it to the API server.
3. The topology controller applies its (now stale) desired state, and Cluster.spec.topology.workers.machineDeployments[0].replicas is reverted to 3.
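To make the mechanism concrete, here is a minimal sketch (not the actual topology controller code) of how a stale server-side apply that includes spec.topology can revert such a change. The field owner name "capi-topology" and the surrounding function are assumptions for illustration:

```go
package sketch

import (
	"context"

	"k8s.io/utils/pointer"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyStaleDesiredState illustrates the race: "desired" was computed from a
// Cluster read before the user bumped replicas to 100, so it still carries the
// old value under spec.topology (assumes topology/workers are set).
func applyStaleDesiredState(ctx context.Context, c client.Client, desired *clusterv1.Cluster) error {
	// Stale value from the earlier read.
	desired.Spec.Topology.Workers.MachineDeployments[0].Replicas = pointer.Int32(3)

	// Because the applied object contains spec.topology, the controller becomes a
	// co-owner of that path, and this forced apply resets replicas back to 3,
	// silently undoing the user's change to 100.
	return c.Patch(ctx, desired, client.Apply,
		client.FieldOwner("capi-topology"),
		client.ForceOwnership,
	)
}
```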
What did you expect to happen:
The topology controller does not overwrite Cluster fields which it is not supposed to write.
Anything else you would like to add:
One option is to add the unrelated fields to the ignorePaths at patchEngine.Apply (see the code reference further below).

As an alternative, adjust computeCluster in the topology controller's desired_state.go: computeCluster (cluster-api/internal/controllers/topology/cluster/desired_state.go, line 406 in 5c44a14) could be changed to only contain/set the relevant paths that are supposed to be written via the topology controller (a rough sketch follows the list below):
- metadata.labels
- cluster.Spec.InfrastructureRef
- cluster.Spec.ControlPlaneRef
- as well as other required metadata (metadata.name, metadata.namespace)
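For illustration, a rough sketch of this alternative: the function below only carries the paths listed above and deliberately leaves spec.topology unset. The name, signature, and label handling are simplified assumptions, not the real computeCluster implementation:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// computeClusterSketch is a simplified stand-in for computeCluster: it returns a
// desired Cluster that only contains the fields the topology controller should
// own, and leaves spec.topology (as well as spec.paused and spec.clusterNetwork)
// unset so the controller never claims ownership of them.
func computeClusterSketch(current *clusterv1.Cluster, infraRef, controlPlaneRef *corev1.ObjectReference) *clusterv1.Cluster {
	return &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{
			// Required identifying metadata.
			Name:      current.Name,
			Namespace: current.Namespace,
			// Labels managed by the topology controller.
			Labels: map[string]string{
				clusterv1.ClusterTopologyOwnedLabel: "",
			},
		},
		Spec: clusterv1.ClusterSpec{
			// References the topology controller is responsible for.
			InfrastructureRef: infraRef,
			ControlPlaneRef:   controlPlaneRef,
		},
	}
}
```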
For the first option, the unrelated fields could be added to the ignorePaths at cluster-api/internal/controllers/topology/cluster/reconcile_state.go, line 442 in 5c44a14.
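Conceptually, the ignorePaths approach boils down to stripping the unrelated paths from the desired object before it is patched. The snippet below is only a conceptual sketch using the unstructured helpers from apimachinery; it does not reflect the actual signature of the internal patch helper / patchEngine:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// dropIgnoredPaths removes the given paths from a desired object so that a
// subsequent server-side apply does not claim ownership of them. This is a
// conceptual helper for this issue, not the cluster-api implementation.
func dropIgnoredPaths(desired *unstructured.Unstructured, ignorePaths [][]string) {
	for _, path := range ignorePaths {
		unstructured.RemoveNestedField(desired.Object, path...)
	}
}

// Example usage with the paths discussed in this issue:
//
//	dropIgnoredPaths(desiredCluster, [][]string{
//		{"spec", "topology"},
//		{"spec", "paused"},
//		{"spec", "clusterNetwork"},
//	})
```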
xref: issue discovered while testing dry-run server-side apply in #6710, but it should potentially already affect main.
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):

/kind bug