
capi-controller-manager continuously patches AWSCluster object when using ClusterClass #6320

Closed · matlemb opened this issue Mar 21, 2022 · 12 comments · Fixed by #6495

Labels: kind/bug (Categorizes issue or PR as related to a bug.), lifecycle/active (Indicates that an issue or PR is actively being worked on by a contributor.)
matlemb commented Mar 21, 2022

What steps did you take and what happened:
When launching a Kubernetes cluster on AWS with Cluster API and consuming existing AWS infrastructure (existing VPC and subnets) via ClusterClass, capi-controller-manager continuously patches the AWSCluster object back to the state of the AWSClusterTemplate object.

The AWSCluster object gets additional information via AWS API calls through capa-controller-manager (for example the route table IDs, tags, etc.), so both controllers are continuously modifying the object, resulting in a loop.

What did you expect to happen:
capi-controller-manager should not revert changes made by capa-controller-manager when using ClusterClass. The ClusterClass feature relies on AWSClusterTemplate.

Anything else you would like to add:
I believe this is more a CAPI/ClusterClass issue than a CAPA issue. By design, CAPA needs to write this information back to the AWSCluster object.

A workaround was tested successfully by defining in the AWSClusterTemplate all of the additional information that capa-controller-manager would otherwise retrieve. However, this is not a practical solution (especially for tags, as these change when launching new clusters).

AWSCluster object with additional subnet information written by capa-controller-manager:

apiVersion: v1
items:
- apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
  kind: AWSCluster
  metadata:
    annotations:
      cluster.x-k8s.io/cloned-from-groupkind: AWSClusterTemplate.infrastructure.cluster.x-k8s.io
      cluster.x-k8s.io/cloned-from-name: my-aws-workload-cluster
    name: my-aws-workload-cluster-5tq2q
    namespace: my-aws-workload-cluster
[...]
  spec:
    network:   
      vpc:
        availabilityZoneSelection: Ordered
        availabilityZoneUsageLimit: 3
        cidrBlock: 192.168.0.0/16
        id: vpc-zzz
      subnets:
      - availabilityZone: eu-central-1a
        cidrBlock: 192.168.6.0/24
        id: subnet-aaa
        isPublic: false
        routeTableId: rtb-eee
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/internal-elb: "1"
      - availabilityZone: eu-central-1b
        cidrBlock: 192.168.7.0/24
        id: subnet-bbb
        isPublic: false
        routeTableId: rtb-fff
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/internal-elb: "1"
      - availabilityZone: eu-central-1c
        cidrBlock: 192.168.8.0/24
        id: subnet-ccc
        isPublic: false
        routeTableId: rtb-ggg
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/internal-elb: "1"
      - availabilityZone: eu-central-1a
        cidrBlock: 100.64.0.0/24
        id: subnet-ddd
        isPublic: true
        routeTableId: rtb-hhh
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/elb: "1"
      - availabilityZone: eu-central-1b
        cidrBlock: 100.64.1.0/24
        id: subnet-eee
        isPublic: true
        routeTableId: rtb-iii
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/elb: "1"
      - availabilityZone: eu-central-1c
        cidrBlock: 100.64.2.0/24
        id: subnet-fff
        isPublic: true
        routeTableId: rtb-jjj
        tags:
          kubernetes.io/cluster/my-aws-workload-cluster: shared
          kubernetes.io/role/elb: "1"

AWSClusterTemplate object with the desired state enforced by capi-controller-manager:

apiVersion: v1
items:
- apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
  kind: AWSClusterTemplate
  metadata:
    name: my-aws-workload-cluster
    namespace: my-aws-workload-cluster
[...]
  spec:
    template:
      spec:
        network:
          vpc:
            id: vpc-zzz
          subnets:
          - id: subnet-aaa
            availabilityZone: eu-central-1a
          - id: subnet-bbb
            availabilityZone: eu-central-1b
          - id: subnet-ccc
            availabilityZone: eu-central-1c
          - id: subnet-ddd
            availabilityZone: eu-central-1a
            isPublic: true
          - id: subnet-eee
            availabilityZone: eu-central-1b
            isPublic: true
          - id: subnet-fff
            availabilityZone: eu-central-1c
            isPublic: true

Log snippets from capi-controller-manager (grep -i patch):

...
I0321 13:48:01.618059       1 reconcile_state.go:612] controller/topology/cluster "msg"="Patching object" "name"="aws-workload-001-1379-r-01" "namespace"="aws-workload-001-1379-r-01" "object"="aws-workload-001-1379-r-01-4ftw9" "object groupVersion"="infrastructure.cluster.x-k8s.io/v1beta1" "object kind"="AWSCluster" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster" "Patch"="{\"spec\":{\"network\":{\"subnets\":[{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-aaa\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-bbb\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-ccc\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-ddd\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-eee\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-fff\",\"isPublic\":true}]}}}"
I0321 13:48:02.731611       1 reconcile_state.go:612] controller/topology/cluster "msg"="Patching object" "name"="my-aws-workload-cluster" "namespace"="my-aws-workload-cluster" "object"="my-aws-workload-cluster-4ftw9" "object groupVersion"="infrastructure.cluster.x-k8s.io/v1beta1" "object kind"="AWSCluster" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster" "Patch"="{\"spec\":{\"network\":{\"subnets\":[{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-aaa\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-bbb\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-ccc\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-ddd\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-eee\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-fff\",\"isPublic\":true}]}}}"
I0321 13:48:04.271176       1 reconcile_state.go:612] controller/topology/cluster "msg"="Patching object" "name"="my-aws-workload-cluster" "namespace"="my-aws-workload-cluster" "object"="my-aws-workload-cluster-4ftw9" "object groupVersion"="infrastructure.cluster.x-k8s.io/v1beta1" "object kind"="AWSCluster" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster" "Patch"="{\"spec\":{\"network\":{\"subnets\":[{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-aaa\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-bbb\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-ccc\",\"isPublic\":false},{\"availabilityZone\":\"eu-central-1a\",\"id\":\"subnet-ddd\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1b\",\"id\":\"subnet-eee\",\"isPublic\":true},{\"availabilityZone\":\"eu-central-1c\",\"id\":\"subnet-fff\",\"isPublic\":true}]}}}"
...

Log snippets from capa-controller-manager (grep -i network.go):

...
I0321 13:48:03.583567       1 network.go:68] controller/awscluster "msg"="Reconcile network completed successfully" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
I0321 13:48:03.763572       1 network.go:29] controller/awscluster "msg"="Reconciling network for cluster" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" "cluster-name"="my-aws-workload-cluster" "cluster-namespace"="my-aws-workload-cluster"
I0321 13:48:04.012918       1 network.go:68] controller/awscluster "msg"="Reconcile network completed successfully" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
I0321 13:48:04.314065       1 network.go:29] controller/awscluster "msg"="Reconciling network for cluster" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" "cluster-name"="my-aws-workload-cluster" "cluster-namespace"="my-aws-workload-cluster"
I0321 13:48:05.009338       1 network.go:68] controller/awscluster "msg"="Reconcile network completed successfully" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
I0321 13:48:05.273592       1 network.go:29] controller/awscluster "msg"="Reconciling network for cluster" "cluster"="my-aws-workload-cluster" "name"="my-aws-workload-cluster-4ftw9" "namespace"="my-aws-workload-cluster" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" "cluster-name"="my-aws-workload-cluster" "cluster-namespace"="my-aws-workload-cluster"
...

Environment:

  • Cluster-api version: 1.1.3
  • Kubernetes version (use kubectl version): 1.22.4
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

/kind bug

Matthias Lembcke <[email protected]>, Daimler TSS GmbH (Provider Information)

@k8s-ci-robot added the kind/bug label on Mar 21, 2022
fabriziopandini (Member) commented Mar 21, 2022

/milestone v1.2
This is potentially a candidate for backport.

The topology reconciler has been specifically designed to manage only the fields that are explicitly defined in templates; some internals of this are described in https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/change-clusterclass.html#reference
However, what is happening here is that the two controllers are "fighting" over fields of items in a list, and JSON patching is notoriously limited when working on list items.
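
To make the list limitation concrete, here is a condensed sketch (reusing the subnet values from the report above, not a verbatim diff) of what each controller wants for a single subnet entry; because the two-way (JSON merge) patch computed here treats arrays as atomic values, the topology reconciler's patch carries the whole template-derived list and drops everything CAPA added:

# Desired by the topology reconciler (derived from the AWSClusterTemplate):
subnets:
- id: subnet-aaa
  availabilityZone: eu-central-1a
  isPublic: false

# Written back by capa-controller-manager after reconciling AWS:
subnets:
- id: subnet-aaa
  availabilityZone: eu-central-1a
  isPublic: false
  cidrBlock: 192.168.6.0/24      # added by CAPA
  routeTableId: rtb-eee          # added by CAPA
  tags:                          # added by CAPA
    kubernetes.io/cluster/my-aws-workload-cluster: shared
    kubernetes.io/role/internal-elb: "1"

# The computed patch replaces spec.network.subnets with the first list in one
# shot, so the CAPA-added fields are removed on every topology reconcile.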

In order to properly triage the issue, it would be great to have the full output of the AWSCluster object.

@k8s-ci-robot added this to the v1.2 milestone on Mar 21, 2022
matlemb (Author) commented Mar 21, 2022

Thanks for your fast feedback! Here is the full output of the AWSCluster objects.

AWSCluster object when not modified by capa (desired state of capi):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  annotations:
    cluster.x-k8s.io/cloned-from-groupkind: AWSClusterTemplate.infrastructure.cluster.x-k8s.io
    cluster.x-k8s.io/cloned-from-name: my-aws-workload-cluster
    topology.cluster.x-k8s.io/managed-field-paths: H4sIAAAAAAAA/2SPwWrDMBBE/0VnfUGOaS+hoRSXXnpbS1Nn8WZVtEqCMf73osgphpw0YjRPM7PryQondbvZkUi6Ib4cXru9pDCa282Ld1DqBbFeFu9C0pKTfAgpjoninoQ0IFdAyMnsO20N1qFRLJxwxqovvaLYSuQILVymDj8VMrLG9kypBapEuaU83j9RXo+DDhlm3UXwYG3I3l1/w33WlVioZ+Ey1XKfEIR18+Kf7C+jAUc+c2k+t+WLdxnDf8rs9Ibp/dHwLwAA//9vVA9OSAEAAA==
  creationTimestamp: "2022-03-21T16:03:40Z"
  finalizers:
  - awscluster.infrastructure.cluster.x-k8s.io
  generation: 677
  labels:
    cluster.x-k8s.io/cluster-name: my-aws-workload-cluster
    topology.cluster.x-k8s.io/owned: ""
  name: my-aws-workload-cluster-rdcc8
  namespace: my-aws-workload-cluster
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Cluster
    name: my-aws-workload-cluster
    uid: 62cca6a2-3e19-4747-ab5f-e44a2475410c
  resourceVersion: "22924479"
  uid: cae152e9-8ac6-47f4-aa2b-4e39b0a2ccde
spec:
  bastion:
    allowedCIDRBlocks:
    - 0.0.0.0/0
    enabled: false
  controlPlaneEndpoint:
    host: internal-8xztwdw9a0uh573gk73zxw2zu0ig-k8s-72349339.eu-central-1.elb.amazonaws.com
    port: 6443
  controlPlaneLoadBalancer:
    crossZoneLoadBalancing: false
    scheme: internal
    subnets:
    - subnet-02cfd2ed9a0bf20aa
    - subnet-0049a93e4ac5594aa
    - subnet-042910a41b4206e96
  identityRef:
    kind: AWSClusterStaticIdentity
    name: my-aws-workload-cluster-account
  network:
    cni:
      cniIngressRules:
      - description: bgp (calico)
        fromPort: 179
        protocol: tcp
        toPort: 179
      - description: IP-in-IP (calico)
        fromPort: -1
        protocol: "4"
        toPort: 65535
    subnets:
    - availabilityZone: eu-central-1a
      id: subnet-02897d43aecbd1abb
      isPublic: false
    - availabilityZone: eu-central-1b
      id: subnet-09b63ae42f80aa441
      isPublic: false
    - availabilityZone: eu-central-1c
      id: subnet-08749cde1774778a5
      isPublic: false
    - availabilityZone: eu-central-1a
      id: subnet-0f7abb8ed5cc40ffb
      isPublic: true
    - availabilityZone: eu-central-1b
      id: subnet-0f4842a43f6d2415f
      isPublic: true
    - availabilityZone: eu-central-1c
      id: subnet-0995c47a4483d1ee8
      isPublic: true
    vpc:
      availabilityZoneSelection: Ordered
      availabilityZoneUsageLimit: 3
      cidrBlock: 192.168.0.0/16
      id: vpc-0a04044e9ae06ad3f
  region: eu-central-1
  sshKeyName: capa
status:
  conditions:
  - lastTransitionTime: "2022-03-21T16:05:04Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-03-21T16:03:44Z"
    status: "True"
    type: ClusterSecurityGroupsReady
  - lastTransitionTime: "2022-03-21T16:05:04Z"
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2022-03-21T16:03:40Z"
    status: "True"
    type: PrincipalCredentialRetrieved
  - lastTransitionTime: "2022-03-21T16:03:40Z"
    status: "True"
    type: PrincipalUsageAllowed
  - lastTransitionTime: "2022-03-21T16:03:42Z"
    status: "True"
    type: SubnetsReady
  - lastTransitionTime: "2022-03-21T16:03:41Z"
    status: "True"
    type: VpcReady
  failureDomains:
    eu-central-1a:
      controlPlane: true
    eu-central-1b:
      controlPlane: true
    eu-central-1c:
      controlPlane: true
  networkStatus:
    apiServerElb:
      attributes:
        idleTimeout: 600000000000
      availabilityZones:
      - eu-central-1b
      - eu-central-1a
      - eu-central-1c
      dnsName: internal-8xztwdw9a0uh573gk73zxw2zu0ig-k8s-72349339.eu-central-1.elb.amazonaws.com
      name: 8xztwdw9a0uh573gk73zxw2zu0ig-k8s
      scheme: internal
      securityGroupIds:
      - sg-035eaf68aaceb3ba1
      subnetIds:
      - subnet-0049a93e4ac5594aa
      - subnet-02cfd2ed9a0bf20aa
      - subnet-042910a41b4206e96
      tags:
        Name: 8xztwdw9a0uh573gk73zxw2zu0ig-k8s
        sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
        sigs.k8s.io/cluster-api-provider-aws/role: apiserver
    securityGroups:
      apiserver-lb:
        id: sg-035eaf68aaceb3ba1
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          toPort: 6443
        name: my-aws-workload-cluster-apiserver-lb
        tags:
          Name: my-aws-workload-cluster-apiserver-lb
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: apiserver-lb
      bastion:
        id: sg-06f109a5744c04fbd
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: SSH
          fromPort: 22
          protocol: tcp
          toPort: 22
        name: my-aws-workload-cluster-bastion
        tags:
          Name: my-aws-workload-cluster-bastion
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: bastion
      controlplane:
        id: sg-083e3d13dc8071736
        ingressRule:
        - description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-035eaf68aaceb3ba1
          - sg-083e3d13dc8071736
          toPort: 6443
        - description: etcd
          fromPort: 2379
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-083e3d13dc8071736
          toPort: 2379
        - description: etcd peer
          fromPort: 2380
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-083e3d13dc8071736
          toPort: 2380
        - description: bgp (calico)
          fromPort: 179
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 179
        - description: IP-in-IP (calico)
          fromPort: 0
          protocol: "4"
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 0
        name: my-aws-workload-cluster-controlplane
        tags:
          Name: my-aws-workload-cluster-controlplane
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: controlplane
      lb:
        id: sg-0aeee9b356f992744
        name: my-aws-workload-cluster-lb
        tags:
          Name: my-aws-workload-cluster-lb
          kubernetes.io/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: lb
      node:
        id: sg-002cec903938d6df0
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Node Port Services
          fromPort: 30000
          protocol: tcp
          toPort: 32767
        - description: Kubelet API
          fromPort: 10250
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 10250
        - description: bgp (calico)
          fromPort: 179
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 179
        - description: IP-in-IP (calico)
          fromPort: 0
          protocol: "4"
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 0
        name: my-aws-workload-cluster-node
        tags:
          Name: my-aws-workload-cluster-node
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: node
  ready: true

AWSCluster object when modified by capa (undesired state of capi):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  annotations:
    cluster.x-k8s.io/cloned-from-groupkind: AWSClusterTemplate.infrastructure.cluster.x-k8s.io
    cluster.x-k8s.io/cloned-from-name: my-aws-workload-cluster
    topology.cluster.x-k8s.io/managed-field-paths: H4sIAAAAAAAA/2SPwWrDMBBE/0VnfUGOaS+hoRSXXnpbS1Nn8WZVtEqCMf73osgphpw0YjRPM7PryQondbvZkUi6Ib4cXru9pDCa282Ld1DqBbFeFu9C0pKTfAgpjoninoQ0IFdAyMnsO20N1qFRLJxwxqovvaLYSuQILVymDj8VMrLG9kypBapEuaU83j9RXo+DDhlm3UXwYG3I3l1/w33WlVioZ+Ey1XKfEIR18+Kf7C+jAUc+c2k+t+WLdxnDf8rs9Ibp/dHwLwAA//9vVA9OSAEAAA==
  creationTimestamp: "2022-03-21T16:03:40Z"
  finalizers:
  - awscluster.infrastructure.cluster.x-k8s.io
  generation: 808
  labels:
    cluster.x-k8s.io/cluster-name: my-aws-workload-cluster
    topology.cluster.x-k8s.io/owned: ""
  name: my-aws-workload-cluster-rdcc8
  namespace: my-aws-workload-cluster
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Cluster
    name: my-aws-workload-cluster
    uid: 62cca6a2-3e19-4747-ab5f-e44a2475410c
  resourceVersion: "22924994"
  uid: cae152e9-8ac6-47f4-aa2b-4e39b0a2ccde
spec:
  bastion:
    allowedCIDRBlocks:
    - 0.0.0.0/0
    enabled: false
  controlPlaneEndpoint:
    host: internal-8xztwdw9a0uh573gk73zxw2zu0ig-k8s-72349339.eu-central-1.elb.amazonaws.com
    port: 6443
  controlPlaneLoadBalancer:
    crossZoneLoadBalancing: false
    scheme: internal
    subnets:
    - subnet-02cfd2ed9a0bf20aa
    - subnet-0049a93e4ac5594aa
    - subnet-042910a41b4206e96
  identityRef:
    kind: AWSClusterStaticIdentity
    name: my-aws-workload-cluster-account
  network:
    cni:
      cniIngressRules:
      - description: bgp (calico)
        fromPort: 179
        protocol: tcp
        toPort: 179
      - description: IP-in-IP (calico)
        fromPort: -1
        protocol: "4"
        toPort: 65535
    subnets:
    - availabilityZone: eu-central-1a
      cidrBlock: 192.168.6.0/24
      id: subnet-02897d43aecbd1abb
      isPublic: false
      routeTableId: rtb-0ceb07dd7abecf8f7
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/internal-elb: "1"
    - availabilityZone: eu-central-1b
      cidrBlock: 192.168.7.0/24
      id: subnet-09b63ae42f80aa441
      isPublic: false
      routeTableId: rtb-02404d929e426dda6
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/internal-elb: "1"
    - availabilityZone: eu-central-1c
      cidrBlock: 192.168.8.0/24
      id: subnet-08749cde1774778a5
      isPublic: false
      routeTableId: rtb-0d7dbd47896d6d4ce
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/internal-elb: "1"
    - availabilityZone: eu-central-1a
      cidrBlock: 100.64.0.0/24
      id: subnet-0f7abb8ed5cc40ffb
      isPublic: true
      routeTableId: rtb-0f97be3d6fb101cf0
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/elb: "1"
    - availabilityZone: eu-central-1b
      cidrBlock: 100.64.1.0/24
      id: subnet-0f4842a43f6d2415f
      isPublic: true
      routeTableId: rtb-008b65abfeeb32993
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/elb: "1"
    - availabilityZone: eu-central-1c
      cidrBlock: 100.64.2.0/24
      id: subnet-0995c47a4483d1ee8
      isPublic: true
      routeTableId: rtb-02af70e75aa7c3ccb
      tags:
        kubernetes.io/cluster/my-aws-workload-cluster: shared
        kubernetes.io/role/elb: "1"
    vpc:
      availabilityZoneSelection: Ordered
      availabilityZoneUsageLimit: 3
      cidrBlock: 192.168.0.0/16
      id: vpc-0a04044e9ae06ad3f
  region: eu-central-1
  sshKeyName: capa
status:
  conditions:
  - lastTransitionTime: "2022-03-21T16:05:04Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-03-21T16:03:44Z"
    status: "True"
    type: ClusterSecurityGroupsReady
  - lastTransitionTime: "2022-03-21T16:05:04Z"
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2022-03-21T16:03:40Z"
    status: "True"
    type: PrincipalCredentialRetrieved
  - lastTransitionTime: "2022-03-21T16:03:40Z"
    status: "True"
    type: PrincipalUsageAllowed
  - lastTransitionTime: "2022-03-21T16:03:42Z"
    status: "True"
    type: SubnetsReady
  - lastTransitionTime: "2022-03-21T16:03:41Z"
    status: "True"
    type: VpcReady
  failureDomains:
    eu-central-1a:
      controlPlane: true
    eu-central-1b:
      controlPlane: true
    eu-central-1c:
      controlPlane: true
  networkStatus:
    apiServerElb:
      attributes:
        idleTimeout: 600000000000
      availabilityZones:
      - eu-central-1b
      - eu-central-1a
      - eu-central-1c
      dnsName: internal-8xztwdw9a0uh573gk73zxw2zu0ig-k8s-72349339.eu-central-1.elb.amazonaws.com
      name: 8xztwdw9a0uh573gk73zxw2zu0ig-k8s
      scheme: internal
      securityGroupIds:
      - sg-035eaf68aaceb3ba1
      subnetIds:
      - subnet-0049a93e4ac5594aa
      - subnet-02cfd2ed9a0bf20aa
      - subnet-042910a41b4206e96
      tags:
        Name: 8xztwdw9a0uh573gk73zxw2zu0ig-k8s
        sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
        sigs.k8s.io/cluster-api-provider-aws/role: apiserver
    securityGroups:
      apiserver-lb:
        id: sg-035eaf68aaceb3ba1
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          toPort: 6443
        name: my-aws-workload-cluster-apiserver-lb
        tags:
          Name: my-aws-workload-cluster-apiserver-lb
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: apiserver-lb
      bastion:
        id: sg-06f109a5744c04fbd
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: SSH
          fromPort: 22
          protocol: tcp
          toPort: 22
        name: my-aws-workload-cluster-bastion
        tags:
          Name: my-aws-workload-cluster-bastion
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: bastion
      controlplane:
        id: sg-083e3d13dc8071736
        ingressRule:
        - description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-035eaf68aaceb3ba1
          - sg-083e3d13dc8071736
          toPort: 6443
        - description: etcd
          fromPort: 2379
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-083e3d13dc8071736
          toPort: 2379
        - description: etcd peer
          fromPort: 2380
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-083e3d13dc8071736
          toPort: 2380
        - description: bgp (calico)
          fromPort: 179
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 179
        - description: IP-in-IP (calico)
          fromPort: 0
          protocol: "4"
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 0
        name: my-aws-workload-cluster-controlplane
        tags:
          Name: my-aws-workload-cluster-controlplane
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: controlplane
      lb:
        id: sg-0aeee9b356f992744
        name: my-aws-workload-cluster-lb
        tags:
          Name: my-aws-workload-cluster-lb
          kubernetes.io/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: lb
      node:
        id: sg-002cec903938d6df0
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Node Port Services
          fromPort: 30000
          protocol: tcp
          toPort: 32767
        - description: Kubelet API
          fromPort: 10250
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 10250
        - description: bgp (calico)
          fromPort: 179
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 179
        - description: IP-in-IP (calico)
          fromPort: 0
          protocol: "4"
          sourceSecurityGroupIds:
          - sg-002cec903938d6df0
          - sg-083e3d13dc8071736
          toPort: 0
        name: my-aws-workload-cluster-node
        tags:
          Name: my-aws-workload-cluster-node
          sigs.k8s.io/cluster-api-provider-aws/cluster/my-aws-workload-cluster: owned
          sigs.k8s.io/cluster-api-provider-aws/role: node
  ready: true

Matthias Lembcke <[email protected]>, Daimler TSS GmbH (Provider Information)

fabriziopandini (Member) commented:
@matlemb thanks.
These are the outcomes of the initial investigation: as of today, the topology reconciler assumes the entire list of subnets is authoritative, and it does not discriminate between fields of each item in the list.
The only workaround that comes to mind is to have all the fields for your subnets in your template; this is not ideal, but at least it should unblock you while we investigate possible solutions (see the sketch below).
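
For reference, a rough sketch of that workaround (values copied from the subnets shown earlier in this issue; every discoverable field, including per-cluster route table IDs and tags, has to be spelled out, which is exactly why it is not practical):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSClusterTemplate
metadata:
  name: my-aws-workload-cluster
  namespace: my-aws-workload-cluster
spec:
  template:
    spec:
      network:
        vpc:
          id: vpc-zzz
        subnets:
        - id: subnet-aaa
          availabilityZone: eu-central-1a
          isPublic: false
          cidrBlock: 192.168.6.0/24
          routeTableId: rtb-eee            # must match what CAPA discovers
          tags:
            kubernetes.io/cluster/my-aws-workload-cluster: shared
            kubernetes.io/role/internal-elb: "1"
        # ...repeated with the full set of fields for every subnet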

@yastij could you kindly help us understand whether this problem (items in a slice that are partially defined in the templates and partially edited by the controllers) applies only to bring-your-own-subnet, or whether it applies to other lists in the APIs as well?

matlemb (Author) commented Mar 21, 2022

@fabriziopandini Thank you too. We already tested your workaround successfully.

yastij (Member) commented Mar 22, 2022

Sorry, I removed my comment as it wasn't finished yet; I was mainly taking notes of potential slices. Now that it's finished: it seems like CAPZ is okay, same for CAPV. I will take a look at other providers tomorrow.

pydctw commented Mar 24, 2022

For CAPA, I was able to reproduce the issue with the bring-your-own-infra use case and observed that cluster creation is successful, but the subnets field keeps oscillating between two states and saturates the controller logs :(

To confirm my understanding: does the issue happen only when the fields are a list and are co-authored by the CAPI and provider controllers?

From the CAPI book, I see the following statement:

you can consider the topology reconciler to be authoritative on all the values under spec. Being authoritative means that the user cannot manually change those values in the object derived from the template in a specific Cluster

The spec.network.subnets field is modified by the CAPA controller in both the CAPA-managed infra and the bring-your-own-infra cases, but I didn't observe the same issue when the subnets field is empty in the AWSClusterTemplate.

AWSClusterTemplate

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSClusterTemplate
metadata:
  name: ec2-clusterclass-v1
  namespace: aws
spec:
  template:
    spec: { }

AWSCluster

spec:
  network:
    ...
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/19
      id: subnet-068231995a7b13c00
      isPublic: true
      natGatewayId: nat-08a70fc169b26a54e
      routeTableId: rtb-0664d096d4bd730ac
      tags:
        Name: ec2-cluster-subnet-public-us-west-1a
        kubernetes.io/cluster/ec2-cluster: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/ec2-cluster: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.64.0/18
      id: subnet-09569af1adfe40387
      isPublic: false
      routeTableId: rtb-0ebf37cc0a877c7b8
      tags:
        Name: ec2-cluster-subnet-private-us-west-1a
        kubernetes.io/cluster/ec2-cluster: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/ec2-cluster: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
       ...

fabriziopandini (Member) commented:
To confirm my understanding: does the issue happen only when the fields are a list and are co-authored by the CAPI and provider controllers?

Yes, but being co-authored by CAPI is a consequence of the fields being defined in the ClusterClass; this should also explain your following observation:

I didn't observe the same issue when the subnets field is empty in the AWSClusterTemplate.

fabriziopandini (Member) commented:
I'm prototyping a fix, but it will take some time...
/assing
/lifecycle active

fabriziopandini (Member) commented:
/assign

fabriziopandini (Member) commented May 10, 2022

After investigating the problem, some prototyping, and discussions, I'm proposing to move the topology controller to server-side apply (SSA) as the way forward (a rough sketch follows the notes below).

Please note that:

  • server-side apply cannot be used for clusterctl topology dry run in its current form. We are planning to continue to support this command, but with some limitations/caveats, while we figure out a long-term solution
  • part of the implementation will be figuring out whether the change to SSA could trigger a rollout on existing clusters, and possible solutions
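
As a rough illustration of the SSA direction (a sketch of the idea, not the actual implementation in #6495): the topology controller would apply only the fields it derives from the template under its own field manager, so fields set by other managers, such as CAPA's routeTableId and tags on each subnet, are left untouched. Note that per-item merging of the subnets list also requires the list to be declared as a map-type list in the CRD schema (x-kubernetes-list-type: map, keyed on something like id); atomic lists are still replaced wholesale.

# Hypothetical apply configuration containing only template-derived fields,
# applied server-side under a dedicated field manager, for example:
#   kubectl apply --server-side --field-manager=capi-topology -f awscluster-intent.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: my-aws-workload-cluster-5tq2q
  namespace: my-aws-workload-cluster
spec:
  network:
    vpc:
      id: vpc-zzz
    subnets:
    - id: subnet-aaa
      availabilityZone: eu-central-1a
    - id: subnet-ddd
      availabilityZone: eu-central-1a
      isPublic: true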

sbueringer (Member) commented Jun 13, 2022

@pydctw The fix was merged; if you have time, it would be really great if you could confirm that it solves the issue CAPA had.

pydctw commented Jun 13, 2022

@pydctw The fix was merged; if you have time, it would be really great if you could confirm that it solves the issue CAPA had.

Thanks! I am happy to verify the fix. Will make changes to CAPA (+ run with CAPI main) and report back the result.
