GKE autopilot cluster refresh error #355

Closed
selfuryon opened this issue Aug 10, 2023 · 11 comments · Fixed by upbound/configuration-caas#4
Labels: bug, community, is:triaged

@selfuryon

What happened?

Hello!
I tried to create a new GKE Autopilot cluster via Crossplane. The cluster was created successfully, but afterwards Crossplane cannot refresh its state and reports Synced: False.

How can we reproduce it?

I used this manifest:

apiVersion: container.gcp.upbound.io/v1beta1
kind: Cluster
metadata:
  name: management
spec:
  providerConfigRef:
    name: infra
  forProvider:
    enableAutopilot: true
    releaseChannel:
      - channel: RAPID
    location: europe-north1
    networkRef:
      name: main
    subnetworkRef:
      name: kubernetes-europe-north1
    privateClusterConfig:
      - enablePrivateNodes: true
        masterIpv4CidrBlock: 10.248.0.0/28
        masterGlobalAccessConfig:
          - enabled: true
    maintenancePolicy:
      - dailyMaintenanceWindow:
          - startTime: "13:00"
    ipAllocationPolicy:
      - servicesSecondaryRangeName: service-network
        clusterSecondaryRangeName: pod-network

Crossplane created the cluster, but afterwards I got this:

Name:         management
Namespace:
Labels:       <none>
Annotations:  crossplane.io/external-name: management
API Version:  container.gcp.upbound.io/v1beta1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2023-08-09T21:08:08Z
  Finalizers:
    finalizer.managedresource.crossplane.io
  Generation:        7
  Resource Version:  89094
  UID:               1d1bdd6a-6289-4035-bf3e-0fb2fc310d98
Spec:
  Deletion Policy:  Delete
  For Provider:
    Addons Config:
      Gce Persistent Disk Csi Driver Config:
        Enabled:  true
    Binary Authorization:
    Database Encryption:
      State:            DECRYPTED
    Datapath Provider:  ADVANCED_DATAPATH
    Default Snat Status:
    Dns Config:
      Cluster Dns:         CLOUD_DNS
      Cluster Dns Domain:  cluster.local
      Cluster Dns Scope:   CLUSTER_SCOPE
    Enable Autopilot:      true
    Gateway API Config:
      Channel:  CHANNEL_STANDARD
    Ip Allocation Policy:
      Cluster Secondary Range Name:   pod-network
      Services Secondary Range Name:  service-network
    Location:                         europe-north1
    Logging Config:
      Enable Components:
        SYSTEM_COMPONENTS
        WORKLOADS
    Logging Service:  logging.googleapis.com/kubernetes
    Maintenance Policy:
      Daily Maintenance Window:
        Start Time:  13:00
    Monitoring Config:
      Enable Components:
        SYSTEM_COMPONENTS
      Managed Prometheus:
        Enabled:         true
    Monitoring Service:  monitoring.googleapis.com/kubernetes
    Network:             https://www.googleapis.com/compute/v1/projects/etherno-infra/global/networks/main
    Network Ref:
      Name:           main
    Networking Mode:  VPC_NATIVE
    Node Locations:
      europe-north1-a
      europe-north1-b
      europe-north1-c
    Node Pool Defaults:
      Node Config Defaults:
        Logging Variant:  DEFAULT
    Notification Config:
      Pubsub:
    Private Cluster Config:
      Enable Private Nodes:  true
      masterIpv4CidrBlock:   10.248.0.0/28
    Project:                 ***
    Release Channel:
      Channel:  RAPID
    Service External Ips Config:
    Subnetwork:  https://www.googleapis.com/compute/v1/projects/etherno-infra/regions/europe-north1/subnetworks/kubernetes-europe-north1
    Subnetwork Ref:
      Name:  kubernetes-europe-north1
    Vertical Pod Autoscaling:
      Enabled:  true
  Init Provider:
  Management Policies:
    *
  Provider Config Ref:
    Name:  infra
Status:
  At Provider:
  Conditions:
    Last Transition Time:  2023-08-10T09:05:33Z
    Message:               observe failed: cannot run refresh: refresh failed: Missing required argument: The argument "disabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
    Reason:  ReconcileError
    Status:  False
    Type:    Synced
Events:
  Type     Reason                         Age                From                                                    Message
  ----     ------                         ----               ----                                                    -------
  Warning  CannotObserveExternalResource  9h (x37 over 11h)  managed/container.gcp.upbound.io/v1beta1, kind=cluster  cannot run refresh: refresh failed: Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "disabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
  Warning  CannotObserveExternalResource  7m26s (x60 over 11h)  managed/container.gcp.upbound.io/v1beta1, kind=cluster  cannot run refresh: refresh failed: Missing required argument: The argument "disabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
  Warning  CannotObserveExternalResource  117s (x55 over 11h)  managed/container.gcp.upbound.io/v1beta1, kind=cluster  cannot run refresh: refresh failed: Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "disabled" is required, but no definition was found.

What environment did it happen in?

  • Crossplane Version: v1.13.2
  • Provider Version: v0.35.0
  • Kubernetes Version: v1.27.3
  • Kubernetes Distribution: GKE Autopilot
selfuryon added the bug and needs:triage labels on Aug 10, 2023
@turkenf
Collaborator

turkenf commented Aug 10, 2023

Hi @selfuryon,

Thank you for raising this issue. I can reproduce the problem with a simple example manifest:

apiVersion: container.gcp.upbound.io/v1beta1
kind: Cluster
metadata:
  annotations:
    meta.upbound.io/example-id: container/v1beta1/cluster
  labels:
    testing.upbound.io/example-name: cluster
  name: cluster
spec:
  forProvider:
    location: europe-north1
    ipAllocationPolicy:
      - {}
    enableAutopilot: true

I get the following error:

    message: |-
      observe failed: cannot run refresh: refresh failed: Missing required argument: The argument "enabled" is required, but no definition was found.
      Missing required argument: The argument "enabled" is required, but no definition was found.
      Missing required argument: The argument "issue_client_certificate" is required, but no definition was found.
      Missing required argument: The argument "enabled" is required, but no definition was found.
      Missing required argument: The argument "disabled" is required, but no definition was found.
    reason: ReconcileError
    status: "False"
    type: Synced
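
Because the message repeats the same argument name several times without saying which block it belongs to, a quick tally can help show how many blocks are affected. The following is only a minimal sketch: the names `MISSING_ARG` and `missing_arguments` are hypothetical helpers, not part of Crossplane or the provider, and the message text is copied from the condition above.

```python
import re
from collections import Counter

# Matches the 'The argument "X" is required' part of each error line.
MISSING_ARG = re.compile(r'The argument "([^"]+)" is required')


def missing_arguments(message: str) -> Counter:
    """Count how often each argument name is reported as missing."""
    return Counter(MISSING_ARG.findall(message))


# Condition message copied from the reproduction above.
message = """observe failed: cannot run refresh: refresh failed: Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "issue_client_certificate" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "disabled" is required, but no definition was found."""

counts = missing_arguments(message)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

The tally only says how many blocks to look for (here, three with `enabled`, one with `disabled`, one with `issue_client_certificate`); the error text itself carries no path, so it cannot say which blocks they are.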

@calavelas

issue_client_certificate comes from the masterAuth spec. After I added the following, that error went away:

          forProvider:
            masterAuth:
              - clientCertificate: ''
                clientCertificateConfig:
                  - issueClientCertificate: false

The rest are hard to track down, since many blocks in the spec contain an enabled field.

@calavelas

calavelas commented Aug 10, 2023

Here are the candidate fields, taken from a cluster I created manually and then had Crossplane observe:

status:
  atProvider:
    addonsConfig:
      - dnsCacheConfig:
          - enabled: true
        gcePersistentDiskCsiDriverConfig:
          - enabled: true
        gcpFilestoreCsiDriverConfig:
          - enabled: true
    binaryAuthorization:
      - enabled: false
        evaluationMode: DISABLED
    clusterAutoscaling:
      - autoProvisioningDefaults:
        enabled: true
    defaultSnatStatus:
      - disabled: false
    ipAllocationPolicy:
      - podCidrOverprovisionConfig:
          - disabled: false
    monitoringConfig:
      - managedPrometheus:
          - enabled: true
    networkPolicy:
      - enabled: false
    notificationConfig:
      - pubsub:
          - enabled: false
            topic: ''
    privateClusterConfig:
      - enablePrivateEndpoint: false
        enablePrivateNodes: false
        masterGlobalAccessConfig:
          - enabled: false
    serviceExternalIpsConfig:
      - enabled: false
    verticalPodAutoscaling:
      - enabled: true

@selfuryon
Author

I found the minimal working config for me:

apiVersion: container.gcp.upbound.io/v1beta1
kind: Cluster
metadata:
  name: management
spec:
  forProvider:
    enableAutopilot: true
    releaseChannel:
      - channel: RAPID
    networkRef:
      name: main
    subnetworkRef:
      name: kubernetes-europe-north1
    location: europe-north1
    maintenancePolicy:
      - dailyMaintenanceWindow:
          - startTime: "13:00"
    ipAllocationPolicy:
      - servicesSecondaryRangeName: service-network
        clusterSecondaryRangeName: pod-network
    defaultSnatStatus:
      - disabled: false
    masterAuth:
      - clientCertificateConfig:
          - issueClientCertificate: false
    serviceExternalIpsConfig:
      - enabled: false
    notificationConfig:
      - pubsub:
          - enabled: false
    privateClusterConfig:
      - enablePrivateNodes: true
        masterIpv4CidrBlock: 10.248.0.0/28
  deletionPolicy: Orphan

So it requires defaultSnatStatus, serviceExternalIpsConfig, notificationConfig and masterAuth sections.
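
The four sections named above can also be added programmatically when manifests are generated. This is a rough sketch under stated assumptions: `WORKAROUND_DEFAULTS` and `with_workaround` are hypothetical names, the values are the ones from the working config above, and `forProvider` is assumed to be available as a plain Python dict.

```python
import copy
import json

# Explicit values for the four sections, taken from the working config above.
WORKAROUND_DEFAULTS = {
    "defaultSnatStatus": [{"disabled": False}],
    "masterAuth": [
        {"clientCertificateConfig": [{"issueClientCertificate": False}]}
    ],
    "serviceExternalIpsConfig": [{"enabled": False}],
    "notificationConfig": [{"pubsub": [{"enabled": False}]}],
}


def with_workaround(for_provider: dict) -> dict:
    """Return a copy of forProvider with the four sections filled in.

    Sections that are already present are left untouched.
    """
    patched = copy.deepcopy(for_provider)
    for section, value in WORKAROUND_DEFAULTS.items():
        patched.setdefault(section, copy.deepcopy(value))
    return patched


original = {"enableAutopilot": True, "location": "europe-north1"}
patched = with_workaround(original)
print(json.dumps(patched, indent=2))
```

Note that `setdefault` only fills in sections that are missing entirely. A section that already exists but is empty would be left as it is, so this sketch is meant for building a manifest before it is applied.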

@turkenf
Collaborator

turkenf commented Aug 14, 2023

I tested Cluster.container with v0.34.0 and the resource was created successfully. The resulting spec:

spec:
  deletionPolicy: Delete
  forProvider:
    addonsConfig:
    - gcePersistentDiskCsiDriverConfig:
      - enabled: true
    binaryAuthorization:
    - {}
    clusterAutoscaling:
    - autoProvisioningDefaults:
      - imageType: COS_CONTAINERD
        management:
        - autoRepair: true
          autoUpgrade: true
        oauthScopes:
        - https://www.googleapis.com/auth/devstorage.read_only
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/monitoring
        - https://www.googleapis.com/auth/service.management.readonly
        - https://www.googleapis.com/auth/servicecontrol
        - https://www.googleapis.com/auth/trace.append
        serviceAccount: default
        upgradeSettings:
        - maxSurge: 1
          strategy: SURGE
    databaseEncryption:
    - state: DECRYPTED
    datapathProvider: ADVANCED_DATAPATH
    defaultSnatStatus:
    - disabled: false
    dnsConfig:
    - clusterDns: CLOUD_DNS
      clusterDnsDomain: cluster.local
      clusterDnsScope: CLUSTER_SCOPE
    enableAutopilot: true
    gatewayApiConfig:
    - channel: CHANNEL_STANDARD
    ipAllocationPolicy:
    - {}
    location: europe-north1
    loggingConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      - WORKLOADS
    loggingService: logging.googleapis.com/kubernetes
    masterAuth:
    - clientCertificateConfig:
      - issueClientCertificate: false
    monitoringConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      managedPrometheus:
      - enabled: true
    monitoringService: monitoring.googleapis.com/kubernetes
    network: projects/official-provider-testing/global/networks/default
    networkingMode: VPC_NATIVE
    nodeConfig:
    - diskSizeGb: 100
      diskType: pd-standard
      imageType: COS_CONTAINERD
      loggingVariant: DEFAULT
      machineType: e2-small
      metadata:
        disable-legacy-endpoints: "true"
      oauthScopes:
      - https://www.googleapis.com/auth/devstorage.read_only
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/monitoring
      - https://www.googleapis.com/auth/service.management.readonly
      - https://www.googleapis.com/auth/servicecontrol
      - https://www.googleapis.com/auth/trace.append
      reservationAffinity:
      - consumeReservationType: NO_RESERVATION
      serviceAccount: default
      shieldedInstanceConfig:
      - enableIntegrityMonitoring: true
        enableSecureBoot: true
      taint:
      - effect: NO_SCHEDULE
        key: cloud.google.com/gke-quick-remove
        value: "true"
      workloadMetadataConfig:
      - mode: GKE_METADATA
    nodeLocations:
    - europe-north1-a
    - europe-north1-b
    - europe-north1-c
    nodePoolDefaults:
    - nodeConfigDefaults:
      - loggingVariant: DEFAULT
    notificationConfig:
    - pubsub:
      - enabled: false
    privateClusterConfig:
    - masterGlobalAccessConfig:
      - enabled: false
    project: official-provider-testing
    releaseChannel:
    - channel: REGULAR
    serviceExternalIpsConfig:
    - enabled: false
    subnetwork: projects/official-provider-testing/regions/europe-north1/subnetworks/default
    verticalPodAutoscaling:
    - enabled: true
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default

But when I try with v0.35.0, I get an error. The resulting spec:

spec:
  deletionPolicy: Delete
  forProvider:
    addonsConfig:
    - gcePersistentDiskCsiDriverConfig:
      - enabled: true
    binaryAuthorization:
    - {}
    clusterAutoscaling:
    - autoProvisioningDefaults:
      - imageType: COS_CONTAINERD
        management:
        - autoRepair: true
          autoUpgrade: true
        oauthScopes:
        - https://www.googleapis.com/auth/devstorage.read_only
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/monitoring
        - https://www.googleapis.com/auth/service.management.readonly
        - https://www.googleapis.com/auth/servicecontrol
        - https://www.googleapis.com/auth/trace.append
        serviceAccount: default
        upgradeSettings:
        - maxSurge: 1
          strategy: SURGE
    databaseEncryption:
    - state: DECRYPTED
    datapathProvider: ADVANCED_DATAPATH
    defaultSnatStatus:
    - {}
    dnsConfig:
    - clusterDns: CLOUD_DNS
      clusterDnsDomain: cluster.local
      clusterDnsScope: CLUSTER_SCOPE
    enableAutopilot: true
    gatewayApiConfig:
    - channel: CHANNEL_STANDARD
    ipAllocationPolicy:
    - {}
    location: europe-north1
    loggingConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      - WORKLOADS
    loggingService: logging.googleapis.com/kubernetes
    masterAuth:
    - clientCertificateConfig:
      - {}
    monitoringConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      managedPrometheus:
      - enabled: true
    monitoringService: monitoring.googleapis.com/kubernetes
    network: projects/official-provider-testing/global/networks/default
    networkingMode: VPC_NATIVE
    nodeLocations:
    - europe-north1-a
    - europe-north1-b
    - europe-north1-c
    nodePoolDefaults:
    - nodeConfigDefaults:
      - loggingVariant: DEFAULT
    notificationConfig:
    - pubsub:
      - {}
    privateClusterConfig:
    - masterGlobalAccessConfig:
      - {}
    project: official-provider-testing
    releaseChannel:
    - channel: REGULAR
    serviceExternalIpsConfig:
    - {}
    subnetwork: projects/official-provider-testing/regions/europe-north1/subnetworks/default
    verticalPodAutoscaling:
    - enabled: true
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default

There was no native provider bump between these two versions. The diff between them:

[Two screenshots of the diff, dated 2023-08-14; images not available]
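
Since the screenshots are not reproduced here, the relevant difference can be recovered from the two specs above by listing the blocks that are populated under v0.34.0 but empty (`- {}`) under v0.35.0. The sketch below is only illustrative: `empty_blocks` is a hypothetical helper, and the two dicts are hand-copied excerpts of the specs above rather than full manifests.

```python
def empty_blocks(node, path=""):
    """Yield the path of every empty mapping ({}) found under node."""
    if isinstance(node, dict):
        if not node and path:
            yield path
        for key, value in node.items():
            yield from empty_blocks(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from empty_blocks(item, f"{path}[{index}]")


# Excerpt of forProvider as late-initialized by v0.34.0 (see spec above).
v0_34 = {
    "binaryAuthorization": [{}],
    "defaultSnatStatus": [{"disabled": False}],
    "ipAllocationPolicy": [{}],
    "masterAuth": [{"clientCertificateConfig": [{"issueClientCertificate": False}]}],
    "notificationConfig": [{"pubsub": [{"enabled": False}]}],
    "privateClusterConfig": [{"masterGlobalAccessConfig": [{"enabled": False}]}],
    "serviceExternalIpsConfig": [{"enabled": False}],
}

# The same excerpt as late-initialized by v0.35.0 (see spec above).
v0_35 = {
    "binaryAuthorization": [{}],
    "defaultSnatStatus": [{}],
    "ipAllocationPolicy": [{}],
    "masterAuth": [{"clientCertificateConfig": [{}]}],
    "notificationConfig": [{"pubsub": [{}]}],
    "privateClusterConfig": [{"masterGlobalAccessConfig": [{}]}],
    "serviceExternalIpsConfig": [{}],
}

# Blocks that are empty in v0.35.0 but were not empty in v0.34.0.
before = set(empty_blocks(v0_34))
newly_empty = [p for p in empty_blocks(v0_35) if p not in before]
for p in newly_empty:
    print(p)
```

In these excerpts, five blocks lose their only field, and each lost field is a boolean set to false: three `enabled`, one `disabled`, and one `issueClientCertificate`. That matches the count of missing arguments in the error reported for the reproduction earlier in the thread.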

@turkenh
Collaborator

turkenh commented Aug 14, 2023

@lsviben could this be a bug related to the GMP changes (e.g. a change in late-init behavior caused by API changes such as required -> optional)?

@turkenf
Collaborator

turkenf commented Aug 14, 2023

I also hit a similar issue with the Account.storage resource in provider-azure between v0.34.0 and v0.35.0. The diff:

[Screenshot of the diff, dated 2023-08-14; image not available]

Error message:

    message: |-
      observe failed: cannot run refresh: refresh failed: Missing required argument: The argument "write" is required, but no definition was found.
      Missing required argument: The argument "delete" is required, but no definition was found.
      Missing required argument: The argument "read" is required, but no definition was found.
      Missing required argument: The argument "enabled" is required, but no definition was found.

@haarchri
Member

Same issue on my side with a GKE cluster:

  Warning  CannotObserveExternalResource    2m29s               managed/container.gcp.upbound.io/v1beta1, kind=cluster  cannot run refresh: refresh failed: Missing required argument: The argument "issue_client_certificate" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.
Missing required argument: The argument "disabled" is required, but no definition was found.
Missing required argument: The argument "enabled" is required, but no definition was found.

With the following manifest:

kubectl get Cluster.container.gcp.upbound.io gcp-spoke-01-jzb8c-jzc56 -o yaml
apiVersion: container.gcp.upbound.io/v1beta1
kind: Cluster
metadata:
  annotations:
    crossplane.io/composition-resource-name: gke-cluster
    crossplane.io/external-create-succeeded: "2023-08-16T18:31:47Z"
    crossplane.io/external-name: gcp-spoke-01-jzb8c-jzc56
    upjet.crossplane.io/provider-meta: '{"e2bfb730-ecaa-11e6-8f88-34363bc7c4c0":{"create":2400000000000,"delete":2400000000000,"read":2400000000000,"update":3600000000000},"schema_version":"1"}'
  creationTimestamp: "2023-08-16T18:25:53Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: gcp-spoke-01-jzb8c-
  generation: 6
  labels:
    crossplane.io/claim-name: gcp-spoke-01
    crossplane.io/claim-namespace: default
    crossplane.io/composite: gcp-spoke-01-jzb8c
  name: gcp-spoke-01-jzb8c-jzc56
  ownerReferences:
  - apiVersion: gcp.caas.upbound.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: XGKE
    name: gcp-spoke-01-jzb8c-thbgp
    uid: 4fdd5aa0-275d-4406-83cc-60d1bb082b2a
  resourceVersion: "163973"
  uid: 0e9c7e81-b1ee-4674-925d-6e578443172e
spec:
  deletionPolicy: Delete
  forProvider:
    addonsConfig:
    - gcePersistentDiskCsiDriverConfig:
      - enabled: true
    binaryAuthorization:
    - {}
    clusterAutoscaling:
    - {}
    databaseEncryption:
    - state: DECRYPTED
    defaultSnatStatus:
    - {}
    enableIntranodeVisibility: true
    initialNodeCount: 1
    ipAllocationPolicy:
    - clusterSecondaryRangeName: pods
      servicesSecondaryRangeName: services
    location: europe-west3
    loggingConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      - WORKLOADS
    loggingService: logging.googleapis.com/kubernetes
    masterAuth:
    - clientCertificateConfig:
      - {}
    monitoringConfig:
    - enableComponents:
      - SYSTEM_COMPONENTS
      managedPrometheus:
      - enabled: true
    monitoringService: monitoring.googleapis.com/kubernetes
    network: https://www.googleapis.com/compute/v1/projects/crossplane-playground/global/networks/gcp-spoke-01
    networkRef:
      name: gcp-spoke-01
    networkSelector:
      matchLabels:
        networks.gcp.caas.upbound.io/network-id: gcp-spoke-01
    networkingMode: VPC_NATIVE
    nodeConfig:
    - serviceAccount: [email protected]
    nodeLocations:
    - europe-west3-a
    - europe-west3-b
    - europe-west3-c
    nodePoolDefaults:
    - nodeConfigDefaults:
      - loggingVariant: DEFAULT
    notificationConfig:
    - pubsub:
      - {}
    privateClusterConfig:
    - masterGlobalAccessConfig:
      - {}
    project: crossplane-playground
    releaseChannel:
    - channel: REGULAR
    serviceExternalIpsConfig:
    - {}
    subnetwork: https://www.googleapis.com/compute/v1/projects/crossplane-playground/regions/europe-west3/subnetworks/gcp-spoke-01-jzb8c-j6fgb
    subnetworkRef:
      name: gcp-spoke-01-jzb8c-j6fgb
    subnetworkSelector:
      matchLabels:
        networks.gcp.caas.upbound.io/network-id: gcp-spoke-01
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: default
  writeConnectionSecretToRef:
    name: 4fdd5aa0-275d-4406-83cc-60d1bb082b2a-gkecluster
    namespace: upbound-system
status:
  atProvider: {}
  conditions:
  - lastTransitionTime: "2023-08-16T18:45:13Z"
    message: |-
      observe failed: cannot run refresh: refresh failed: Missing required argument: The argument "enabled" is required, but no definition was found.
      Missing required argument: The argument "enabled" is required, but no definition was found.
      Missing required argument: The argument "disabled" is required, but no definition was found.
      Missing required argument: The argument "issue_client_certificate" is required, but no definition was found.
      Missing required argument: The argument "enabled" is required, but no definition was found.
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2023-08-16T18:31:47Z"
    reason: Creating
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-08-16T18:38:01Z"
    reason: Success
    status: "True"
    type: LastAsyncOperation
  - lastTransitionTime: "2023-08-16T18:38:01Z"
    reason: Finished
    status: "True"
    type: AsyncOperation

@haarchri
Member

I set all of these fields for my cluster, and the resource then became Synced and Ready (both True):

            loggingConfig:
              - enableComponents:
                - SYSTEM_COMPONENTS
                - WORKLOADS
            monitoringConfig:
            - enableComponents:
              - SYSTEM_COMPONENTS
              managedPrometheus:
              - enabled: true
            masterAuth:
              - clientCertificateConfig:
                  - issueClientCertificate: false
            addonsConfig:
              - dnsCacheConfig:
                  - enabled: true
                gcePersistentDiskCsiDriverConfig:
                  - enabled: true
                gcpFilestoreCsiDriverConfig:
                  - enabled: true
            binaryAuthorization:
              - enabled: false
            clusterAutoscaling:
              - autoProvisioningDefaults:
                enabled: true
            defaultSnatStatus:
              - disabled: false
            ipAllocationPolicy:
              - podCidrOverprovisionConfig:
                  - disabled: false
            networkPolicy:
              - enabled: false
            notificationConfig:
              - pubsub:
                  - enabled: false
                    topic: ''
            privateClusterConfig:
              - enablePrivateEndpoint: false
                enablePrivateNodes: false
                masterGlobalAccessConfig:
                  - enabled: false
            serviceExternalIpsConfig:
              - enabled: false
            verticalPodAutoscaling:
              - enabled: true

@lsviben
Contributor

lsviben commented Aug 17, 2023

It looks like the issue was introduced in crossplane/upjet#237, together with initProvider.

All fields that used to be required and are also present in initProvider are now optional, so they also received the omitempty tag. That appears to interfere with the late-initialization of those fields. Notice that the affected fields are all bool fields with the value false. Kudos to @turkenh for figuring it out!

I made a PR that fixes the behaviour by reverting the omitempty addition for previously required fields. I tested it with the cluster example and it works.

Hopefully we will merge it soon and cut some patch releases.
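
To make the explanation above more concrete, here is a toy model in Python. It is not upjet's code, and the rule it encodes is an assumption drawn from the description above: during late-initialization, a field tagged omitempty is skipped when the observed value is a zero value such as false. The function `late_initialize` and its parameters are hypothetical.

```python
def late_initialize(spec: dict, observed: dict, omitempty_fields: set) -> dict:
    """Copy observed values into spec for fields the user left unset.

    Toy rule (an assumption, not upjet's implementation): a field tagged
    omitempty is skipped when its observed value is a zero value.
    """
    result = dict(spec)
    for name, value in observed.items():
        if name in result:
            continue  # values set by the user are never overwritten
        if name in omitempty_fields and not value:
            continue  # zero value plus omitempty: the field stays unset
        result[name] = value
    return result


# The provider reports the block with a false boolean, as in the thread.
observed = {"enabled": False}

# Field tagged omitempty: the block stays empty, like "- {}" in the spec.
with_tag = late_initialize({}, observed, omitempty_fields={"enabled"})

# Field not tagged omitempty: the false value is filled in.
without_tag = late_initialize({}, observed, omitempty_fields=set())

print(with_tag)
print(without_tag)
```

Under this model, an observed `enabled: false` never reaches the spec while the field carries the tag, which would leave the block empty and the required argument undefined on the next refresh. A value of true would not be affected, which fits the observation that only false booleans went missing.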

@lsviben
Contributor

lsviben commented Aug 24, 2023

Fixed in the newest provider releases! https://github.com/upbound/provider-gcp/releases/tag/v0.36.0
