GKE Cluster with some misconfigurations never reaches READY state but does not provide any error messages #601

Closed
3 tasks done
tylerreidwaze opened this issue Feb 1, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@tylerreidwaze
Collaborator

Checklist

Bug Description

I am creating a GKE cluster, more or less following the provided blueprint, with cnrm.cloud.google.com/remove-default-node-pool set to "true". The cluster is created without any issues and initially includes the default node pool. Once the cluster reaches a ready state, the default node pool is removed. After that removal, my custom node pool is stuck in a pending state and is never added to the cluster, because it is still waiting for the cluster to become ready.

nodepool

> kubectl describe containernodepool dacluster2-primary -n kpt-test-project-demo1 
Events:
  Type     Reason              Age                  From                          Message
  ----     ------              ----                 ----                          -------
  Warning  DependencyNotReady  104s (x14 over 17m)  containernodepool-controller  reference ContainerCluster kpt-test-project-demo1/dacluster2 is not ready

cluster

> kubectl describe containercluster dacluster2 -n kpt-test-project-demo1
Events:
  Type     Reason        Age                From                         Message
  ----     ------        ----               ----                         -------
  Normal   Updating      11m (x3 over 20m)  containercluster-controller  Update in progress
  Normal   UpToDate      11m                containercluster-controller  The resource is up to date
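
For reference, a more direct way to see why the cluster is not READY (a sketch using the names from this issue; it assumes the standard KCC Ready condition under status.conditions):

# The Ready condition's reason/message explains what the controller is waiting on.
> kubectl get containercluster dacluster2 -n kpt-test-project-demo1 \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'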

Additional Diagnostic Information

Kubernetes Cluster Version

kubectl version --short
Client Version: v1.20.8-dispatcher
Server Version: v1.20.10-gke.1600

Config Connector Version

> kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
1.70.0

Config Connector Mode

No result was returned from this call
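
A sketch of how the mode can be read when Config Connector is installed through the operator (this assumes the cluster-scoped ConfigConnector object is present; spec.mode is "cluster" or "namespaced"):

# Assumption: the operator-managed ConfigConnector object exposes spec.mode.
> kubectl get configconnectors.core.cnrm.cloud.google.com -o jsonpath='{.items[0].spec.mode}'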

Log Output

Steps to Reproduce

Create a cluster with the cnrm.cloud.google.com/remove-default-node-pool annotation set to "true" and a custom node pool. Watch the cluster start up in the GCP Console.
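
A sketch of the reproduction with kubectl (cluster.yaml and nodepool.yaml are placeholders for the manifests below; namespace as used in the kubectl output above):

# Apply the cluster and node pool manifests, then watch for readiness.
> kubectl apply -f cluster.yaml -f nodepool.yaml
> kubectl get containercluster,containernodepool -n kpt-test-project-demo1 -w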

YAML snippets

Nodepool config

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: dacluster2-primary 
  namespace: Redacted
  annotations:
    cnrm.cloud.google.com/blueprint: cnrm/gke:gke-nodepool/v0.3.0
    cnrm.cloud.google.com/project-id: Redacted
spec:
  autoscaling:
    maxNodeCount: 3 # kpt-set: ${max-node-count}
    minNodeCount: 1
  clusterRef:
    name: dacluster2 
  # initialNodeCount is per-zone, for regional clusters
  initialNodeCount: 1
  location: us-central1 
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsPerNode: 64
  nodeConfig:
    labels:
      gke.io/nodepool: primary 
    diskSizeGb: 100
    diskType: pd-ssd
    machineType: e2-standard-16
    oauthScopes:
      - gke-default
    serviceAccountRef:
      name: Redacted

Cluster Config

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata: 
  name: dacluster2 # kpt-set: ${cluster-name}
  namespace: Redacted
  labels:
    gke.io/environment: dev
  annotations:
    cnrm.cloud.google.com/blueprint: cnrm/gke:gke-cluster/v0.3.0
    cnrm.cloud.google.com/project-id: Redacted
    # Remove the default node pool after bootstrapping.
    # Explicit node pool configuration allows for more isolation and makes it
    # easier to replace node pools to change immutable fields.
    cnrm.cloud.google.com/remove-default-node-pool: "true"
spec:
  addonsConfig:
    dnsCacheConfig:
      enabled: true
    gcePersistentDiskCsiDriverConfig:
      enabled: true
    networkPolicyConfig:
      disabled: false
  enableBinaryAuthorization: true
  enableShieldedNodes: true
  initialNodeCount: 1

  ipAllocationPolicy:
    clusterSecondaryRangeName: pods
    servicesSecondaryRangeName: services
  location: us-central1 
  masterAuthorizedNetworksConfig:
    cidrBlocks:
      - cidrBlock: 0.0.0.0/0
        displayName: Whole Internet
  networkRef:
    external: Redacted
  privateClusterConfig:
    enablePrivateEndpoint: false
    enablePrivateNodes: false
    masterGlobalAccessConfig:
      enabled: true
  releaseChannel:
    channel: REGULAR 
  subnetworkRef:
    name: us-central1-abitofdanet # kpt-set: ${cloud-region}-${network-host-subnet-name} 
    namespace: Redacted
  verticalPodAutoscaling:
    enabled: true
  workloadIdentityConfig:
    identityNamespace: redacted
@tylerreidwaze tylerreidwaze added the bug Something isn't working label Feb 1, 2022
@tylerreidwaze
Collaborator Author

tylerreidwaze commented Feb 1, 2022

It's worth noting that my cluster reaches the UpToDate state, but it never becomes READY.

Events:
  Type     Reason        Age                 From                         Message
  ----     ------        ----                ----                         -------
  Warning  UpdateFailed  49m                 containercluster-controller  Redacted
  Normal   Updating      91s (x15 over 49m)  containercluster-controller  Update in progress
  Normal   UpToDate      90s (x14 over 44m)  containercluster-controller  The resource is up to date
> kubectl get containercluster dacluster5 -n kpt-test-project-demo1
I0201 12:45:57.460202  477241 request.go:655] Throttling request took 1.17739692s, request: GET:https://34.132.193.174/apis/networking.gke.io/v1?timeout=32s
NAME         AGE    READY   STATUS     STATUS AGE

@tylerreidwaze
Collaborator Author

So this was eventually due to a configuration conflict. I wanted to have public IPs, but I had a config section that contained some of the settings for a private cluster. As a result, the cluster could never reach a ready state. However, I would have expected a more detailed error message, or possibly some issues raised by Config Sync.

The config bits I removed

  privateClusterConfig:
    enablePrivateEndpoint: false
    enablePrivateNodes: false
    masterGlobalAccessConfig:
      enabled: true

As far as I am concerned, the issue is resolved for me. There is a less urgent follow-up request here to improve the error messaging for an issue like this. Let me know if you need more details.
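
A sketch of the verification after dropping that block (names from this issue; kubectl wait works because KCC resources expose a Ready condition):

# Re-apply the cluster without privateClusterConfig and wait for READY.
> kubectl apply -f cluster.yaml
> kubectl wait --for=condition=Ready containercluster/dacluster2 -n kpt-test-project-demo1 --timeout=30m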

@tylerreidwaze tylerreidwaze changed the title Nodepool Failing to Create in Cluster without Default Node Pool GKE Cluster with some misconfigurations never reaches READY state but does not provide any error messages Feb 2, 2022
@karlkfi

karlkfi commented Feb 2, 2022

So with that config removed the cluster became ready with the default pool deleted?

@tylerreidwaze
Collaborator Author

I actually added the default pool back because I was trying to limit the issues associated with it. I was able to create the cluster, but I left the default pool in place.

I will likely remove that default pool and try again in a few days. Will report back

@karlkfi

karlkfi commented Feb 2, 2022

If leaving the default pool up solves the problem, then the root cause probably still needs fixing in KCC.

@tylerreidwaze
Collaborator Author

The cluster also started without the default node pool.

@diviner524
Collaborator

Confirmed with the customer that this is no longer impacting them after removing privateClusterConfig.
