GKE Cluster with some misconfigurations never reaches READY state but does not provide any error messages #601

Closed
3 tasks done
tylerreidwaze opened this issue Feb 1, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@tylerreidwaze
Collaborator

Checklist

Bug Description

I am creating a GKE cluster, more or less following the provided blueprint, with cnrm.cloud.google.com/remove-default-node-pool set to "true". The cluster is created without any issues and initially includes the default node pool. Once the cluster reaches a ready state, the default node pool is removed. After that removal, my custom node pool is stuck in a pending state and is never added to the cluster, because it is still waiting for the cluster to become ready.

nodepool

> kubectl describe containernodepool dacluster2-primary -n kpt-test-project-demo1 
Events:
  Type     Reason              Age                  From                          Message
  ----     ------              ----                 ----                          -------
  Warning  DependencyNotReady  104s (x14 over 17m)  containernodepool-controller  reference ContainerCluster kpt-test-project-demo1/dacluster2 is not ready

cluster

> kubectl describe containercluster dacluster2 -n kpt-test-project-demo1
Events:
  Type     Reason        Age                From                         Message
  ----     ------        ----               ----                         -------
  Normal   Updating      11m (x3 over 20m)  containercluster-controller  Update in progress
  Normal   UpToDate      11m                containercluster-controller  The resource is up to date
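
For reference, a more direct way to see why the cluster is not READY (a sketch using the names from this issue; it assumes the standard KCC Ready condition under status.conditions):

# The Ready condition's reason/message explains what the controller is waiting on.
> kubectl get containercluster dacluster2 -n kpt-test-project-demo1 \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'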

Additional Diagnostic Information

Kubernetes Cluster Version

kubectl version --short
Client Version: v1.20.8-dispatcher
Server Version: v1.20.10-gke.1600

Config Connector Version

> kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
1.70.0

Config Connector Mode

No result was returned from this call
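
A sketch of how the mode can be read when Config Connector is installed through the operator (this assumes the cluster-scoped ConfigConnector object is present; spec.mode is "cluster" or "namespaced"):

# Assumption: the operator-managed ConfigConnector object exposes spec.mode.
> kubectl get configconnectors.core.cnrm.cloud.google.com -o jsonpath='{.items[0].spec.mode}'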

Log Output

Steps to Reproduce

Create a cluster with the cnrm.cloud.google.com/remove-default-node-pool annotation set to "true" and a custom node pool. Watch the cluster start up in the GCP Console.
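
A sketch of the reproduction with kubectl (cluster.yaml and nodepool.yaml are placeholders for the manifests below; namespace as used in the kubectl output above):

# Apply the cluster and node pool manifests, then watch for readiness.
> kubectl apply -f cluster.yaml -f nodepool.yaml
> kubectl get containercluster,containernodepool -n kpt-test-project-demo1 -w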

YAML snippets

Nodepool config

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: dacluster2-primary 
  namespace: Redacted
  annotations:
    cnrm.cloud.google.com/blueprint: cnrm/gke:gke-nodepool/v0.3.0
    cnrm.cloud.google.com/project-id: Redacted
spec:
  autoscaling:
    maxNodeCount: 3 # kpt-set: ${max-node-count}
    minNodeCount: 1
  clusterRef:
    name: dacluster2 
  # initialNodeCount is per-zone, for regional clusters
  initialNodeCount: 1
  location: us-central1 
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsPerNode: 64
  nodeConfig:
    labels:
      gke.io/nodepool: primary 
    diskSizeGb: 100
    diskType: pd-ssd
    machineType: e2-standard-16
    oauthScopes:
      - gke-default
    serviceAccountRef:
      name: Redacted

Cluster Config

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata: 
  name: dacluster2 # kpt-set: ${cluster-name}
  namespace: Redacted
  labels:
    gke.io/environment: dev
  annotations:
    cnrm.cloud.google.com/blueprint: cnrm/gke:gke-cluster/v0.3.0
    cnrm.cloud.google.com/project-id: Redacted
    # Remove the default node pool after bootstrapping.
    # Explicit node pool configuration allows for more isolation and makes it
    # easier to replace node pools to change immutable fields.
    cnrm.cloud.google.com/remove-default-node-pool: "true"
spec:
  addonsConfig:
    dnsCacheConfig:
      enabled: true
    gcePersistentDiskCsiDriverConfig:
      enabled: true
    networkPolicyConfig:
      disabled: false
  enableBinaryAuthorization: true
  enableShieldedNodes: true
  initialNodeCount: 1

  ipAllocationPolicy:
    clusterSecondaryRangeName: pods
    servicesSecondaryRangeName: services
  location: us-central1 
  masterAuthorizedNetworksConfig:
    cidrBlocks:
      - cidrBlock: 0.0.0.0/0
        displayName: Whole Internet
  networkRef:
    external: Redacted
  privateClusterConfig:
    enablePrivateEndpoint: false
    enablePrivateNodes: false
    masterGlobalAccessConfig:
      enabled: true
  releaseChannel:
    channel: REGULAR 
  subnetworkRef:
    name: us-central1-abitofdanet # kpt-set: ${cloud-region}-${network-host-subnet-name} 
    namespace: Redacted
  verticalPodAutoscaling:
    enabled: true
  workloadIdentityConfig:
    identityNamespace: redacted
@tylerreidwaze tylerreidwaze added the bug Something isn't working label Feb 1, 2022
@tylerreidwaze
Collaborator Author

tylerreidwaze commented Feb 1, 2022

It's worth noting that my cluster reaches the UpToDate state, but it never becomes READY.

Events:
  Type     Reason        Age                 From                         Message
  ----     ------        ----                ----                         -------
  Warning  UpdateFailed  49m                 containercluster-controller  Redacted
  Normal   Updating      91s (x15 over 49m)  containercluster-controller  Update in progress
  Normal   UpToDate      90s (x14 over 44m)  containercluster-controller  The resource is up to date
> kubectl get containercluster dacluster5 -n kpt-test-project-demo1
I0201 12:45:57.460202  477241 request.go:655] Throttling request took 1.17739692s, request: GET:https://34.132.193.174/apis/networking.gke.io/v1?timeout=32s
NAME         AGE    READY   STATUS     STATUS AGE

@tylerreidwaze
Collaborator Author

So this was eventually due to a configuration conflict. I wanted to have public IPs, but I had a config section that contained some of the settings for a private cluster. As a result, the cluster could never reach a ready state. However, I would have expected a more detailed error message, or possibly some issues raised by Config Sync.

The config bits I removed

  privateClusterConfig:
    enablePrivateEndpoint: false
    enablePrivateNodes: false
    masterGlobalAccessConfig:
      enabled: true

As far as I am concerned, the issue is resolved for me. There is a less urgent follow-up request here to improve the error messaging for an issue like this. Let me know if you need more details.
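
A sketch of the verification after dropping that block (names from this issue; kubectl wait works because KCC resources expose a Ready condition):

# Re-apply the cluster without privateClusterConfig and wait for READY.
> kubectl apply -f cluster.yaml
> kubectl wait --for=condition=Ready containercluster/dacluster2 -n kpt-test-project-demo1 --timeout=30m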

@tylerreidwaze tylerreidwaze changed the title Nodepool Failing to Create in Cluster without Default Node Pool GKE Cluster with some misconfigurations never reaches READY state but does not provide any error messages Feb 2, 2022
@karlkfi

karlkfi commented Feb 2, 2022

So with that config removed the cluster became ready with the default pool deleted?

@tylerreidwaze
Collaborator Author

I actually added the default pool back because I was trying to limit the issues associated with it. I was able to create the cluster, but I left the default pool in place.

I will likely remove that default pool and try again in a few days. Will report back

@karlkfi

karlkfi commented Feb 2, 2022

If leaving the default pool up solves the problem, then the root cause probably still needs fixing in KCC.

@tylerreidwaze
Collaborator Author

The cluster also started without the default node pool.

@diviner524
Collaborator

Confirmed with the customer that this is no longer impacting them after removing privateClusterConfig.
