
system-cluster-critical pod "calico-kube-controllers" is not yet ready over 10 minutes when performing rolling update #10207

Closed
CheyiLin opened this issue Nov 10, 2020 · 6 comments

Comments

@CheyiLin
Contributor

1. What kops version are you running?
Version 1.18.2 (git-84495481e4)

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T01:06:31Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T00:59:17Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --name my-cluster-name --yes

5. What happened after the commands executed?
The cluster cannot pass validation; it keeps waiting for the rescheduled system-cluster-critical pod calico-kube-controllers to become ready.

kops logs
...
I1110 13:55:22.615378   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers), system-cluster-critical pod "metrics-server-6dffd46f94-27fv5" is pending.
I1110 13:55:54.935669   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:56:27.642298   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:56:59.836149   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:57:32.576827   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:58:05.432661   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:58:38.093650   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:59:10.570779   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
...
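For reference, the not-ready pod can be inspected with standard kubectl commands while the validation loop above keeps retrying. This is a minimal sketch; the pod name is taken from the kops output above, and kube-system is assumed as the namespace kops deploys calico-kube-controllers into.

kubectl -n kube-system get pod calico-kube-controllers-79c8f4bdc4-6tft2 -o wide   # readiness and node placement
kubectl -n kube-system describe pod calico-kube-controllers-79c8f4bdc4-6tft2      # events, including readiness probe failures
kubectl -n kube-system logs calico-kube-controllers-79c8f4bdc4-6tft2              # produces the output shown below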
pod calico-kube-controllers logs
2020-11-10 05:46:17.832 [INFO][1] ipam.go 45: Synchronizing IPAM data                                                                                                                                                                         
2020-11-10 05:46:17.927 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-104-191-29.us-west-2.compute.internal) with error: nodes "ip-10-104-191-29.us-west-2.compute.inte 
2020-11-10 05:46:17.927 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-104-191-29.us-west-2.compute.internal" k8sNode=""                                                                                                              
2020-11-10 05:46:17.932 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-104-191-29.us-west-2.compute.internal" k8sNode=""                                                                                
2020-11-10 05:46:17.932 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-104-191-29.us-west-2.compute.internal'                                                                                                  
2020-11-10 05:46:18.066 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.066 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.36e91144959c16cd8a006ee6d91c7f7d5b771f38bb0cfdf692049cee6f545de1'                                                                              
2020-11-10 05:46:18.117 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.117 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.73961784a28b84cd384d6bda1d76c902a0b317baf33be5bd87fce0adc6d322f6'                                                                              
2020-11-10 05:46:18.169 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.251 [INFO][1] ipam.go 190: Node and IPAM data is in sync                                                                                                                                                                  
2020-11-10 05:46:51.498 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:46:51.498 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:47:23.501 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:47:43.502 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:47:43.502 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:48:15.503 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:48:35.504 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:48:35.504 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:49:07.504 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:49:27.505 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:49:27.505 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:49:59.505 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:50:19.506 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:50:19.506 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:50:51.507 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:51:11.508 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:51:11.508 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:51:43.509 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:52:03.509 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:52:03.510 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:52:35.512 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:52:55.513 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:52:55.513 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:53:27.513 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:53:47.514 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:53:47.514 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:54:19.514 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:54:39.515 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:54:39.515 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:55:11.516 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:55:31.517 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:55:31.517 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:56:03.517 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:56:23.518 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:56:23.518 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:56:55.519 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:57:15.519 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:57:15.519 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:57:47.520 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:58:07.520 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:58:07.520 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                               
...
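As a sanity check (a hedged sketch, not part of the original report): the same ClusterInformation object the pod keeps failing to fetch can be requested directly through the apiserver. If this succeeds, the apiserver and the CRD are healthy and the failure is local to the stuck pod's connection.

kubectl get --raw /apis/crd.projectcalico.org/v1/clusterinformations/default      # same path as in the errors above
kubectl get clusterinformations.crd.projectcalico.org default -o yaml             # equivalent, via the CRD resource name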

6. What did you expect to happen?
The rescheduled calico-kube-controllers pod should become ready within a few minutes.

7. Please provide your cluster manifest.

kops cluster & ig manifests
---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-name.k8s.local
spec:
  additionalNetworkCIDRs:
  - my-network-cidr
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-k8s-state/my-cluster-name.k8s.local
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
    node:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    disableBasicAuth: true
    eventTTL: 3h0m0s
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubeDNS:
    coreDNSImage: k8s.gcr.io/coredns:1.7.0
    externalCoreFile: |-
      # forward AWS records to Amazon DNS server within VPC
      amazonaws.com:53 {
        errors
        prometheus :9153
        forward . 169.254.169.253
        cache 30
      }
      # forward external records to public DNS servers
      # public DNS providers: Google > Cloudflare > Quad9 (IBM)
      com net org io {
        errors
        prometheus :9153
        forward . 8.8.8.8 1.1.1.1 9.9.9.9 {
          policy sequential
        }
        cache 30
      }
      .:53 {
        log . "* - {type} {class} {name} {proto} {size} {rcode} {rsize} {duration} > {remote}"
        errors
        health
        autopath @kubernetes
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods verified
          endpoint_pod_names
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        loadbalance
        reload
        ready
      }
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    enforceNodeAllocatable: pods
    kubeReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.17.13
  networkCIDR: my-network-cidr
  networkID: my-vpc-id
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - id: subnet-000001
    name: subnet-a1
    type: Private
    zone: us-west-2a
  - id: subnet-000002
    name: subnet-b1
    type: Private
    zone: us-west-2b
  - id: subnet-000003
    name: subnet-c1
    type: Private
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: master-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: master-a1
  role: Master
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-a1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-b1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-b1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-b1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-c1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-c1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-c1
  suspendProcesses:
  - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Not available yet.

9. Anything else do we need to know?

  1. The cluster uses the Calico CNI with cross-subnet mode enabled:
  networking:
    calico:
      crossSubnet: true
  2. Manually deleting the not-ready calico-kube-controllers pod resolves the issue: the newly created pod works normally, the cluster passes validation, and the rolling update continues (a sketch of this workaround follows the logs below).
newly created pod calico-kube-controllers logs
2020-11-10 06:00:04.995 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kuberne 
W1110 06:00:04.997519       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                                                                                        
2020-11-10 06:00:04.998 [INFO][1] main.go 109: Ensuring Calico datastore is initialized                                                                                                                                                       
2020-11-10 06:00:05.009 [INFO][1] main.go 149: Getting initial config snapshot from datastore                                                                                                                                                 
2020-11-10 06:00:05.029 [INFO][1] main.go 152: Got initial config snapshot                                                                                                                                                                    
2020-11-10 06:00:05.029 [INFO][1] watchersyncer.go 89: Start called                                                                                                                                                                           
2020-11-10 06:00:05.029 [INFO][1] main.go 169: Starting status report routine                                                                                                                                                                 
2020-11-10 06:00:05.030 [INFO][1] main.go 402: Starting controller ControllerType="Node"                                                                                                                                                      
2020-11-10 06:00:05.030 [INFO][1] node_controller.go 138: Starting Node controller                                                                                                                                                            
2020-11-10 06:00:05.030 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready                                                                                                                                           
2020-11-10 06:00:05.030 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready                                                                                                                                    
2020-11-10 06:00:05.030 [INFO][1] watchersyncer.go 147: Starting main event processing loop                                                                                                                                                   
2020-11-10 06:00:05.031 [INFO][1] resources.go 343: Main client watcher loop                                                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] watchercache.go 297: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"                                                                                                          
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 127: Sending status update Status=resync                                                                                                                                                   
2020-11-10 06:00:05.037 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync                                                                                                                                            
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync                                                                                                                           
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync                                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync                                                                                                                                           
2020-11-10 06:00:05.046 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints                                                                                                                                                  
2020-11-10 06:00:05.130 [INFO][1] node_controller.go 151: Node controller is now running                                                                                                                                                      
2020-11-10 06:00:05.130 [INFO][1] ipam.go 45: Synchronizing IPAM data                                                                                                                                                                         
2020-11-10 06:00:05.155 [INFO][1] ipam.go 190: Node and IPAM data is in sync
...
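A minimal sketch of the manual workaround from point 2 above; the pod name comes from the earlier kops output, and the Deployment recreates the pod immediately after deletion:

kubectl -n kube-system delete pod calico-kube-controllers-79c8f4bdc4-6tft2
# or by label selector, assuming the k8s-app=calico-kube-controllers label used by the kops-managed Deployment:
kubectl -n kube-system delete pod -l k8s-app=calico-kube-controllers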
@zetaab
Member

zetaab commented Nov 10, 2020

For me it can sometimes take longer than 10 minutes; that is why I use a 20-minute maximum wait time.
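For anyone who wants the same behaviour, a hedged sketch assuming the --validation-timeout flag of kops rolling-update (it controls how long kops waits for cluster validation before giving up):

kops rolling-update cluster --name my-cluster-name --validation-timeout 20m --yes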

@CheyiLin
Contributor Author

@zetaab Good to know that we're not alone. 😆
But if I manually delete the not-ready calico-kube-controllers pod, the newly created pod works normally and the cluster passes validation immediately.
In my opinion, there should be something we can improve or fix to speed up the rolling-update process.

@hakman
Member

hakman commented Nov 10, 2020

@CheyiLin This is a Calico and client-go issue, unrelated to kOps. Please read more about it in:
projectcalico/calico#3751
https://github.com/projectcalico/libcalico-go/issues/1267
kubernetes/client-go#374

@CheyiLin
Contributor Author

@hakman The issue projectcalico/calico#3751 is exactly what I reported here.
Thanks for the context and for pointing this out!

@hakman
Member

hakman commented Nov 10, 2020

No problem. Keep an eye on those. I see there's some activity.

@rtacconi

rtacconi commented Jan 3, 2024
