apiserver never comes online #1319

Closed
jackfrancis opened this issue Apr 15, 2021 · 21 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.

Comments

@jackfrancis
Contributor

/kind bug

What steps did you take and what happened:

Built a "default" cluster using the provided tilt-accessible template:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml

... and the apiserver never seems to reconcile.

What did you expect to happen:

I expected the cluster to come online.

Anything else you would like to add:

The IaaS seems to be there, but the apiserver container is not running:

$ sudo crictl ps
CONTAINER           IMAGE               CREATED              STATE               NAME                      ATTEMPT             POD ID
a803e2cc635aa       0369cf4303ffd       About a minute ago   Running             etcd                      2                   65ac7cb613107
4056fe89f9b30       8d13f1db8bfb4       23 minutes ago       Running             kube-scheduler            1                   1a2c5d9c257c2
cb5abc93967d4       6f0c3da8c99e9       23 minutes ago       Running             kube-controller-manager   1                   12b39ce781218
54537d21e1b90       bfe3a36ebd252       25 minutes ago       Running             coredns                   0                   69bc19ee141ba
88ff470d8cfae       75c8849ca8404       25 minutes ago       Running             calico-kube-controllers   0                   3763e13788ab9
2c9ea6194a1c7       bfe3a36ebd252       25 minutes ago       Running             coredns                   0                   38b6e7f583094
da9a1afce015e       f0d3b0d0e32ce       25 minutes ago       Running             calico-node               0                   d9cb1b9794eaa
090c2fced8cf1       5384b16505072       26 minutes ago       Running             kube-proxy                0                   e7508dec230cc

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/bug label Apr 15, 2021
@jackfrancis
Contributor Author

cloud-init logs:

$ sudo cat /var/log/cloud-init-output.log 
[2021-04-15 20:22:59] /usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
[2021-04-15 20:22:59]   RequestsDependencyWarning)
[2021-04-15 20:23:01] Cloud-init v. 20.4.1-0ubuntu1~18.04.1 running 'init-local' at Thu, 15 Apr 2021 20:22:59 +0000. Up 15.46 seconds.
[2021-04-15 20:23:04] /usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
[2021-04-15 20:23:04]   RequestsDependencyWarning)
[2021-04-15 20:23:07] Cloud-init v. 20.4.1-0ubuntu1~18.04.1 running 'init' at Thu, 15 Apr 2021 20:23:04 +0000. Up 19.97 seconds.
[2021-04-15 20:23:07] ci-info: +++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
[2021-04-15 20:23:07] ci-info: +--------+------+-----------------------------+-------------+--------+-------------------+
[2021-04-15 20:23:07] ci-info: | Device |  Up  |           Address           |     Mask    | Scope  |     Hw-Address    |
[2021-04-15 20:23:07] ci-info: +--------+------+-----------------------------+-------------+--------+-------------------+
[2021-04-15 20:23:07] ci-info: |  eth0  | True |           10.0.0.4          | 255.255.0.0 | global | 00:0d:3a:c3:14:28 |
[2021-04-15 20:23:07] ci-info: |  eth0  | True | fe80::20d:3aff:fec3:1428/64 |      .      |  link  | 00:0d:3a:c3:14:28 |
[2021-04-15 20:23:07] ci-info: |   lo   | True |          127.0.0.1          |  255.0.0.0  |  host  |         .         |
[2021-04-15 20:23:07] ci-info: |   lo   | True |           ::1/128           |      .      |  host  |         .         |
[2021-04-15 20:23:07] ci-info: +--------+------+-----------------------------+-------------+--------+-------------------+
[2021-04-15 20:23:07] ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
[2021-04-15 20:23:07] ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[2021-04-15 20:23:07] ci-info: | Route |   Destination   | Gateway  |     Genmask     | Interface | Flags |
[2021-04-15 20:23:07] ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[2021-04-15 20:23:07] ci-info: |   0   |     0.0.0.0     | 10.0.0.1 |     0.0.0.0     |    eth0   |   UG  |
[2021-04-15 20:23:07] ci-info: |   1   |     10.0.0.0    | 0.0.0.0  |   255.255.0.0   |    eth0   |   U   |
[2021-04-15 20:23:07] ci-info: |   2   |  168.63.129.16  | 10.0.0.1 | 255.255.255.255 |    eth0   |  UGH  |
[2021-04-15 20:23:07] ci-info: |   3   | 169.254.169.254 | 10.0.0.1 | 255.255.255.255 |    eth0   |  UGH  |
[2021-04-15 20:23:07] ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[2021-04-15 20:23:07] ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[2021-04-15 20:23:07] ci-info: +-------+-------------+---------+-----------+-------+
[2021-04-15 20:23:07] ci-info: | Route | Destination | Gateway | Interface | Flags |
[2021-04-15 20:23:07] ci-info: +-------+-------------+---------+-----------+-------+
[2021-04-15 20:23:07] ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
[2021-04-15 20:23:07] ci-info: |   3   |   anycast   |    ::   |    eth0   |   U   |
[2021-04-15 20:23:07] ci-info: |   4   |    local    |    ::   |    eth0   |   U   |
[2021-04-15 20:23:07] ci-info: |   5   |  multicast  |    ::   |    eth0   |   U   |
[2021-04-15 20:23:07] ci-info: +-------+-------------+---------+-----------+-------+
[2021-04-15 20:23:23] Generating public/private rsa key pair.
[2021-04-15 20:23:23] Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
[2021-04-15 20:23:23] Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
[2021-04-15 20:23:23] The key fingerprint is:
[2021-04-15 20:23:23] SHA256:<redacted> root@default-template-control-plane-8klqz
[2021-04-15 20:23:23] The key's randomart image is:
<redacted>
[2021-04-15 20:23:23] +----[SHA256]-----+
[2021-04-15 20:23:23] Generating public/private dsa key pair.
[2021-04-15 20:23:23] Your identification has been saved in /etc/ssh/ssh_host_dsa_key.
[2021-04-15 20:23:23] Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub.
[2021-04-15 20:23:23] The key fingerprint is:
[2021-04-15 20:23:23] SHA256:<redacted> root@default-template-control-plane-8klqz
[2021-04-15 20:23:23] The key's randomart image is:
[2021-04-15 20:23:23] +---[DSA 1024]----+
<redacted>
[2021-04-15 20:23:23] +----[SHA256]-----+
[2021-04-15 20:23:23] Generating public/private ecdsa key pair.
[2021-04-15 20:23:23] Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key.
[2021-04-15 20:23:23] Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub.
[2021-04-15 20:23:23] The key fingerprint is:
[2021-04-15 20:23:23] SHA256:****** root@default-template-control-plane-8klqz
[2021-04-15 20:23:23] The key's randomart image is:
[2021-04-15 20:23:23] +---[ECDSA 256]---+
<redacted>
[2021-04-15 20:23:23] +----[SHA256]-----+
[2021-04-15 20:23:23] Generating public/private ed25519 key pair.
[2021-04-15 20:23:23] Your identification has been saved in /etc/ssh/ssh_host_ed25519_key.
[2021-04-15 20:23:23] Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
[2021-04-15 20:23:23] The key fingerprint is:
[2021-04-15 20:23:23] SHA256:2EX04Sx9VMAPjUdFLSuKNc10f2FSN4IKmAIsUrN5oq8 root@default-template-control-plane-8klqz
[2021-04-15 20:23:23] The key's randomart image is:
[2021-04-15 20:23:23] +--[ED25519 256]--+
<redacted>
[2021-04-15 20:23:23] +----[SHA256]-----+
[2021-04-15 20:23:28] /usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
[2021-04-15 20:23:28]   RequestsDependencyWarning)
[2021-04-15 20:23:29] Cloud-init v. 20.4.1-0ubuntu1~18.04.1 running 'modules:config' at Thu, 15 Apr 2021 20:23:28 +0000. Up 43.62 seconds.
[2021-04-15 20:23:30] /usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
[2021-04-15 20:23:30]   RequestsDependencyWarning)
[2021-04-15 20:23:32] [init] Using Kubernetes version: v1.20.5
[2021-04-15 20:23:32] [preflight] Running pre-flight checks
[2021-04-15 20:23:33] [preflight] Pulling images required for setting up a Kubernetes cluster
[2021-04-15 20:23:33] [preflight] This might take a minute or two, depending on the speed of your internet connection
[2021-04-15 20:23:33] [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[2021-04-15 20:23:33] [certs] Using certificateDir folder "/etc/kubernetes/pki"
[2021-04-15 20:23:33] [certs] Using existing ca certificate authority
[2021-04-15 20:23:33] [certs] Generating "apiserver" certificate and key
[2021-04-15 20:23:33] [certs] apiserver serving cert is signed for DNS names [default-template-1ea86e38.westus2.cloudapp.azure.com default-template-control-plane-8klqz kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.0.0.4]
[2021-04-15 20:23:33] [certs] Generating "apiserver-kubelet-client" certificate and key
[2021-04-15 20:23:33] [certs] Using existing front-proxy-ca certificate authority
[2021-04-15 20:23:34] [certs] Generating "front-proxy-client" certificate and key
[2021-04-15 20:23:34] [certs] Using existing etcd/ca certificate authority
[2021-04-15 20:23:34] [certs] Generating "etcd/server" certificate and key
[2021-04-15 20:23:34] [certs] etcd/server serving cert is signed for DNS names [default-template-control-plane-8klqz localhost] and IPs [10.0.0.4 127.0.0.1 ::1]
[2021-04-15 20:23:34] [certs] Generating "etcd/peer" certificate and key
[2021-04-15 20:23:34] [certs] etcd/peer serving cert is signed for DNS names [default-template-control-plane-8klqz localhost] and IPs [10.0.0.4 127.0.0.1 ::1]
[2021-04-15 20:23:34] [certs] Generating "etcd/healthcheck-client" certificate and key
[2021-04-15 20:23:34] [certs] Generating "apiserver-etcd-client" certificate and key
[2021-04-15 20:23:34] [certs] Using the existing "sa" key
[2021-04-15 20:23:34] [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[2021-04-15 20:23:35] [kubeconfig] Writing "admin.conf" kubeconfig file
[2021-04-15 20:23:35] [kubeconfig] Writing "kubelet.conf" kubeconfig file
[2021-04-15 20:23:36] [kubeconfig] Writing "controller-manager.conf" kubeconfig file
[2021-04-15 20:23:36] [kubeconfig] Writing "scheduler.conf" kubeconfig file
[2021-04-15 20:23:36] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[2021-04-15 20:23:36] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[2021-04-15 20:23:36] [kubelet-start] Starting the kubelet
[2021-04-15 20:23:36] [control-plane] Using manifest folder "/etc/kubernetes/manifests"
[2021-04-15 20:23:36] [control-plane] Creating static Pod manifest for "kube-apiserver"
[2021-04-15 20:23:36] [control-plane] Creating static Pod manifest for "kube-controller-manager"
[2021-04-15 20:23:36] [control-plane] Creating static Pod manifest for "kube-scheduler"
[2021-04-15 20:23:36] [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[2021-04-15 20:23:36] [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 20m0s
[2021-04-15 20:24:16] [kubelet-check] Initial timeout of 40s passed.
[2021-04-15 20:24:45] [apiclient] All control plane components are healthy after 69.009813 seconds
[2021-04-15 20:24:45] [upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[2021-04-15 20:24:45] [kubelet] Creating a ConfigMap "kubelet-config-1.20" in namespace kube-system with the configuration for the kubelets in the cluster
[2021-04-15 20:24:46] [upload-certs] Skipping phase. Please see --upload-certs
[2021-04-15 20:24:46] [mark-control-plane] Marking the node default-template-control-plane-8klqz as control-plane by adding the labels "node-role.kubernetes.io/master=''" and "node-role.kubernetes.io/control-plane='' (deprecated)"
[2021-04-15 20:24:46] [mark-control-plane] Marking the node default-template-control-plane-8klqz as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[2021-04-15 20:24:46] [bootstrap-token] Using token: kcsw0h.v7ldeuol96zqww9u
[2021-04-15 20:24:46] [bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[2021-04-15 20:24:46] [bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[2021-04-15 20:24:46] [bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[2021-04-15 20:24:46] [bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[2021-04-15 20:24:46] [bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[2021-04-15 20:24:46] [bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[2021-04-15 20:24:46] [kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[2021-04-15 20:24:47] [addons] Applied essential addon: CoreDNS
[2021-04-15 20:24:47] [addons] Applied essential addon: kube-proxy
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] Your Kubernetes control-plane has initialized successfully!
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] To start using your cluster, you need to run the following as a regular user:
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47]   mkdir -p $HOME/.kube
[2021-04-15 20:24:47]   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[2021-04-15 20:24:47]   sudo chown $(id -u):$(id -g) $HOME/.kube/config
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] Alternatively, if you are the root user, you can run:
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47]   export KUBECONFIG=/etc/kubernetes/admin.conf
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] You should now deploy a pod network to the cluster.
[2021-04-15 20:24:47] Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
[2021-04-15 20:24:47]   https://kubernetes.io/docs/concepts/cluster-administration/addons/
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] You can now join any number of control-plane nodes by copying certificate authorities
[2021-04-15 20:24:47] and service account keys on each node and then running the following as root:
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47]   kubeadm join default-template-1ea86e38.westus2.cloudapp.azure.com:6443 --token kcsw0h.v7ldeuol96zqww9u \
[2021-04-15 20:24:47]     --discovery-token-ca-cert-hash sha256:<redacted> \
[2021-04-15 20:24:47]     --control-plane 
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] Then you can join any number of worker nodes by running the following on each as root:
[2021-04-15 20:24:47] 
[2021-04-15 20:24:47] kubeadm join default-template-1ea86e38.westus2.cloudapp.azure.com:6443 --token kcsw0h.v7ldeuol96zqww9u \
[2021-04-15 20:24:47]     --discovery-token-ca-cert-hash sha256:<redacted> 
[2021-04-15 20:24:47] Cloud-init v. 20.4.1-0ubuntu1~18.04.1 running 'modules:final' at Thu, 15 Apr 2021 20:23:30 +0000. Up 45.91 seconds.
[2021-04-15 20:24:47] Cloud-init v. 20.4.1-0ubuntu1~18.04.1 finished at Thu, 15 Apr 2021 20:24:47 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 123.22 seconds

@jackfrancis
Contributor Author

cloud-init logs suggest that the apiserver came online initially, but something went wrong shortly afterwards.

@jackfrancis
Contributor Author

The kubelet systemd unit initially fails with an error loading the kubelet config, but I imagine that's an expected race condition the unit sometimes loses against whatever writes /var/lib/kubelet/config.yaml (kubeadm's kubelet-start phase). Pasting for posterity:

$ sudo journalctl -u kubelet --no-pager
-- Logs begin at Thu 2021-04-15 20:22:52 UTC, end at Thu 2021-04-15 20:56:19 UTC. --
Apr 15 20:23:24 default-template-control-plane-8klqz systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: F0415 20:23:29.106315    1433 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: goroutine 1 [running]:
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc0000c0840, 0xfb, 0x14d)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70ea520, 0xc000000003, 0x0, 0x0, 0xc000051340, 0x6f551a6, 0x9, 0xc6, 0x411b00)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70ea520, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000452ea0, 0x1, 0x1)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1482
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/cmd/kubelet/app.NewKubeletCommand.func1(0xc00024d8c0, 0xc00004e0b0, 0x4, 0x4)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:198 +0xe5a
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc00024d8c0, 0xc00004e0b0, 0x4, 0x4, 0xc00024d8c0, 0xc00004e0b0)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00024d8c0, 0x167621bc642c18b5, 0x70ea0e0, 0x409b25)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:958 +0x375
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:895
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: main.main()
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/kubelet.go:41 +0xe5
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: goroutine 6 [chan receive]:
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).flushDaemon(0x70ea520)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: created by k8s.io/kubernetes/vendor/k8s.io/klog/v2.init.0
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:417 +0xdf
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: goroutine 100 [select]:
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x4a8e480, 0x4f29d00, 0xc0001d0090, 0x1, 0xc000106000)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x149
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x4a8e480, 0x12a05f200, 0x0, 0xc000a9dc01, 0xc000106000)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Forever(0x4a8e480, 0x12a05f200)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:81 +0x4f
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: created by k8s.io/kubernetes/vendor/k8s.io/component-base/logs.InitLogs
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/logs/logs.go:58 +0x8a
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: goroutine 90 [select]:
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: k8s.io/kubernetes/vendor/go.opencensus.io/stats/view.(*worker).start(0xc000243e00)
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/go.opencensus.io/stats/view/worker.go:154 +0x105
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]: created by k8s.io/kubernetes/vendor/go.opencensus.io/stats/view.init.0
Apr 15 20:23:29 default-template-control-plane-8klqz kubelet[1433]:         /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57
Apr 15 20:23:29 default-template-control-plane-8klqz systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
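
If it helps to confirm that this was just the expected race (a quick sketch; the paths are the standard kubeadm/kubelet locations rather than anything specific to this node):

$ ls -l /var/lib/kubelet/config.yaml                                 # written by kubeadm during the kubelet-start phase
$ sudo journalctl -u kubelet --no-pager | grep -c 'Started kubelet'  # a count > 1 means systemd restarted the kubelet after the initial failure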

@jackfrancis
Contributor Author

On the next go-around the kubelet started up successfully, and the first real error (ignoring the spurious AWS credentials one) has to do with the network not being ready. Not sure if that's also normal, i.e. just a race between the control plane starting up and the CNI being installed:

Apr 15 20:23:36 default-template-control-plane-8klqz systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: Flag --cloud-config has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: W0415 20:23:36.548189    2053 server.go:191] Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: Flag --cloud-config has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:36.918146    2053 server.go:416] Version: v1.20.5
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: W0415 20:23:36.918566    2053 plugins.go:105] WARNING: azure built-in cloud provider is now deprecated. The Azure provider is deprecated and will be removed in a future release
Apr 15 20:23:36 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:36.961754    2053 server.go:837] Client rotation is on, will bootstrap in background
Apr 15 20:23:37 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:37.022608    2053 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.040796    2053 server.go:645] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.041815    2053 container_manager_linux.go:274] container manager verified user specified cgroup-root exists: []
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.041944    2053 container_manager_linux.go:279] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.044281    2053 topology_manager.go:120] [topologymanager] Creating topology manager with none policy per container scope
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.044402    2053 container_manager_linux.go:310] [topologymanager] Initializing Topology Manager with none policy and container-level scope
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.044525    2053 container_manager_linux.go:315] Creating device plugin manager: true
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: W0415 20:23:42.044848    2053 util_unix.go:103] Using "/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///run/containerd/containerd.sock".
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.045058    2053 remote_runtime.go:62] parsed scheme: ""
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.045248    2053 remote_runtime.go:62] scheme "" not registered, fallback to default scheme
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.045576    2053 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.046235    2053 clientconn.go:948] ClientConn switching balancer to "pick_first"
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: W0415 20:23:42.046314    2053 util_unix.go:103] Using "/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///run/containerd/containerd.sock".
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.046353    2053 remote_image.go:50] parsed scheme: ""
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.046367    2053 remote_image.go:50] scheme "" not registered, fallback to default scheme
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.046394    2053 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.046407    2053 clientconn.go:948] ClientConn switching balancer to "pick_first"
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.058311    2053 kubelet.go:265] Adding pod path: /etc/kubernetes/manifests
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.065087    2053 kubelet.go:276] Watching apiserver
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.088361    2053 kubelet.go:453] Kubelet client is not nil
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.128835    2053 kuberuntime_manager.go:216] Container runtime containerd initialized, version: v1.4.4, apiVersion: v1alpha2
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.252676    2053 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]:         For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: W0415 20:23:42.253125    2053 probe.go:268] Flexvolume plugin directory at /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ does not exist. Recreating.
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.278512    2053 server.go:1176] Started kubelet
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.285710    2053 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.288928    2053 server.go:148] Starting to listen on 0.0.0.0:10250
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.289689    2053 server.go:410] Adding debug handlers to kubelet server.
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.302122    2053 volume_manager.go:271] Starting Kubelet Volume Manager
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: I0415 20:23:42.302852    2053 desired_state_of_world_populator.go:142] Desired state populator starts to run
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.321987    2053 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
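
A quick way to check whether this is just the CNI race resolving itself (a sketch, assuming the Calico installation from the default template and a kubeconfig pointed at the workload cluster):

$ kubectl get nodes -o wide                                       # the node flips to Ready once the CNI initializes
$ kubectl -n kube-system get pods -l k8s-app=calico-node -o wide  # Calico DaemonSet pods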

@jackfrancis
Contributor Author

Here is a list of the next several error messages, in order:

Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.323146    2053 cri_stats_provider.go:376] Failed to get the info of the filesystem with mountpoint "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs": unable to find data in memory cache.
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.323171    2053 kubelet.go:1296] Image garbage collection failed once. Stats initialization may not have completed yet: invalid capacity 0 on image filesystem
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.637879    2053 kubelet.go:1857] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.738059    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:42 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:42.938257    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:43 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:43.338469    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:44 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:44.138722    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:45 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:45.738931    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:48 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:48.939085    2053 kubelet.go:1857] skipping pod synchronization - container runtime status check may not have completed yet
Apr 15 20:23:52 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:52.302640    2053 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Get "https://default-template-1ea86e38.westus2.cloudapp.azure.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/default-template-control-plane-8klqz?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers
Apr 15 20:23:52 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:52.402818    2053 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
Apr 15 20:23:57 default-template-control-plane-8klqz kubelet[2053]: E0415 20:23:57.440276    2053 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Apr 15 20:24:02 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:02.440999    2053 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Apr 15 20:24:02 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:02.441707    2053 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: nodes have not yet been read at least once, cannot construct node object
Apr 15 20:24:02 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:02.503055    2053 controller.go:144] failed to ensure lease exists, will retry in 400ms, error: Get "https://default-template-1ea86e38.westus2.cloudapp.azure.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/default-template-control-plane-8klqz?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Apr 15 20:24:02 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:02.503473    2053 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
Apr 15 20:24:07 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:07.022999    2053 certificate_manager.go:437] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post "https://default-template-1ea86e38.westus2.cloudapp.azure.com:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp 20.64.194.163:6443: i/o timeout
Apr 15 20:24:07 default-template-control-plane-8klqz kubelet[2053]: E0415 20:24:07.441764    2053 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

@CecileRobertMichon
Contributor

What k8s version is this with? Any interesting cloud-controller-manager logs?

Can you also verify that /etc/kubernetes/azure.json has all the expected info, including credentials?

@jackfrancis
Contributor Author

1.20.5:

$ sudo crictl ps | grep controller-manager
cb5abc93967d4       6f0c3da8c99e9       40 minutes ago      Running             kube-controller-manager   1                   12b39ce781218
$ sudo crictl logs cb5abc93967d4 | head -n 5
Flag --port has been deprecated, see --secure-port instead.
I0415 20:28:02.655940       1 serving.go:331] Generated self-signed cert in-memory
I0415 20:28:03.440305       1 controllermanager.go:176] Version: v1.20.5
I0415 20:28:03.445226       1 secure_serving.go:197] Serving securely on 127.0.0.1:10257
I0415 20:28:03.445281       1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-controller-manager...
I0415 20:28:03.445696       1 dynamic_cafile_content.go:167] Starting request-header::/etc/kubernetes/pki/front-proxy-ca.crt
I0415 20:28:03.445740       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0415 20:28:03.445947       1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
E0415 20:28:13.445713       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0415 20:28:27.439394       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0415 20:28:37.678983       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
E0415 20:28:51.596190       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0415 20:29:05.245648       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: context deadline exceeded
E0415 20:29:09.205614       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:13.262802       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:15.594618       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:17.933177       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:22.066042       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:25.915466       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:28.534662       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:32.610662       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": dial tcp 10.0.0.4:6443: connect: connection refused
E0415 20:29:45.879789       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0415 20:29:58.664273       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.0.0.4:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": context deadline exceeded
$ sudo cat /etc/kubernetes/azure.json 
{
    "cloud": "AzurePublicCloud",
    "tenantId": "<redacted>",
    "subscriptionId": "<redacted>",
    "aadClientId": "<redacted>",
    "aadClientSecret": "<redacted>",
    "resourceGroup": "default-template",
    "securityGroupName": "default-template-node-nsg",
    "securityGroupResourceGroup": "default-template",
    "location": "westus2",
    "vmType": "vmss",
    "vnetName": "default-template-vnet",
    "vnetResourceGroup": "default-template",
    "subnetName": "default-template-node-subnet",
    "routeTableName": "default-template-node-routetable",
    "loadBalancerSku": "Standard",
    "maximumLoadBalancerRuleCount": 250,
    "useManagedIdentityExtension": false,
    "useInstanceMetadata": true
}

@jackfrancis
Contributor Author

There is no running cloud-controller-manager container.

@jackfrancis
Contributor Author

O.K., I now see that only 2 of 3 control plane VMs came online. Will look at capi logs...

@jackfrancis
Contributor Author

It looks like the 2nd (of 3) control plane machines is stuck in Provisioned, which is why a 3rd one has never been created:

default     default-template-control-plane-2cnlm    azure:///subscriptions/aa3d3369-e814-4495-899d-d31e8d7d09ce/resourceGroups/default-template/providers/Microsoft.Compute/virtualMachines/default-template-control-plane-8klqz   Running       v1.20.5
default     default-template-control-plane-wcjv4    azure:///subscriptions/aa3d3369-e814-4495-899d-d31e8d7d09ce/resourceGroups/default-template/providers/Microsoft.Compute/virtualMachines/default-template-control-plane-s7q7n   Provisioned   v1.20.5
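
A listing like the one above can be pulled from the management cluster with something along these lines (the exact flags are an assumption):

$ kubectl get machines -A -o wide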

@jackfrancis
Contributor Author

Here it is:

$ k describe azuremachine default-template-control-plane-s7q7n
Name:         default-template-control-plane-s7q7n
Namespace:    default
Labels:       cluster.x-k8s.io/cluster-name=default-template
              cluster.x-k8s.io/control-plane=
Annotations:  cluster-api-provider-azure: true
              cluster.x-k8s.io/cloned-from-groupkind: AzureMachineTemplate.infrastructure.cluster.x-k8s.io
              cluster.x-k8s.io/cloned-from-name: default-template-control-plane
              sigs.k8s.io/cluster-api-provider-azure-last-applied-tags-vm: {"kubernetes.io_cluster_default-template":"owned"}
API Version:  infrastructure.cluster.x-k8s.io/v1alpha4
Kind:         AzureMachine
Metadata:
  Creation Timestamp:  2021-04-15T20:26:27Z
  Finalizers:
    azuremachine.infrastructure.cluster.x-k8s.io
  Generation:  2
  Managed Fields:
    API Version:  infrastructure.cluster.x-k8s.io/v1alpha4
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:cluster-api-provider-azure:
          f:cluster.x-k8s.io/cloned-from-groupkind:
          f:cluster.x-k8s.io/cloned-from-name:
          f:sigs.k8s.io/cluster-api-provider-azure-last-applied-tags-vm:
        f:finalizers:
        f:labels:
          .:
          f:cluster.x-k8s.io/cluster-name:
          f:cluster.x-k8s.io/control-plane:
        f:ownerReferences:
      f:spec:
        .:
        f:dataDisks:
        f:identity:
        f:osDisk:
          .:
          f:diskSizeGB:
          f:managedDisk:
            .:
            f:storageAccountType:
          f:osType:
        f:providerID:
        f:sshPublicKey:
        f:vmSize:
      f:status:
        .:
        f:addresses:
        f:conditions:
        f:failureMessage:
        f:failureReason:
        f:ready:
        f:vmState:
    Manager:    manager
    Operation:  Update
    Time:       2021-04-15T20:47:47Z
  Owner References:
    API Version:           controlplane.cluster.x-k8s.io/v1alpha4
    Kind:                  KubeadmControlPlane
    Name:                  default-template-control-plane
    UID:                   036e1e01-8c12-4b03-9546-e69664091622
    API Version:           cluster.x-k8s.io/v1alpha4
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Machine
    Name:                  default-template-control-plane-wcjv4
    UID:                   c62aaac8-2f11-4cf0-ba8b-414442e317ae
  Resource Version:        97293
  Self Link:               /apis/infrastructure.cluster.x-k8s.io/v1alpha4/namespaces/default/azuremachines/default-template-control-plane-s7q7n
  UID:                     801147de-6e45-44fb-a617-8630e7209737
Spec:
  Data Disks:
    Caching Type:  ReadWrite
    Disk Size GB:  256
    Lun:           0
    Name Suffix:   etcddisk
  Identity:        None
  Os Disk:
    Caching Type:  None
    Disk Size GB:  128
    Managed Disk:
      Storage Account Type:  Premium_LRS
    Os Type:                 Linux
  Provider ID:               azure:///subscriptions/aa3d3369-e814-4495-899d-d31e8d7d09ce/resourceGroups/default-template/providers/Microsoft.Compute/virtualMachines/default-template-control-plane-s7q7n
  Ssh Public Key:            c3NoLXJzYSBBQUFBQjNOemFDMXljMkVBQUFBREFRQUJBQUFDQVFETVdHalZWUE9PWUxTV3JZbmk3REVwM21Ld1hobjEyUXNLWkJ2WEZHenlDUG5IdmV1TVBkSUcrSHJjV3dZc1ppKzZ4U0JFNHZQbVFwaWJaNks0OGh3MExXNzhrd21sMWtINmVPYW5mL2R1ZWdGS0VqM3hydUdSMWdLTFRQNE1kYTN4Y1NuVnNkMFNQckhaWC92Sjllekw2TFd1UFJQTzJQUzhrUk5ZQWJQOVVlVlZIVkZsYi9KdExHWDROMlZqbUpNN2VjZmZtWFZkWWs2SmR0S2pTMWxlTWxjL09STWNha2tSZWJIK1lRdjBIN010UUdCSkxxT09iNVljK3VSaFA0c1hrQTBSMXpJQ0kvOWc4clYwLzJQNkRZRnpvU2FRSXpXNTRhMXRvY0dkK0J5dGEzUXRCRVd4N2ltblZJeEVhSXlkQ3gxeDRpYzQwd0h2TEthMUpTalVSVmhPS3BIZGpWeDduRWpDV0xuYU9MejUwRTMvTXV3VG1nQnFGeWFDN1RNU3d4WSt2VTM0ZEtDUXp6eklIQkIrY3IvQngwR0ZaU0swR0hVcml3YkkzZWswcjQxS2dJUHZydDEwOVdvc1pBMHJlQW9QL3A3b1hNRFY1QmUxVkJ4UWIwakFkYzhFa2JBMFFmbWo1b0pCN3lWSkRZdnhJUnRlaEw0WHZJblorblN5TjRyMkV0eUN3S3JlcHpvZENLK042RlM2bmFHOS9mR212bTRmTUk0b1RkNDRYNHZqTHpxMDhWTFpQcDJyWjFvcTFSZ1hhQkdBWUFYUFRXT0Z1eGU1dldZWStITXNudnRyczNnN3QyNzk5dnZkK3o4QnMwNEc5bU1BRS9QN1lweEhHZWJEMGRDTjIwOFhJaFRiTTk0Z0V2SVBQaTNVOEFLTUJTZTNhMVErMHc9PSBqYWNrQGRlaXMuY29tCg==
  Vm Size:                   Standard_D2s_v3
Status:
  Addresses:
    Address:  10.0.0.5
    Type:     InternalIP
  Conditions:
    Last Transition Time:  2021-04-15T20:47:47Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2021-04-15T20:47:47Z
    Reason:                BootstrapFailed
    Severity:              Error
    Status:                False
    Type:                  BoostrapSucceeded
    Last Transition Time:  2021-04-15T20:47:47Z
    Status:                True
    Type:                  VMRunning
  Failure Message:         unable to create vm extension: reconcile error occurred that cannot be recovered. Object will not be requeued. The actual error is: extension state failed
  Failure Reason:          CreateError
  Ready:                   false
  Vm State:                Failed
Events:                    <none>

@jackfrancis
Contributor Author

$ sudo cat /var/log/cloud-init-output.log | grep error
[2021-04-15 21:00:47] error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
[2021-04-15 21:00:47] To see the stack trace of this error execute with --v=5 or higher
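
A possible next diagnostic step when a join fails like this is to check etcd membership from the control plane node that did come up successfully (a sketch, assuming the stock kubeadm static-pod etcd, its default certificate paths, and that etcdctl is available inside the etcd container; <etcd-container-id> is a placeholder):

$ sudo crictl ps | grep etcd
$ sudo crictl exec <etcd-container-id> etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    member list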

@jackfrancis
Contributor Author

I can reproduce this pretty easily with this script:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh

I cannot reproduce using the above script if I revert this change:

#1029

In other words, if I add useExperimentalRetryJoin: true back and re-run make generate-flavors, I can consistently build 3-control-plane-node "default" clusters (templates/cluster-template.yaml); the relevant fragment is sketched just below.
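
The knob lands in the generated template roughly like this (a sketch of just the relevant KubeadmControlPlane fragment; the name uses the template's ${CLUSTER_NAME} placeholder):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha4
kind: KubeadmControlPlane
metadata:
  name: ${CLUSTER_NAME}-control-plane
spec:
  kubeadmConfigSpec:
    # re-enables the kubeadm bootstrap provider's retry-on-join behavior
    useExperimentalRetryJoin: true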

What's the current CI test coverage for > 1 control plane node clusters since useExperimentalRetryJoin was disabled? Is it possible we don't have regular test coverage for 3|5 control plane node scenarios?

@CecileRobertMichon
Contributor

What's the current CI test coverage for > 1 control plane node clusters since useExperimentalRetryJoin was disabled? Is it possible we don't have regular test coverage for 3|5 control plane node scenarios?

The "With 3 control-plane nodes and 2 worker nodes" spec runs on every PR and every periodic job.

Looking at the job history at https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-capi-e2e-full-main, that spec failed 2 out of the last 32 runs. One of those was an unrelated temporary issue that affected all specs in that run; the other was a timeout waiting for the second worker node (all 3 control planes came up successfully).

I'm not sure what's different between the above test, which seems stable, and your repro script. Do you have an estimate of what "pretty easily" is in terms of failure rate? One difference I do see is that the k8s version in CI is 1.19.7 while you're testing with 1.20.5.

@jackfrancis
Contributor Author

Do you have an estimate of what "pretty easily" is in terms of failure rate

Greater than 50% in my tests.

Is it possible that the mgmt cluster is responsible? I'm using the config that make tilt-up puts into a kind cluster running capi/capz controllers.

@CecileRobertMichon
Contributor

Re: useExperimentalRetryJoin, what you're seeing indicates that somehow the retries kubeadm added in 1.19+ are not resilient enough in this particular scenario. See this Slack thread for context. As @neolit123 suggested in the thread, we should file issues against kubeadm when we encounter such flakes. Can you please file a kubeadm issue with the evidence above?

@jackfrancis
Contributor Author

Are we currently testing 1.20 for capz?

@jackfrancis
Contributor Author

:/ just reproduced it with retry enabled:

Events:
  Type     Reason                 Age                     From                              Message
  ----     ------                 ----                    ----                              -------
  Warning  ControlPlaneUnhealthy  9m10s (x15 over 11m)    kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine acse-test-capz-repro-6c840-control-plane-9rq5h does not have APIServerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-9rq5h does not have ControllerManagerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-9rq5h does not have SchedulerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-9rq5h does not have EtcdPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-9rq5h does not have EtcdMemberHealthy condition]
  Warning  ControlPlaneUnhealthy  9m6s (x2 over 9m8s)     kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine acse-test-capz-repro-6c840-control-plane-9rq5h reports APIServerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports ControllerManagerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports SchedulerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports EtcdPodHealthy condition is false (Info, Waiting for startup or readiness probes)]
  Warning  ControlPlaneUnhealthy  8m19s                   kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine acse-test-capz-repro-6c840-control-plane-9rq5h reports ControllerManagerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports SchedulerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports EtcdPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the acse-test-capz-repro-6c840-control-plane-j5qsz node)]
  Warning  ControlPlaneUnhealthy  8m18s (x7 over 9m9s)    kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine acse-test-capz-repro-6c840-control-plane-9rq5h reports ControllerManagerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports SchedulerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine acse-test-capz-repro-6c840-control-plane-9rq5h reports EtcdPodHealthy condition is false (Info, Waiting for startup or readiness probes)]
  Warning  ControlPlaneUnhealthy  6m41s (x12 over 7m45s)  kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine acse-test-capz-repro-6c840-control-plane-2rfrq does not have APIServerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-2rfrq does not have ControllerManagerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-2rfrq does not have SchedulerPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-2rfrq does not have EtcdPodHealthy condition, machine acse-test-capz-repro-6c840-control-plane-2rfrq does not have EtcdMemberHealthy condition]

@CecileRobertMichon
Contributor

These events happen while the new control planes are coming up; they do not necessarily indicate a failure. Do you have cloud-init logs that show a kubeadm join failure or error?

@jackfrancis
Contributor Author

I can't repro this when building a 1.19.7 cluster using the templates/cluster-template.yaml spec (useExperimentalRetryJoin is disabled).

This strongly suggests the culprit is 1.20.5. I'll repro using that version and gather data for kubeadm.
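
A minimal sketch of how the version under test could be pinned when rendering the template (this assumes the other ${...} template variables are already exported in the environment, and uses plain envsubst rather than any particular tooling):

$ export KUBERNETES_VERSION=v1.20.5   # or v1.19.7 for the control case
$ envsubst < templates/cluster-template.yaml | kubectl apply -f -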

@jackfrancis
Contributor Author

Moved here:

kubernetes/kubeadm#2450
