This repository has been archived by the owner on Oct 10, 2023. It is now read-only.
Increase the value of CABPK --bootstrap-token-ttl to account for etcd replication time in vSphere clusters #954
Labels
area/lcm
Related to Cluster Lifecycle management
kind/bug
PR/Issue related to a bug
priority/critical-urgent
Bug description
Affected product area (please put an X in all that apply)
Expected behavior
See kubernetes-sigs/cluster-api#5477
Steps to reproduce the bug
Deploy a 3-node control plane cluster using CAPV with kube-vip on Photon OS (for some reason, the bug is more likely to occur there), then set kcp.spec.upgradeAfter to trigger a rollout.
Due to a combination of kube-vip/kube-vip#214 and the fact that etcd learner mode is not yet implemented in kubeadm or fully supported in Kubernetes, the etcd leader changes many times, and kube-vip leader election rotates as well. During this time, the CAPI controllers cannot fully reconcile and kubelet cannot register nodes. Importantly, CABPK is also unable to renew the bootstrap token. etcd replication eventually completes, but only after the 15-minute bootstrap token TTL has expired. kubelet node registration then fails, and we end up with an orphaned control plane machine that is still a valid member of the etcd cluster, which complicates cleanup.
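The timing argument above can be sketched as a small calculation. This is a simplified model, not CABPK's actual renewal logic: it assumes the token was last renewed some time before the API server became unreachable, that no renewal happens during the outage, and that kubelet needs a still-valid token once etcd settles. The function name and all example durations except the 15-minute TTL are hypothetical.

```python
from datetime import timedelta


def token_expires_before_join(ttl, since_last_renewal, outage, join_time=timedelta(0)):
    """Return True if the bootstrap token expires before kubelet can use it.

    Simplified model (assumption): the token is not renewed at all while the
    control plane is unstable, so its age keeps growing through the outage.
    """
    # Age of the token at the moment kubelet finally attempts to register.
    age_at_join = since_last_renewal + outage + join_time
    return age_at_join >= ttl


# Hypothetical numbers: token renewed 5 minutes before a 12-minute period of
# etcd leader changes. With the 15-minute TTL the token is already expired.
default_ttl = timedelta(minutes=15)
longer_ttl = timedelta(minutes=30)  # hypothetical increased TTL
since_renewal = timedelta(minutes=5)
outage = timedelta(minutes=12)

print(token_expires_before_join(default_ttl, since_renewal, outage))
print(token_expires_before_join(longer_ttl, since_renewal, outage))
```

Under this model, the TTL has to exceed the worst-case time without renewal plus the time kubelet needs to register, which is why a longer TTL helps when etcd replication on vSphere is slow.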
Version (include the SHA if the version is not obvious)
TKG v1.4.0
Environment where the bug was observed (cloud, OS, etc)
vSphere 7 U2
Photon v1.21 OVAs
Relevant Debug Output (Logs, manifests, etc)
As @fabriziopandini notes, the --bootstrap-token-ttl is configurable. I think it makes sense to bump that TTL given vSphere is one of the most common deployments for Tanzu.