Increase the value of CABPK --bootstrap-token-ttl to account for etcd replication time in vSphere clusters #954

randomvariable · 2021-10-22T13:23:44Z

Bug description

Affected product area (please put an X in all that apply)

Expected behavior
See kubernetes-sigs/cluster-api#5477

Steps to reproduce the bug
Deploy a 3-node control plane cluster using CAPV with kube-vip on PhotonOS (for some reason, the bug is more likely to occur there), then set kcp.spec.upgradeAfter to trigger a rollout.

Due to a combination of kube-vip/kube-vip#214 and the fact that etcd learner mode is not yet implemented in kubeadm or fully supported in Kubernetes, the etcd leader changes many times, and the kube-vip leader election rotates as well. During this time, the CAPI controllers cannot fully reconcile and kubelet cannot register nodes. Importantly, CABPK is also unable to renew the bootstrap token. etcd replication eventually completes, but only after the 15-minute bootstrap token timeout has passed. kubelet node registration then fails, and we end up with an orphaned control plane machine that is still a valid member of the etcd cluster, which complicates cleanup.

Version (include the SHA if the version is not obvious)

TKG v1.4.0

Environment where the bug was observed (cloud, OS, etc)
vSphere 7 U2
Photon v1.21 OVAs

Relevant Debug Output (Logs, manifests, etc)

As @fabriziopandini notes, the --bootstrap-token-ttl is configurable. I think it makes sense to bump that TTL given vSphere is one of the most common deployments for Tanzu.


randomvariable added the kind/bug and needs-triage labels Oct 22, 2021

randomvariable mentioned this issue Oct 22, 2021

During KCP rollout, etcd churn can prevent renewal of the bootstrap token causing KCP node registration to fail kubernetes-sigs/cluster-api#5477

Closed

randomvariable changed the title ~~Increase the value of --bootstrap-token-ttl to account for etcd replication time in vSphere clusters~~ Increase the value of CABPK --bootstrap-token-ttl to account for etcd replication time in vSphere clusters Oct 22, 2021

randomvariable mentioned this issue Oct 22, 2021

Extend CABPK timeout to 30m to account for etcd cluster settling in vSphere environments. #955

Closed

3 tasks

randomvariable self-assigned this Oct 22, 2021

randomvariable added the priority/critical-urgent label Oct 22, 2021

yharish991 added the area/lcm label and removed the needs-triage label Oct 26, 2021

randomvariable mentioned this issue Oct 28, 2021

KCP upgrade is failing because of wrong member in etcd kubernetes-sigs/cluster-api#5509

Closed

andyzheung mentioned this issue Nov 25, 2022

About CAPV related to CAPI kubernetes-sigs/cluster-api-provider-vsphere#1700

Closed
