This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

Increase the value of CABPK --bootstrap-token-ttl to account for etcd replication time in vSphere clusters #954

Open
1 of 10 tasks
randomvariable opened this issue Oct 22, 2021 · 0 comments
Assignees
Labels
area/lcm Related to Cluster Lifecycle management kind/bug PR/Issue related to a bug priority/critical-urgent

Comments

@randomvariable
Contributor

Bug description

Affected product area (please put an X in all that apply)

  • APIs
  • Addons
  • CLI
  • Docs
  • IAM
  • Installation
  • Plugin
  • Security
  • Test and Release
  • User Experience

Expected behavior
See kubernetes-sigs/cluster-api#5477

Steps to reproduce the bug
Deploy a 3-node control plane cluster using CAPV with kube-vip on PhotonOS (for some reason, the bug is more likely to occur there), then set kcp.spec.upgradeAfter to trigger a rollout.
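
For anyone reproducing this, a minimal sketch of building the merge patch that sets `spec.upgradeAfter` might look like the following. The resource name `my-cluster-control-plane` is a placeholder, and the sketch assumes the KubeadmControlPlane API version in use still exposes the `upgradeAfter` field named above.

```python
import json
from datetime import datetime, timezone


def upgrade_after_patch(when: datetime) -> str:
    """Return a JSON merge patch that sets spec.upgradeAfter on a KubeadmControlPlane."""
    # Kubernetes timestamps are serialised as RFC 3339 in UTC.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({"spec": {"upgradeAfter": stamp}})


if __name__ == "__main__":
    patch = upgrade_after_patch(datetime.now(timezone.utc))
    # "my-cluster-control-plane" is a hypothetical KCP name; substitute your own.
    print(
        "kubectl patch kubeadmcontrolplane my-cluster-control-plane "
        f"--type merge -p '{patch}'"
    )
```

The script only prints the command rather than running it, so the patch can be reviewed before it is applied to a management cluster.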

Due to a combination of kube-vip/kube-vip#214 and the fact that etcd learner mode is not yet implemented in kubeadm or fully supported in Kubernetes, the etcd leader changes many times, and kube-vip leader election rotates as well. During this time, the CAPI controllers are unable to fully reconcile, and kubelet cannot register nodes. Importantly, CABPK is also unable to renew the bootstrap token. etcd replication eventually completes, but only after the 15-minute bootstrap token timeout has elapsed. kubelet node registration then fails, leaving an orphaned control plane machine that is still a valid member of the etcd cluster, which complicates cleanup.
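
The race described above can be sketched as a simplified timing model. It assumes that each successful CABPK renewal pushes the token expiry to the renewal time plus the TTL; the function name and parameters are hypothetical and only illustrate the failure condition, not CABPK's actual code.

```python
from datetime import timedelta


def token_valid_at_join(
    ttl: timedelta,
    replication_time: timedelta,
    last_renewal: timedelta = timedelta(0),
) -> bool:
    """Return True if the bootstrap token is still valid when etcd replication completes.

    All offsets are measured from the moment the token was created.
    last_renewal is the offset of the last renewal that succeeded; it stays
    at zero when the controllers cannot reconcile at all (the case in this issue).
    """
    # Assumption: a renewal resets the expiry to renewal time + TTL.
    expiry = last_renewal + ttl
    return replication_time < expiry


if __name__ == "__main__":
    default_ttl = timedelta(minutes=15)
    # Replication takes 20 minutes and no renewal succeeded: the join fails.
    print(token_valid_at_join(default_ttl, timedelta(minutes=20)))
    # A longer TTL (30 minutes is an illustrative value) covers the same delay.
    print(token_valid_at_join(timedelta(minutes=30), timedelta(minutes=20)))
```

Under this model, the only ways to survive a period without renewals are a shorter replication time or a longer TTL, which is the motivation for raising the default.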

Version (include the SHA if the version is not obvious)

TKG v1.4.0

Environment where the bug was observed (cloud, OS, etc)
vSphere 7 U2
Photon v1.21 OVAs

Relevant Debug Output (Logs, manifests, etc)

As @fabriziopandini notes, --bootstrap-token-ttl is configurable. I think it makes sense to bump that TTL, given that vSphere is one of the most common deployment targets for Tanzu.
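
As a rough illustration of the proposed change, the helper below rewrites a container args list so that it carries a new `--bootstrap-token-ttl` value. The helper name, the example args, and the `30m` value are assumptions for illustration; the sketch only handles the `--flag=value` form and does not decide what the new default should be.

```python
from typing import List

FLAG = "--bootstrap-token-ttl"


def with_bootstrap_token_ttl(args: List[str], ttl: str) -> List[str]:
    """Return a copy of args with --bootstrap-token-ttl set to ttl.

    Only the "--bootstrap-token-ttl=<value>" form is recognised; an existing
    entry in that form is dropped and the new value is appended.
    """
    kept = [a for a in args if not a.startswith(FLAG + "=")]
    return kept + [f"{FLAG}={ttl}"]


if __name__ == "__main__":
    # Hypothetical args for the CABPK manager container.
    current = ["--leader-elect", "--bootstrap-token-ttl=15m"]
    print(with_bootstrap_token_ttl(current, "30m"))
```

The resulting list could then be placed in the manager Deployment's container spec by whatever mechanism the installation already uses to customise provider manifests.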

@randomvariable randomvariable added kind/bug PR/Issue related to a bug needs-triage Indicates an issue or PR needs to be triaged labels Oct 22, 2021
@randomvariable randomvariable changed the title from "Increase the value of --bootstrap-token-ttl to account for etcd replication time in vSphere clusters" to "Increase the value of CABPK --bootstrap-token-ttl to account for etcd replication time in vSphere clusters" Oct 22, 2021
@randomvariable randomvariable self-assigned this Oct 22, 2021
@yharish991 yharish991 added area/lcm Related to Cluster Lifecycle management and removed needs-triage Indicates an issue or PR needs to be triaged labels Oct 26, 2021