During KCP rollout, etcd churn can prevent renewal of the bootstrap token causing KCP node registration to fail #5477
+1 for a configurable timeout |
It seems to me that too many things are happening at once. Would it make sense
for upgradeAfter to trigger a more step-by-step operation, with bootstrap
token renewal being the first task, then the rest, etc.?
…On Oct 22, 2021 15:56, "Naadir Jeewa" ***@***.***> wrote:
/kind bug
Follow-up from #2770
*What steps did you take and what happened:*
[A clear and concise description of what the bug is.]
Deploy a 3-node CP cluster using CAPV with kube-vip using PhotonOS (for
some reason, it's more likely to occur here) and then set
kcp.spec.upgradeAfter to trigger a rollout.
Due to a mixture of kube-vip/kube-vip#214 and the fact we haven't
yet implemented etcd learner mode in kubeadm or have full support in
kubernetes, etcd leader switches around many times, with kube-vip leader
election also rotating. During this time, CAPI controllers are unable to
fully reconcile, and neither can kubelet register nodes. Importantly, CABPK
is also unable to renew the bootstrap token. Eventually, etcd replication
completes, but only after the 15-minute bootstrap token timeout. Kubelet node registration
ultimately fails and we end up with an orphaned control plane machine which
is a valid member of the etcd cluster, complicating cleanup.
*What did you expect to happen:*
Rollout succeeds.
*Anything else you would like to add:*
[Miscellaneous information that will assist in solving the issue.]
It might be worth, in these instances, having a configurable bootstrap
token TTL. We can't account for every type of environment with a single longer
TTL, so we should make it configurable and let vendors and cluster operators
decide, since there is a security trade-off.
Furthermore, VMware would like to request a backport to v0.3.x branches.
*Environment:*
- Cluster-api version: v0.3.22
- Kubernetes version: (use kubectl version): v1.21
- OS (e.g. from /etc/os-release): Photon 3
|
FYI CABPK controller manager has a --bootstrap-token-ttl flag |
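For reference, setting that flag could look roughly like this (a sketch only: the deployment and namespace names are the usual clusterctl defaults and may differ per install, and the 30m value is an arbitrary example of the security trade-off mentioned above):

```shell
# Sketch: append --bootstrap-token-ttl to the CABPK controller manager args.
# Deployment/namespace names below are the clusterctl defaults; verify them
# in your management cluster before applying.
kubectl -n capi-kubeadm-bootstrap-system patch deployment \
  capi-kubeadm-bootstrap-controller-manager --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--bootstrap-token-ttl=30m"}]'
```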
Hm, would learner mode really solve this... I wonder if this is more of a problem of pinning the leader programmatically. This is something that k8s wishes to add in the future. There is a KEP about it. |
Also, AFAIK KCP moves leadership only if the current leader is being deleted, so we are already very conservative in doing this operation, and so I don't see a direct connection between the current process and the etcd leader switching around many times. |
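To check whether the leader really is bouncing during a rollout, one can watch etcd member status directly (a sketch; the cert paths are the kubeadm defaults on a control plane node and may differ in your environment):

```shell
# Sketch: observe etcd leadership during a KCP rollout. Run on a control
# plane node; cert/key paths are kubeadm defaults.
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table
# The IS LEADER column shows the current leader; if it changes across
# repeated invocations, that's the churn described in this issue.
```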
We generate a token per machine, based on a Machine resource being created, so KCP isn't in the loop at that stage, just CABPK.
I think this means we won't need the backport, and I've filed vmware-tanzu/tanzu-framework#954
Any pointers on that would be handy. |
There's an interaction between the etcd leader moving and kube-vip's use of Kubernetes resource locking, which makes things especially problematic. |
Interestingly, @voor reports similar behaviour on AWS when cross-zone load balancing is enabled. |
Here is the KEP / issue. Still WIP. Although this is for k8s LE: |
/priority awaiting-more-evidence |
Been speaking to @gab-satchi about looking into some of these thorny etcd thingies. |
/area control-plane |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
/lifecycle frozen |
/triage accepted This should be re-assessed after we improved how token renewal is managed (tokens are now renewed until the machine joins); also, in kubeadm we recently discussed reviving the work on etcd learner mode |
/help |
@fabriziopandini: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Quick update: kubeadm learner mode RFE: |
Also related: had to increase the timeout here for self-hosted upgrade tests. From my analysis the reason here is also etcd churn and side-effects of it (e.g. leader-election failing which makes webhooks unavailable).
|
We should better sequence what's happening here. |
Just fyi regarding etcd learner mode:
|
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
/priority important-longterm |
We now have a flag in CABPK for setting the TTL, and we made a lot of improvements over the last two years. /close |
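For anyone hitting this on older releases: a bootstrap token's expiry is visible in its secret's `.data.expiration` field (secrets of type `bootstrap.kubernetes.io/token` in `kube-system` on the workload cluster), which is just a base64-encoded RFC 3339 timestamp. The sample value below is illustrative:

```shell
# Bootstrap token secrets keep their expiry base64-encoded in
# .data.expiration; decoding is a plain base64 round-trip.
printf 'MjAyMS0xMC0yMlQxNjoxMTowMFo=' | base64 -d; echo
# → 2021-10-22T16:11:00Z
```

Comparing that timestamp against the time the machine actually joined is a quick way to confirm the 15-minute default TTL was the failure mode.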
@fabriziopandini: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |