
KCP upgrade is failing because of wrong member in etcd #5509

Closed · MaxRink opened this issue Oct 27, 2021 · 12 comments
Labels: area/control-plane · kind/bug · lifecycle/rotten · priority/awaiting-more-evidence
Milestone: v1.2

@MaxRink (Contributor) commented Oct 27, 2021

What steps did you take and what happened:
After starting a KCP upgrade, etcd on the new nodes never got up and running, so the API server failed to come up and the upgrade never succeeded.
After debugging we found that there was a wrong peer in etcd:

root@ceco-1-678vn:/home/a92615428# crictl exec b558198d7b88e etcdctl member list --key /etc/kubernetes/pki/etcd/peer.key --cert /etc/kubernetes/pki/etcd/peer.crt --cacert /etc/kubernetes/pki/etcd/ca.crt
11f9c022fe4cf41f, started, ceco-1-tnz97.reftmdc.bn.schiff.telekom.de, https://172.22.132.236:2380, https://172.22.132.236:2379, false
bb9f0725b3f81fa1, started, ceco-1-678vn.reftmdc.bn.schiff.telekom.de, https://172.22.132.232:2380, https://172.22.132.232:2379, false
d35430096176221f, started, ceco-1-6zjg4.reftmdc.bn.schiff.telekom.de, https://172.22.132.229:2380, https://172.22.132.229:2379, false
e38f7103cfd379b6, unstarted, , https://172.22.132.233:2380, , false

After removing that member, everything started working.
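
For reference, the cleanup was roughly the following (a sketch; the container ID, member ID, and cert paths are the ones from the listing above):

# Remove the stale, unstarted member by its hex ID, using the same peer certs
# that the member list command used.
crictl exec b558198d7b88e etcdctl member remove e38f7103cfd379b6 \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --cacert /etc/kubernetes/pki/etcd/ca.crt
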
The question is how that member made it into etcd, given that the cluster has only ever been touched by CAPI.
What did you expect to happen:
Upgrade succeeds without manual intervention
Anything else you would like to add:
https://kubernetes.slack.com/archives/C8TSNPY4T/p1635330143156400

Environment:

  • Cluster-api version: latest alpha4 release
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version): 1.20.9 -> 1.20.11
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2021
@randomvariable (Member)

Did you have a machine healthcheck on this cluster, and are you using kube-vip or something else for the LB?

@MaxRink (Contributor, Author) commented Oct 27, 2021

Yes, it's kube-vip, and yes, it has MHCs for both workers and control planes.
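
Roughly like this on the control plane side (a from-memory sketch, not the exact manifest; the name and timeouts are illustrative):

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineHealthCheck
metadata:
  name: ceco-1-control-plane  # illustrative name
spec:
  clusterName: ceco-1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    # Remediate a control plane machine whose Node stays NotReady/Unknown too long.
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s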

@fabriziopandini (Member)

/priority awaiting-more-evidence

@MaxRink could you provide KCP logs and a dump of KCP + control plane machines?

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 28, 2021
@MaxRink (Contributor, Author) commented Oct 28, 2021

Unfortunately not; I needed to fix the cluster as it was in use.
I added a few excerpts in Slack though; I can copy them over here.

@MaxRink (Contributor, Author) commented Oct 28, 2021

https://gist.github.com/MaxRink/8d95f60fc518c6cb053aec37f4fe74ed
KCP status:

Status:
  Conditions:
    Last Transition Time:  2021-10-27T10:42:34Z
    Message:               Rolling 3 replicas with outdated spec (1 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2021-05-31T19:47:31Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-05-31T19:12:44Z
    Status:                True
    Type:                  CertificatesAvailable
    Last Transition Time:  2021-10-27T10:43:54Z
    Message:               Following machines are reporting control plane errors: ceco-1-j98cq
    Reason:                ControlPlaneComponentsUnhealthy
    Severity:              Error
    Status:                False
    Type:                  ControlPlaneComponentsHealthy
    Last Transition Time:  2021-10-27T10:43:54Z
    Message:               etcd member 16397449029464848822 (Name not yet assigned) does not have a corresponding machine; Following machines are reporting etcd member errors: ceco-1-j98cq
    Reason:                EtcdClusterUnhealthy
    Severity:              Error
    Status:                False
    Type:                  EtcdClusterHealthyCondition
    Last Transition Time:  2021-09-29T16:47:57Z
    Status:                True
    Type:                  MachinesCreated
    Last Transition Time:  2021-10-27T10:43:50Z
    Status:                True
    Type:                  MachinesReady
    Last Transition Time:  2021-10-27T10:42:34Z
    Message:               Rolling 3 replicas with outdated spec (1 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  MachinesSpecUpToDate
    Last Transition Time:  2021-10-27T10:42:32Z
    Message:               Scaling down control plane to 3 replicas (actual 4)
    Reason:                ScalingDown
    Severity:              Warning
    Status:                False
    Type:                  Resized
  Initialized:             true
  Observed Generation:     11091
  Ready:                   true
  Ready Replicas:          3
  Replicas:                4
  Selector:                cluster.x-k8s.io/cluster-name=ceco-1,cluster.x-k8s.io/control-plane
  Unavailable Replicas:    1
  Updated Replicas:        1
  Version:                 v1.20.9
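
For what it's worth, etcd member 16397449029464848822 in the condition above is the decimal form of the unstarted member e38f7103cfd379b6 from the member list. The orphan can also be spotted by hand along these lines (a sketch; the label selector and field paths are assumptions):

# Member names as etcd sees them (third field of the member list output).
crictl exec b558198d7b88e etcdctl member list \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --cacert /etc/kubernetes/pki/etcd/ca.crt | awk -F', ' '{print $3}' | sort

# Node names behind the control plane Machines, from the management cluster.
kubectl get machines \
  -l 'cluster.x-k8s.io/cluster-name=ceco-1,cluster.x-k8s.io/control-plane' \
  -o jsonpath='{range .items[*]}{.status.nodeRef.name}{"\n"}{end}' | sort

# Any name in the first list that is missing from the second (here the blank
# name of the unstarted member) has no corresponding Machine.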

@randomvariable (Member)

@MaxRink this kind of smells of #5477

@gab-satchi and @srm09 have been testing a workaround in vmware-tanzu/tanzu-framework#954. I don't know if that resolved it.

@randomvariable (Member)

/area control-plane

@k8s-ci-robot k8s-ci-robot added the area/control-plane Issues or PRs related to control-plane lifecycle management label Nov 2, 2021
@fabriziopandini (Member)

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
