
KCP upgrade is failing because of wrong member in etcd #5509

Closed · MaxRink opened this issue Oct 27, 2021 · 12 comments
Labels: area/control-plane · kind/bug · lifecycle/rotten · priority/awaiting-more-evidence
Milestone: v1.2

@MaxRink (Contributor) commented Oct 27, 2021

What steps did you take and what happened:
After starting a KCP upgrade, etcd on the new nodes never got up and running, so the API server failed to come up and the upgrade never succeeded.
After debugging we found that there was a wrong peer in etcd:

root@ceco-1-678vn:/home/a92615428# crictl exec b558198d7b88e etcdctl member list --key /etc/kubernetes/pki/etcd/peer.key --cert /etc/kubernetes/pki/etcd/peer.crt --cacert /etc/kubernetes/pki/etcd/ca.crt
11f9c022fe4cf41f, started, ceco-1-tnz97.reftmdc.bn.schiff.telekom.de, https://172.22.132.236:2380, https://172.22.132.236:2379, false
bb9f0725b3f81fa1, started, ceco-1-678vn.reftmdc.bn.schiff.telekom.de, https://172.22.132.232:2380, https://172.22.132.232:2379, false
d35430096176221f, started, ceco-1-6zjg4.reftmdc.bn.schiff.telekom.de, https://172.22.132.229:2380, https://172.22.132.229:2379, false
e38f7103cfd379b6, unstarted, , https://172.22.132.233:2380, , false

After removing that member, everything started working.
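
For reference, the cleanup was roughly the following (a sketch; the container ID, member ID, and cert paths are the ones from the listing above):

# Remove the stale, unstarted member by its hex ID, using the same peer certs
# that the member list command used.
crictl exec b558198d7b88e etcdctl member remove e38f7103cfd379b6 \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --cacert /etc/kubernetes/pki/etcd/ca.crt
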
The question is how that member made it into etcd, given that the cluster has only ever been touched by CAPI.
What did you expect to happen:
Upgrade succeeds without manual intervention
Anything else you would like to add:
https://kubernetes.slack.com/archives/C8TSNPY4T/p1635330143156400

Environment:

  • Cluster-api version: latest alpha4 release
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version): 1.20.9 -> 1.20.11
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2021
@randomvariable (Member)

Did you have a machine healthcheck on this cluster, and are you using kube-vip or something else for the LB?

@MaxRink (Contributor, Author) commented Oct 27, 2021

Yes, it's kube-vip, and yes, it has MHCs for both workers and control planes.
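
Roughly like this on the control plane side (a from-memory sketch, not the exact manifest; the name and timeouts are illustrative):

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineHealthCheck
metadata:
  name: ceco-1-control-plane  # illustrative name
spec:
  clusterName: ceco-1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    # Remediate a control plane machine whose Node stays NotReady/Unknown too long.
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s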

@fabriziopandini (Member)

/priority awaiting-more-evidence

@MaxRink could you provide KCP logs and a dump of KCP + control plane machines?

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 28, 2021
@MaxRink (Contributor, Author) commented Oct 28, 2021

Unfortunately not; I needed to fix the cluster as it was in use.
I added a few excerpts in Slack though; I can copy them over here.

@MaxRink (Contributor, Author) commented Oct 28, 2021

https://gist.github.com/MaxRink/8d95f60fc518c6cb053aec37f4fe74ed
KCP status:

Status:
  Conditions:
    Last Transition Time:  2021-10-27T10:42:34Z
    Message:               Rolling 3 replicas with outdated spec (1 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2021-05-31T19:47:31Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-05-31T19:12:44Z
    Status:                True
    Type:                  CertificatesAvailable
    Last Transition Time:  2021-10-27T10:43:54Z
    Message:               Following machines are reporting control plane errors: ceco-1-j98cq
    Reason:                ControlPlaneComponentsUnhealthy
    Severity:              Error
    Status:                False
    Type:                  ControlPlaneComponentsHealthy
    Last Transition Time:  2021-10-27T10:43:54Z
    Message:               etcd member 16397449029464848822 (Name not yet assigned) does not have a corresponding machine; Following machines are reporting etcd member errors: ceco-1-j98cq
    Reason:                EtcdClusterUnhealthy
    Severity:              Error
    Status:                False
    Type:                  EtcdClusterHealthyCondition
    Last Transition Time:  2021-09-29T16:47:57Z
    Status:                True
    Type:                  MachinesCreated
    Last Transition Time:  2021-10-27T10:43:50Z
    Status:                True
    Type:                  MachinesReady
    Last Transition Time:  2021-10-27T10:42:34Z
    Message:               Rolling 3 replicas with outdated spec (1 replicas up to date)
    Reason:                RollingUpdateInProgress
    Severity:              Warning
    Status:                False
    Type:                  MachinesSpecUpToDate
    Last Transition Time:  2021-10-27T10:42:32Z
    Message:               Scaling down control plane to 3 replicas (actual 4)
    Reason:                ScalingDown
    Severity:              Warning
    Status:                False
    Type:                  Resized
  Initialized:             true
  Observed Generation:     11091
  Ready:                   true
  Ready Replicas:          3
  Replicas:                4
  Selector:                cluster.x-k8s.io/cluster-name=ceco-1,cluster.x-k8s.io/control-plane
  Unavailable Replicas:    1
  Updated Replicas:        1
  Version:                 v1.20.9
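
For what it's worth, etcd member 16397449029464848822 in the condition above is the decimal form of the unstarted member e38f7103cfd379b6 from the member list. The orphan can also be spotted by hand along these lines (a sketch; the label selector and field paths are assumptions):

# Member names as etcd sees them (third field of the member list output).
crictl exec b558198d7b88e etcdctl member list \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --cacert /etc/kubernetes/pki/etcd/ca.crt | awk -F', ' '{print $3}' | sort

# Node names behind the control plane Machines, from the management cluster.
kubectl get machines \
  -l 'cluster.x-k8s.io/cluster-name=ceco-1,cluster.x-k8s.io/control-plane' \
  -o jsonpath='{range .items[*]}{.status.nodeRef.name}{"\n"}{end}' | sort

# Any name in the first list that is missing from the second (here the blank
# name of the unstarted member) has no corresponding Machine.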

@randomvariable (Member)

@MaxRink this kind of smells of #5477

@gab-satchi and @srm09 have been testing a workaround in vmware-tanzu/tanzu-framework#954. I don't know if that resolved it.

@randomvariable (Member)

/area control-plane

@k8s-ci-robot k8s-ci-robot added the area/control-plane Issues or PRs related to control-plane lifecycle management label Nov 2, 2021
@fabriziopandini (Member)

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
