Control plane scale-down causes etcd to lose quorum and never recover #263

Danil-Grigorev · 2024-02-12T12:21:42Z

What happened:
After initial provisioning, scaling the control plane down to 1 replica causes the API server to stop responding, and it never recovers.

etcd logs show this:

{"level":"warn","ts":"2024-02-12T12:18:56.818575Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"56c279675ce0ee85","local-member-attributes":"{Name:caprke2-e2e-60mj1n-control-plane-xdffd-d9da8b76 ClientURLs:[https://172.18.0.10:2379]}","request-path":"/0/members/56c279675ce0ee85/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-02-12T12:18:56.904853Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5320353a7d98bdee","rtt":"0s","error":"dial tcp 172.18.0.4:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.904964Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5320353a7d98bdee","rtt":"0s","error":"dial tcp 172.18.0.4:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.9283Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ef183bce98bbb166","rtt":"0s","error":"dial tcp 172.18.0.9:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.930688Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"ef183bce98bbb166","rtt":"0s","error":"dial tcp 172.18.0.9:2380: connect: no route to host"}

caprke2-e2e-60mj1n-control-plane-xdffd-d9da8b76 is a node that was recently removed, yet the server keeps trying to reach it indefinitely.

What did you expect to happen:
Scaling control plane nodes should not affect API server health.

How to reproduce it:

1. Create an RKE2 cluster with 3 control plane replicas.
2. Wait for the cluster to become available.
3. Scale the control plane down to 1 replica.
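The arithmetic behind these steps can be sketched as follows. etcd needs a majority of its registered members to be reachable, so the sketch assumes (the issue does not state this explicitly) that the two deleted machines were still registered as etcd members when they went away. The function names are illustrative only.

```python
def quorum(registered_members: int) -> int:
    """Majority needed by a Raft-based cluster such as etcd."""
    return registered_members // 2 + 1


def has_quorum(registered_members: int, reachable_members: int) -> bool:
    """True if enough registered members are reachable to make progress."""
    return reachable_members >= quorum(registered_members)


# Assumed scenario: 3 members stay registered, only 1 is still running.
stale_membership = has_quorum(registered_members=3, reachable_members=1)

# Same scale-down, but the departed members were removed from the
# member list first, so only 1 member remains registered.
reconciled_membership = has_quorum(registered_members=1, reachable_members=1)

print(stale_membership, reconciled_membership)
```

Under that assumption the surviving member cannot form a majority on its own, which would match the timeouts in the logs, whereas removing the departed members from the member list before their machines disappear keeps the single remaining member operational.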

Anything else you would like to add:

Environment:

rke provider version: rke2 version v1.28.1+rke2r1
OS (e.g. from /etc/os-release):


Danil-Grigorev added the kind/bug, needs-priority, and needs-triage labels Feb 12, 2024

Danil-Grigorev self-assigned this Feb 12, 2024

Danil-Grigorev added this to CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Feb 13, 2024

Danil-Grigorev moved this to In Progress (5 max) in CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Feb 13, 2024

This was referenced Feb 13, 2024

[e2e] Test have been failing for a long time #242

Closed

🐛 Reconcile etcd members on control plane scale down #265

Merged

Danil-Grigorev moved this from In Progress (8 max) to PR to be reviewed in CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Feb 20, 2024

Danil-Grigorev mentioned this issue Feb 26, 2024

✨Simplify alternative control-plane implementation kubernetes-sigs/cluster-api#10198

Closed

richardcase moved this from PR to be reviewed to CAPI Backlog in CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Feb 27, 2024

Danil-Grigorev closed this as completed in #265 Apr 19, 2024

github-project-automation bot moved this from CAPI Backlog to Done in CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Apr 19, 2024

Danil-Grigorev mentioned this issue Sep 9, 2024

Avoid deleting etcd member before node is drained #434

Closed
