Control plane scale-down causes etcd to lose quorum and never recover #263

Closed
Danil-Grigorev opened this issue Feb 12, 2024 · 0 comments · Fixed by #265
Labels
  • kind/bug: Something isn't working
  • needs-priority: Indicates an issue or PR needs a priority assigning to it
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

What happened:
After initial provisioning, scaling the control plane down to 1 replica causes the API server to stop responding, and it never recovers.

etcd logs show this:

{"level":"warn","ts":"2024-02-12T12:18:56.818575Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"56c279675ce0ee85","local-member-attributes":"{Name:caprke2-e2e-60mj1n-control-plane-xdffd-d9da8b76 ClientURLs:[https://172.18.0.10:2379]}","request-path":"/0/members/56c279675ce0ee85/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-02-12T12:18:56.904853Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5320353a7d98bdee","rtt":"0s","error":"dial tcp 172.18.0.4:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.904964Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5320353a7d98bdee","rtt":"0s","error":"dial tcp 172.18.0.4:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.9283Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ef183bce98bbb166","rtt":"0s","error":"dial tcp 172.18.0.9:2380: connect: no route to host"}
{"level":"warn","ts":"2024-02-12T12:18:56.930688Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"ef183bce98bbb166","rtt":"0s","error":"dial tcp 172.18.0.9:2380: connect: no route to host"}
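
To make the warnings above easier to read, here is a minimal sketch that groups the prober warnings by remote peer. The two sample lines are copied from the log above and trimmed to the fields the script uses; the function name `unreachable_peers` is hypothetical and not part of etcd or the provider.

```python
import json

# Two of the warnings above, trimmed to the fields used here.
LOG_LINES = [
    '{"level":"warn","msg":"prober detected unhealthy status",'
    '"remote-peer-id":"5320353a7d98bdee",'
    '"error":"dial tcp 172.18.0.4:2380: connect: no route to host"}',
    '{"level":"warn","msg":"prober detected unhealthy status",'
    '"remote-peer-id":"ef183bce98bbb166",'
    '"error":"dial tcp 172.18.0.9:2380: connect: no route to host"}',
]


def unreachable_peers(lines):
    """Map each remote peer ID to the set of errors the prober reported for it."""
    peers = {}
    for line in lines:
        entry = json.loads(line)
        # Skip anything that is not a peer health warning.
        if entry.get("msg") != "prober detected unhealthy status":
            continue
        peers.setdefault(entry["remote-peer-id"], set()).add(entry["error"])
    return peers


for peer_id, errors in sorted(unreachable_peers(LOG_LINES).items()):
    print(peer_id, sorted(errors))
```

In this excerpt the remaining member reports two distinct peers, both with `no route to host` on the peer port 2380.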

caprke2-e2e-60mj1n-control-plane-xdffd-d9da8b76 is a node that was recently removed, yet the server keeps trying to reach it indefinitely.

What did you expect to happen:
Scaling control plane nodes should not affect API server health.

How to reproduce it:

  • Create an RKE2 cluster with 3 control plane replicas.
  • Wait for the cluster to become available.
  • Scale the control plane down to 1 replica.
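
The quorum loss in these steps can be sketched with simple majority arithmetic. This is an illustration, not a confirmed root cause: it assumes the two deleted nodes were still registered as etcd members when their machines went away, which the report does not state. The helper names `quorum` and `has_quorum` are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(registered: int, reachable: int) -> bool:
    """True if the reachable members still form a majority of the registered ones."""
    return reachable >= quorum(registered)


# Three registered members, but only the surviving node is reachable.
print(has_quorum(registered=3, reachable=1))

# The same single node after the two stale members have been removed.
print(has_quorum(registered=1, reachable=1))
```

Under that assumption, one reachable member out of three registered cannot form a majority, while a single-member cluster can.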

Anything else you would like to add:

Environment:

  • rke provider version: rke2 version v1.28.1+rke2r1
  • OS (e.g. from /etc/os-release):