Decreasing number of replicas in K0sControlPlane is not working properly #459

nekwar · 2024-02-21T14:46:09Z

Details

Environment: vSphere
k0smotron version: v0.8.0
k0s version: tested versions are 1.28.5 and 1.29.1 -- behaviour is similar

Problem summary

Downscaling controllers managed by K0sControlPlane is not working properly. The deletion behaviour is quite unpredictable: sometimes the node is deleted at the Kubernetes level, sometimes it is not. What all deletion cases have in common is that:

node is not deleted from etcd member-list (known issue: ControlNode improvements k0s#3808)
node can't be manually deleted from etcd member list with k0s etcd leave <node-ip> due to etcd cluster "being unhealthy":

root@example-cluster-0:/home/example# k0s etcd leave 172.16.164.134
{"level":"warn","ts":"2024-02-21T11:53:29.659865Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a3efc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
ERRO[2024-02-21 11:53:29] Failed to delete node from cluster            peerID=13807347799361342233 peerURL="https://172.16.164.130:2380/"
Error: etcdserver: unhealthy cluster

More interestingly, the node can be removed from the etcd member list with etcdctl:

root@example-cluster-0:/home/example# ETCDCTL_API=3 etcdctl --cert=/var/lib/k0s/pki/etcd/server.crt --key=/var/lib/k0s/pki/etcd/server.key --cacert /var/lib/k0s/pki/etcd/ca.crt member list
3cc271ed05e79912, started, example-cluster-2, https://172.16.164.134:2380/, https://127.0.0.1:2379/
bf9d8dab47344b19, started, example-cluster-0, https://172.16.164.130:2380/, https://127.0.0.1:2379/
dbe13ac41e24fe7c, started, example-cluster-1, https://172.16.164.135:2380/, https://127.0.0.1:2379/
root@example-cluster-0:/home/example# ETCDCTL_API=3 etcdctl --cert=/var/lib/k0s/pki/etcd/server.crt --key=/var/lib/k0s/pki/etcd/server.key --cacert /var/lib/k0s/pki/etcd/ca.crt member remove 3cc271ed05e79912
Member 3cc271ed05e79912 removed from cluster  958028b2b34428

After that manipulation, the node no longer appears in the output of the k0s etcd member-list command.
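
For anyone repeating the etcdctl workaround, looking up the member ID by IP can be scripted. This is only a sketch: member_id_for_ip is a hypothetical helper name, and it assumes the default comma-separated member list format shown above, where the fourth field is the peer URL.

```shell
# Print the etcd member ID whose peer URL matches the given IP.
# Reads `etcdctl member list` output (default format) on stdin.
# member_id_for_ip is a hypothetical helper, not part of etcdctl or k0s.
member_id_for_ip() {
  # Fields: ID, status, name, peer URL, client URL (separated by ", ").
  # Match "//<ip>:" so that 172.16.164.13 does not match 172.16.164.134.
  awk -F', ' -v ip="$1" 'index($4, "//" ip ":") { print $1 }'
}

# Usage (not executed here; same certificate flags as in the report):
#   ETCDCTL_API=3 etcdctl --cert=... --key=... --cacert ... member list \
#     | member_id_for_ip 172.16.164.134
```

The printed ID can then be passed to etcdctl member remove, as in the transcript above.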

Expected behaviour
The controller node should be properly deleted (at least at the Kubernetes level; I understand that etcd membership is a separate issue) when replicas are scaled down in K0sControlPlane.

nekwar · 2024-02-21T14:47:28Z

The question here is: what is the proper process for node deletion?

IMO, the most proper way in theory would be to cordon and drain the node first and then delete it (similar to kubectl delete node), but I'm not sure whether this can be implemented in k0smotron.
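
To make the idea concrete, here is a rough sketch of the manual sequence I have in mind. It is not what k0smotron does internally; decommission_node, run, and DRY_RUN are hypothetical names used for illustration, and the drain flags are just the ones commonly needed on nodes that run DaemonSets or emptyDir volumes.

```shell
# Sketch of a manual cordon/drain/delete sequence for a single node.
# decommission_node, run, and DRY_RUN are hypothetical names for illustration.

# With DRY_RUN=1 the command is only printed; otherwise it is executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

decommission_node() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: decommission_node <node-name>" >&2
    return 2
  fi
  # Stop scheduling new pods, evict the existing ones, then remove the Node object.
  run kubectl cordon "$node" &&
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
  run kubectl delete node "$node"
}
```

Calling decommission_node example-cluster-2 with DRY_RUN=1 set only prints the three kubectl commands, which is a safe way to review them first. Note that this covers the Kubernetes level only; etcd membership still has to be handled separately.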

twz123 · 2024-02-23T10:17:59Z

node can't be manually deleted from etcd member list with k0s etcd leave <node-ip> due to etcd cluster "being unhealthy"

The right way to specify the peer address that should be removed is k0s etcd leave --peer-address <node-ip>. When <node-ip> is passed as a positional argument instead of as a flag, it is simply ignored, and k0s etcd leave defaults to removing the current node from the cluster. I admit that this is very confusing, and it took me a while to realize it myself.
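
Applied to the node from the report, the invocation can be sketched as below. The leave_peer wrapper and the DRY_RUN variable are hypothetical and exist only to keep the example safe to try; the only real command here is k0s etcd leave with the --peer-address flag.

```shell
# Hypothetical wrapper around `k0s etcd leave --peer-address <node-ip>`.
# Set DRY_RUN=1 to print the command instead of executing it.
leave_peer() {
  peer_ip="$1"
  if [ -z "$peer_ip" ]; then
    echo "usage: leave_peer <node-ip>" >&2
    return 2
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "k0s etcd leave --peer-address $peer_ip"
  else
    # As in the report, this is assumed to run on a controller that remains
    # in the cluster (example-cluster-0), not on the node being removed.
    k0s etcd leave --peer-address "$peer_ip"
  fi
}
```

With DRY_RUN=1, leave_peer 172.16.164.134 prints the command that would remove example-cluster-2 rather than the local node.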

nekwar · 2024-03-01T14:52:28Z

@twz123 Thank you for the PR!

Just a question: wouldn't it make more sense to use the same syntax for the k0s etcd command as standard etcdctl? I think that would be the most user-friendly solution.

twz123 · 2024-03-01T20:32:22Z

I've thought about that, but removing the --peer-address flag would have been a breaking change to the CLI interface that I didn't want to make.

makhov · 2024-03-28T12:23:57Z

@nekwar We have just released k0smotron v0.9.0 with a bunch of improvements, and downscaling should now work properly.

twz123 mentioned this issue Feb 23, 2024

Harden etcd subcommand usage and validation k0sproject/k0s#4118

Merged

k0s-bot mentioned this issue Mar 1, 2024

[Backport release-1.29] Harden etcd subcommand usage and validation k0sproject/k0s#4128

Merged

twz123 mentioned this issue Mar 28, 2024

[Backport release-1.27] Harden etcd subcommand usage and validation k0sproject/k0s#4217

Merged

k0s-bot mentioned this issue Apr 3, 2024

[Backport release-1.26] Harden etcd subcommand usage and validation k0sproject/k0s#4231

Merged

jnummelin closed this as completed Nov 7, 2024
