Decreasing number of replicas in K0sControlPlane is not working properly #459

Closed
nekwar opened this issue Feb 21, 2024 · 5 comments

Comments

@nekwar
Contributor

nekwar commented Feb 21, 2024

Details

  • Environment: vSphere
  • k0smotron version: v0.8.0
  • k0s version: tested versions are 1.28.5 and 1.29.1 -- behaviour is similar

Problem summary

Downscaling controllers managed by K0sControlPlane does not work properly. Deletion behaviour is unpredictable: sometimes the node is deleted at the Kubernetes level, sometimes it is not. What all deletion cases have in common is that

  • node is not deleted from etcd member-list (known issue: ControlNode improvements k0s#3808)
  • node can't be manually deleted from etcd member list with k0s etcd leave <node-ip> due to etcd cluster "being unhealthy":
root@example-cluster-0:/home/example# k0s etcd leave 172.16.164.134
{"level":"warn","ts":"2024-02-21T11:53:29.659865Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a3efc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
ERRO[2024-02-21 11:53:29] Failed to delete node from cluster            peerID=13807347799361342233 peerURL="https://172.16.164.130:2380/"
Error: etcdserver: unhealthy cluster

More interestingly, the node can be removed from the etcd member list with etcdctl:

root@example-cluster-0:/home/example# ETCDCTL_API=3 etcdctl --cert=/var/lib/k0s/pki/etcd/server.crt --key=/var/lib/k0s/pki/etcd/server.key --cacert /var/lib/k0s/pki/etcd/ca.crt member list
3cc271ed05e79912, started, example-cluster-2, https://172.16.164.134:2380/, https://127.0.0.1:2379/
bf9d8dab47344b19, started, example-cluster-0, https://172.16.164.130:2380/, https://127.0.0.1:2379/
dbe13ac41e24fe7c, started, example-cluster-1, https://172.16.164.135:2380/, https://127.0.0.1:2379/
root@example-cluster-0:/home/example# ETCDCTL_API=3 etcdctl --cert=/var/lib/k0s/pki/etcd/server.crt --key=/var/lib/k0s/pki/etcd/server.key --cacert /var/lib/k0s/pki/etcd/ca.crt member remove 3cc271ed05e79912
Member 3cc271ed05e79912 removed from cluster  958028b2b34428

After that manipulation, the node no longer appears in the output of the k0s etcd member-list command.
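The manual lookup above (find the member ID for a given peer IP, then pass it to member remove) can be sketched as a small filter. This is only an illustration: the helper name member_id_for_ip is hypothetical, and it assumes the comma-separated member list format shown in the output above, with the peer URL in the fourth field.

```shell
#!/bin/sh
# Hypothetical helper (not part of k0s or etcdctl): print the ID of the etcd
# member whose peer URL contains the given IP address.
# Reads `etcdctl member list` output on stdin, in the format shown above.
member_id_for_ip() {
    ip="$1"
    awk -F', ' -v ip="$ip" '
        # Field 4 is the peer URL, e.g. https://172.16.164.134:2380/
        # Matching on "//<ip>:" avoids partial matches such as 172.16.164.13.
        index($4, "//" ip ":") > 0 { print $1 }
    '
}

# Example usage on a controller node (commented out; requires etcdctl and the k0s etcd certificates):
# ETCDCTL_API=3 etcdctl --cert=/var/lib/k0s/pki/etcd/server.crt \
#     --key=/var/lib/k0s/pki/etcd/server.key \
#     --cacert /var/lib/k0s/pki/etcd/ca.crt member list \
#     | member_id_for_ip 172.16.164.134
```

The printed ID can then be passed to etcdctl member remove, as in the session above.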

Expected behaviour
The controller node should be properly deleted (at least at the Kubernetes level; I understand that etcd membership is a separate issue) when replicas are scaled down in K0sControlPlane.
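For readers who want to reproduce the scale-down, one possible way to lower the replica count is a JSON merge patch on spec.replicas. This is a sketch under assumptions: the helper replicas_patch is hypothetical, the object name example-cluster is only guessed from the host names above, and the issue does not say how the replica count was actually changed.

```shell
#!/bin/sh
# Hypothetical helper: build a JSON merge patch that sets spec.replicas.
replicas_patch() {
    case "$1" in
        # Reject empty or non-numeric input before building the patch.
        ''|*[!0-9]*)
            echo "error: replicas must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '{"spec":{"replicas":%s}}\n' "$1"
}

# Example usage (commented out; assumes a K0sControlPlane object named
# "example-cluster" in the current namespace of the management cluster):
# kubectl patch k0scontrolplane example-cluster --type merge -p "$(replicas_patch 2)"
```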

@nekwar
Contributor Author

nekwar commented Feb 21, 2024

The question here is what is the proper process of node deletion?

IMO, the most proper way would, in theory, be to cordon/drain the node first and then delete it (similar to kubectl delete node), but I'm not sure whether this can be implemented in k0smotron.
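Done by hand, the cordon/drain/delete sequence described here might look like the sketch below. It assumes the controller is registered as a Kubernetes Node, and it is not how k0smotron implements removal. The function names and the DRY_RUN switch are illustrative; by default the commands are only printed.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: cordon, drain, then delete a node, stopping at the first failure.
remove_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: remove_node <node-name>" >&2
        return 1
    fi
    run kubectl cordon "$node" &&
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
    run kubectl delete node "$node"
}

# Example: print the three commands for review without touching the cluster.
remove_node example-cluster-2
```

Setting DRY_RUN=0 executes the commands instead of printing them. The etcd member would still need to be removed separately.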

@twz123
Member

twz123 commented Feb 23, 2024

node can't be manually deleted from etcd member list with k0s etcd leave <node-ip> due to etcd cluster "being unhealthy"

The right way to specify the peer address to remove is k0s etcd leave --peer-address <node-ip>. When <node-ip> is passed as an argument instead of as a flag, it is simply ignored, and k0s etcd leave defaults to removing the current node from the cluster. I admit that this is very confusing, and it took me a while to realize it myself.
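To make the difference concrete, here is a minimal sketch that always builds the command with the explicit flag and refuses an empty address, so the target is never silently dropped. The wrapper name etcd_leave_cmd is hypothetical, and the IP address is the one from the report above.

```shell
#!/bin/sh
# Hypothetical wrapper: build a `k0s etcd leave` command line that always
# carries the peer address as --peer-address rather than as a positional argument.
etcd_leave_cmd() {
    peer="$1"
    if [ -z "$peer" ]; then
        echo "error: peer address required" >&2
        return 1
    fi
    printf 'k0s etcd leave --peer-address %s\n' "$peer"
}

# Print the command for review; run the printed line on a controller node.
etcd_leave_cmd 172.16.164.134
```

This also explains the log in the original report: the failed call shows the peer URL of the node the command was run on (172.16.164.130), not the address that was passed as an argument.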

@nekwar
Contributor Author

nekwar commented Mar 1, 2024

@twz123 Thank you for the PR!

Just a question: wouldn't it make more sense for the k0s etcd command to use the same syntax as the standard etcdctl? I think that would be the most user-friendly solution.

@twz123
Member

twz123 commented Mar 1, 2024

I've thought about that, but removing the --peer-address flag would have been a breaking change to the CLI interface that I didn't want to make.

@makhov
Contributor

makhov commented Mar 28, 2024

@nekwar We have just released k0smotron v0.9.0 with a bunch of improvements, and downscaling should now work properly.
