[KCP] Fails to upgrade single node control plane #2915
Comments
This seems like it needs to be investigated
I tried reproducing with CAPA and everything worked as it should: the new machine appears and joins the cluster, the old machine is deleted, and its node disappears when it's removed
I am testing it on a clusterctl-initiated cluster with CAPD. I am investigating a bit more to see if this is related to image tag upgrades; will update soon.
The e2e test (docker_upgrade_test.go) fails when I change the control-plane replica count from 3 to 1. E.g., below
I also tried upgrading just the Kubernetes version and not touching the image tags. Same result: the old node is dangling. Since @benmoss confirmed that it is working on AWS, I suspect this is a CAPD issue.
@sedefsavas scaling down KCP replicas isn't a supported use case, @detiber can you confirm?
Scale down is something that we had originally intended to support in the proposal (mainly as a pre-requisite for upgrade), not sure if anything has changed since, though. |
I was under the assumption that we were not allowing going 1 replica -> 3 replicas -> 1 replica
I see this issue with CAPV too. It is happening more often than not. @benmoss can you redo this test for CAPA too, to see if it is consistently succeeding?
Is scale-down not working for this node? Can you turn up the logs in the controller and try to trace what happens?
Yes, in the scale-down. It is failing to remove the etcd member, hence the machine is never deleted. I don't understand why it happens only sometimes, though.
Sounds like it could be a timing issue; it'd be great to have an exact trace of when it fails
Can you give some more details of what the change you're making is? You're upgrading Kubernetes, etcd, and CoreDNS all at the same time? |
No, only upgrading the Kubernetes version.
It seems like that failure is related to the leaderCandidate not having a NodeRef yet, which is a little strange.
Aren't we waiting until all the nodes in the control plane are ready before proceeding to delete the older machines?
I don't see a node-ready check. Even without a CNI installed, 3-node control-plane upgrades are working fine.
Can you run a test locally with a custom build, after adding a check that the NodeRef is there for the leaderCandidate, and see if that fixes it?
Another set of errors I see, at the times it does not panic during the upgrade, is an etcd remove-member error:
Now testing what @vincepri suggested.
If I had to guess, it seems like it's trying to remove the etcd member too soon
@vincepri NodeRef is not the issue; we already wait for kube-apiserver to be ready.
/assign |
We shouldn't have panics though, so we need a check in place somewhere before doing that scale-down
We need to wait for the NodeRef to be on Machines |
I have seen that consistently in metal3-dev-env. @sedefsavas do you have any update on this issue, or a timeline?
@vincepri Thanks, will follow the issue.
What steps did you take and what happened:
On a single-node control plane, I upgraded the Kubernetes version and the etcd and CoreDNS image tags by modifying the KCP object. A new node with the upgraded Kubernetes version is created and the old node is physically deleted, but the old Node object and all Pods that were on the old node are left dangling.
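The steps above boil down to bumping a few fields on the KubeadmControlPlane object; a hedged sketch of the edit (field names follow the v1alpha3-era KCP API, and the object name, versions, and image tags are illustrative, not the exact values used in this report):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical name
spec:
  replicas: 1
  version: v1.18.2                 # bumped Kubernetes version
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          imageTag: 3.4.3-0        # bumped etcd image tag
      dns:
        imageTag: 1.6.7            # bumped CoreDNS image tag
```

Applying this change triggers the rolling upgrade: KCP creates the new machine, then scales the old one down.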
What did you expect to happen:
I expected KCP to remove the Node object and clean up the resources that were on that node.
/kind bug
/area control-plane