✨Forward etcd leadership from machine that is being deleted #2525
Conversation
Hi @alexander-demichev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
A few points to address, but I like this approach!
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
/ok-to-test
/assign
/milestone v0.3.0
@@ -755,6 +761,12 @@ func (r *KubeadmControlPlaneReconciler) reconcileDelete(ctx context.Context, clu
	for i := range machinesToDelete {
		m := machinesToDelete[i]
		logger := logger.WithValues("machine", m)
		// If etcd leadership is on machine that is about to be deleted, move it to first follower
		if err := workloadCluster.ForwardEtcdLeadership(ctx, m); err != nil {
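For context, a minimal sketch of what a ForwardEtcdLeadership-style helper could look like, assuming an etcd v3.4 clientv3 client and kubeadm's convention of naming etcd members after their nodes; `forwardLeadership`, its `leaderID` parameter, and the package name are illustrative assumptions, not the PR's actual code:

```go
// A minimal sketch, not the PR's implementation: forwardLeadership, the
// leaderID parameter, and the package name are assumptions.
package etcdutil

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/clientv3" // etcd v3.4 import path
)

// forwardLeadership moves etcd leadership off the member that backs
// machineName (kubeadm names etcd members after their nodes) and hands it
// to the first other member in the list, i.e. the "first follower".
func forwardLeadership(ctx context.Context, cli *clientv3.Client, machineName string, leaderID uint64) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}

	// Resolve the etcd member that corresponds to the machine being deleted.
	var currentID uint64
	found := false
	for _, m := range resp.Members {
		if m.Name == machineName {
			currentID, found = m.ID, true
			break
		}
	}

	// Nothing to do if the machine has no member or is not the leader.
	if !found || currentID != leaderID {
		return nil
	}

	// Transfer leadership to the first member that is not being deleted.
	// Note: etcd expects MoveLeader to be sent to the current leader.
	for _, m := range resp.Members {
		if m.ID != currentID {
			_, err := cli.MoveLeader(ctx, m.ID)
			return err
		}
	}
	return fmt.Errorf("no other etcd members to move leadership to")
}
```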
It's not clear to me what the advantages of doing this are.
We appear to be removing a cluster with this operation. All the other machines are gone. We get to this section, and we're deleting the remaining control plane machines. Since we cordon and drain before a machine is deleted, it appears that we'll just pass etcd leadership around in a circle: 1->2->3->1 by the time this for loop completes. It's also not clear why we need to do this at all, since the cluster is going away.
Nice catch! The original intention was to do this for control plane scale down operations rather than control plane deletion.
@detiber that makes a lot more sense. I thought it was something like that, but the linked issue was not very specific.
👍
Moving this out of v0.3.0 given that it looks like an improvement that can go in later.

/milestone v0.3.x
@alexander-demichev are you still interested in working on this change?
Per @michaelgugino's comment, we need to make sure the change targets only upgrades, rather than doing this when we're deleting the whole cluster. Ideally, it'd be great to move etcd leadership to the first newly-created machine during an upgrade; this should keep the leader stable during the upgrade process and minimize possible disruptions.
@vincepri Yes, I'm still interested in this :)
Please keep private identifiers private unless they need to become public
@chuckha fixed
I think some basic unit tests in the workload_cluster_test.go file would be really good to add. We won't need to add anything to the reconciler as that logic doesn't need testing there. It might be nice to add an e2e test for scale down, but that, I think, is outside the scope of this PR and can be done in parallel to this work.

/approve
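As a rough illustration of the kind of unit test suggested above, a test along these lines could pin down the "move to first follower" loop; `fakeLeaderMover` and the package layout are invented for the example, not the PR's test code:

```go
// An illustrative skeleton only: fakeLeaderMover and this package layout
// are made up for the example and are not the PR's test code.
package etcdutil

import (
	"context"
	"testing"
)

// fakeLeaderMover records which member IDs leadership was moved to.
type fakeLeaderMover struct {
	movedTo []uint64
}

func (f *fakeLeaderMover) MoveLeader(_ context.Context, id uint64) error {
	f.movedTo = append(f.movedTo, id)
	return nil
}

func TestLeadershipMovesToFirstFollower(t *testing.T) {
	fake := &fakeLeaderMover{}
	members := []uint64{1, 2, 3}
	current := uint64(1) // the member on the machine being deleted

	// Mirror the loop under review: move leadership to the first member
	// that is not the one being deleted, then stop.
	for _, id := range members {
		if id != current {
			if err := fake.MoveLeader(context.Background(), id); err != nil {
				t.Fatalf("MoveLeader: %v", err)
			}
			break
		}
	}

	if len(fake.movedTo) != 1 || fake.movedTo[0] != 2 {
		t.Fatalf("expected leadership moved to member 2 exactly once, got %v", fake.movedTo)
	}
}
```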
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alexander-demichev, chuckha. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
if member.ID != currentMember.ID {
	err := etcdClient.MoveLeader(ctx, member.ID)
This should make sure that the new member isn't a machine that's going to be deleted; a (potentially) simple solution would be to always pick the last created machine/etcd member, as in the sketch below.
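A hedged sketch of that suggestion; `machineInfo` and `newestMachine` are invented names, and in the real code the timestamps would come from the Machines' CreationTimestamp fields:

```go
// A hedged sketch of the suggestion above; machineInfo and newestMachine
// are invented names, not the project's API.
package etcdutil

import (
	"sort"
	"time"
)

// machineInfo pairs a machine's name with its creation time.
type machineInfo struct {
	Name    string
	Created time.Time
}

// newestMachine returns the name of the most recently created machine; on
// kubeadm clusters it maps to an etcd member of the same name, making it a
// safe transferee during scale down (it won't be deleted next).
func newestMachine(machines []machineInfo) string {
	if len(machines) == 0 {
		return ""
	}
	// Sort newest first; note this reorders the caller's slice.
	sort.Slice(machines, func(i, j int) bool {
		return machines[i].Created.After(machines[j].Created)
	})
	return machines[0].Name
}
```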
@alexander-demichev do you have time to tackle the above? or we can do it in a follow-up PR
If you are fine with this PR then feel free to merge. A follow-up sounds good; if it's not urgent, I can try to put something together during the week.
/lgtm
/milestone v0.3.2
During the scale-down process, we need to be sure that the control plane machine that is about to be deleted is not the etcd leader. This PR always moves the leadership to the first follower. It also introduces a few minor changes to the etcd client, which was missing the ability to get the leader ID.
Closes #2398
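Since the description notes the etcd client was missing the ability to get the leader ID, here is a minimal sketch of how that could be obtained with go.etcd.io/etcd/clientv3; `leaderID` and the choice to query the first endpoint are assumptions, not the PR's code:

```go
// A minimal sketch, assuming a go.etcd.io/etcd/clientv3 client; leaderID
// and the single-endpoint query are assumptions.
package etcdutil

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/clientv3" // etcd v3.4 import path
)

// leaderID asks one endpoint for cluster status and returns the member ID
// of the current etcd leader.
func leaderID(ctx context.Context, cli *clientv3.Client) (uint64, error) {
	eps := cli.Endpoints()
	if len(eps) == 0 {
		return 0, fmt.Errorf("etcd client has no endpoints")
	}
	st, err := cli.Status(ctx, eps[0])
	if err != nil {
		return 0, err
	}
	return st.Leader, nil
}
```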