🐛 Change the number of expected etcd members #2696
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chuckha. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I'm checking this out locally to test now.
/milestone v0.3.2
I commented out the "etcd member removed" annotation apply step to simulate the failure we experienced, and wanted to see it loop forever on re-removing the same etcd member. What I saw instead was a different health check failure:
That node happens to be the one that got member remove'd:
The good news is that it's no longer complaining about there being a mismatch in the number, though! How should we have the
I'm tempted to suggest we check if the pod is ready in that loop and skip over it if not. It doesn't look like we're really waiting for etcd to be healthy again before deleting the control plane machine:
It looks like a log line "Upgrading control plane" followed by "waiting for control plane to pass control plane health check before removing a control plane machine" is how the KCP reconciler spells "successfully removed an etcd member". Also, looking more carefully now, I would expect this health check change to also remove the need for the annotation at all – it looks like we're counting on early returns from healthcheck failures to sequence the upgrade steps.
This was also my first instinct. Just need to read a little more code to convince myself.
It makes me a little nervous in general, because I'm worried that means we'll give a pass to an etcd ring that has e.g. a single member who just got oomkilled. But hopefully we'll catch that by comparing against the
Reading through the code a little more, a few observations:
wdyt?
This requires more restructuring elsewhere. I think this could be a nice improvement but it's something a little more invasive than what is going on here. Want to open an issue with your thoughts on the matter?
Sure yeah, let's wrap this up for now. Is there any other change that needs to happen in this PR?
Yep. I'm adding a test for this and have some changes locally that need to be made.
@@ -18,173 +18,41 @@ package fake
This file was out of date since it was unused entirely. I updated it so I could fake etcd responses to simulate the situations necessary for the test.
/test pull-cluster-api-make o.o....compiles on my machine.....?
},
},
},
etcdClientGenerator: &fakeEtcdClientGenerator{
This isn't what I want it to look like, but I'm working with the existing structs in order to finish this up.
Oh, issue with rebase on master; a bunch of stuff changed 🙃
Signed-off-by: Chuck Ha <[email protected]>
/retest
o7
sethp-nr test bot reports that with these two additional PRs:
I was able to successfully upgrade a KCP even without letting the reconciler add the "etcd member removed" annotation! /lgtm
Signed-off-by: Chuck Ha [email protected]
What this PR does / why we need it:
This PR allows the control plane to upgrade through an expected state in which there are 4 control plane machines but only 3 etcd members. This happens when the control plane controller dies after removing an etcd member but before marking it as removed. It's one step closer to removing the annotations that manage state.
Some slightly larger changes due to fixing a stale fake etcd client that went unused in tests.
Adds a test for the particular case we are trying to fix.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2651
/cc @sethp-nr @rudoi
/assign @vincepri