
🐛 Change the number of expected etcd members #2696

Merged
merged 1 commit into kubernetes-sigs:master from the etcd-health-check branch on Mar 18, 2020

Conversation

chuckha
Contributor

@chuckha chuckha commented Mar 17, 2020

Signed-off-by: Chuck Ha [email protected]

What this PR does / why we need it:
This PR allows the control plane to upgrade through an expected state where there are 4 control planes and 3 etcd members. This happens when the control plane controller dies after removing a member but not marking it as having been removed. It's one step closer to removing the annotations that manage state.
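For illustration, a minimal sketch of the tolerance being described; the function name and exact comparison are assumptions, not the code in this PR:

```go
// etcdMemberCountAcceptable is a hypothetical helper illustrating the idea:
// during an upgrade the controller may observe one fewer etcd member than
// control plane machines, because a member can be removed before the
// corresponding machine is deleted.
func etcdMemberCountAcceptable(controlPlaneMachines, etcdMembers int) bool {
	return etcdMembers == controlPlaneMachines || etcdMembers == controlPlaneMachines-1
}
```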

Some slightly larger changes due to fixing a stale fake etcd client that went unused in tests.

Adds a test for the particular case we are trying to fix.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2651

/cc @sethp-nr @rudoi
/assign @vincepri

@k8s-ci-robot k8s-ci-robot requested review from rudoi and sethp-nr March 17, 2020 18:33
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 17, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 17, 2020
@sethp-nr
Contributor

I'm checking this out locally to test now.

@chuckha chuckha force-pushed the etcd-health-check branch 2 times, most recently from 2e5f9f6 to fa4574d on March 17, 2020 18:40
@vincepri
Member

/milestone v0.3.2

@k8s-ci-robot k8s-ci-robot added this to the v0.3.2 milestone Mar 17, 2020
@sethp-nr
Contributor

sethp-nr commented Mar 17, 2020

I commented out the "etcd member removed" annotation apply step to simulate the failure we experienced, and wanted to see it loop forever on re-removing the same etcd member. What I saw instead was a different health check failure:

E0317 19:40:17.837043 39 kubeadm_control_plane_controller.go:537] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="node "ip-10-0-122-121.ec2.internal": failed to create etcd client: unable to create etcd client: context deadline exceeded" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

That node happens to be the one that got member remove'd:

etcd-ip-10-0-122-121.ec2.internal                      0/1     CrashLoopBackOff   6          16m
etcd-ip-10-0-23-80.ec2.internal                        1/1     Running            0          6m29s
etcd-ip-10-0-36-60.ec2.internal                        1/1     Running            0          11m
etcd-ip-10-0-8-103.ec2.internal                        1/1     Running            0          19m

The good news is that it's no longer complaining about there being a mismatch in the number, though!

How should we have the for _, node := range controlPlaneNodes.Items { loop skip the remove'd member in this case?

@sethp-nr
Contributor

I'm tempted to suggest we check if the pod is ready in that loop and skip over it if not. It doesn't look like we're really waiting for etcd to be healthy again before deleting the control plane machine:

I0317 19:37:52.065804 39 kubeadm_control_plane_controller.go:253] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0317 19:37:57.127608 39 kubeadm_control_plane_controller.go:537] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="control plane machine test/test-controlplane-pq76m has no status.nodeRef" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
I0317 19:38:07.514075 39 kubeadm_control_plane_controller.go:253] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0317 19:38:18.326999 39 kubeadm_control_plane_controller.go:552] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass control plane health check before removing a control plane machine" "error"="node "ip-10-0-23-80.ec2.internal": pods "kube-apiserver-ip-10-0-23-80.ec2.internal" not found" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

It looks like a log line "Upgrading Control Plane" followed by "waiting for control plane to pass control plane health check before removing a control plane machine" is how the KCP reconciler spells "successfully removed an etcd member."

Also, looking more carefully now, I would expect this health check change to also remove the need for the annotation at all – it looks like we're counting on early returns from healthcheck failures to sequence the upgrade steps.
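A rough sketch of the pod-readiness skip suggested above, assuming standard corev1 pod types; the helper names are illustrative, not the actual KCP code:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// podIsReady reports whether a pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// readyEtcdPods filters out etcd pods that are not Ready, so a member that has
// already been removed from the ring (and is crash-looping) does not fail the
// health check for every remaining member.
func readyEtcdPods(pods []corev1.Pod) []corev1.Pod {
	ready := make([]corev1.Pod, 0, len(pods))
	for i := range pods {
		if podIsReady(&pods[i]) {
			ready = append(ready, pods[i])
		}
	}
	return ready
}
```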

@chuckha
Contributor Author

chuckha commented Mar 17, 2020

I'm tempted to suggest we check if the pod is ready in that loop and skip over it if not. It doesn't look like we're really waiting for etcd to be healthy again before deleting the control plane machine:

This was also my first instinct. I just need to read a little more code to convince myself.

@sethp-nr
Contributor

It makes me a little nervous in general, because I'm worried that means we'll give a pass to an etcd ring that has e.g. a single member who just got oomkilled. But hopefully we'll catch that by comparing against the knownMemberIDSet.
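For example, a hedged sketch of that kind of member-ID cross-check; the names here are assumptions about how such a comparison could look, not the controller's actual bookkeeping:

```go
// unexpectedMembers is an illustrative helper: it returns the member IDs
// reported by etcd that are not in the set the controller already knows about,
// flagging a ring that has drifted from the expected shape.
func unexpectedMembers(reported []uint64, known map[uint64]struct{}) []uint64 {
	var unexpected []uint64
	for _, id := range reported {
		if _, ok := known[id]; !ok {
			unexpected = append(unexpected, id)
		}
	}
	return unexpected
}
```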

@vincepri
Member

vincepri commented Mar 17, 2020

Reading through the code a little more, a few observations:

  • The Workload.EtcdIsHealthy method should probably only return a HealthCheckResult without making assumptions about whether a node is healthy or not; this method seems more geared towards "check the health of each etcd member and report everything back".
  • Callers of this method can use the result to make their own decisions as to whether to proceed or not. For example, in the case of scale down, we can ignore the member currently in the process of being deleted, as was suggested above.
  • The HealthCheckResult type can offer helper methods with pre-defined checks that callers can decide upon.
  • Optional: it might be worth adding a poller when getting the etcd client, to give etcd some headroom to recover on its own before we declare it unhealthy.

wdyt?
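A rough sketch of the shape this suggests, including the optional polling idea; the field names and helpers below are assumptions, not the eventual implementation:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// HealthCheckResult maps a node name to the error (if any) hit while checking
// the etcd member on that node; a nil value means the member looked healthy.
type HealthCheckResult map[string]error

// HealthyIgnoring reports whether every member is healthy, skipping the named
// node (for example, a member that is mid-removal during a scale down).
func (r HealthCheckResult) HealthyIgnoring(skip string) bool {
	for name, err := range r {
		if name == skip {
			continue
		}
		if err != nil {
			return false
		}
	}
	return true
}

// dialWithRetry polls the given connect function for a short window, giving
// etcd some headroom to recover on its own before we declare it unhealthy.
func dialWithRetry(connect func() error) error {
	return wait.PollImmediate(2*time.Second, 20*time.Second, func() (bool, error) {
		if err := connect(); err != nil {
			return false, nil // not reachable yet; keep polling
		}
		return true, nil
	})
}
```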

@chuckha
Contributor Author

chuckha commented Mar 17, 2020

Reading through the code a little more, a few observations:

  • The Workload.EtcdIsHealthy method should probably only return a HealthCheckResult without making assumptions about whether a node is healthy or not; this method seems more geared towards "check the health of each etcd member and report everything back".
  • Callers of this method can use the result to make their own decisions as to whether to proceed or not. For example, in the case of scale down, we can ignore the member currently in the process of being deleted, as was suggested above.
  • The HealthCheckResult type can offer helper methods with pre-defined checks that callers can decide upon.

wdyt?

This requires more restructuring elsewhere. I think this could be a nice improvement but it's something a little more invasive than what is going on here. Want to open an issue with your thoughts on the matter?

@vincepri
Member

Sure yeah, let's wrap this up for now. Is there any other change that needs to happen in this PR?

@chuckha
Contributor Author

chuckha commented Mar 17, 2020

Sure yeah, let's wrap this up for now. Is there any other change that needs to happen in this PR?

Yep. I'm adding a test for this and have some changes locally that need to be made.

@chuckha chuckha force-pushed the etcd-health-check branch from fa4574d to 2772355 on March 17, 2020 21:19
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 17, 2020
@@ -18,173 +18,41 @@ package fake

Contributor Author


This file was out of date since it was entirely unused. I updated it so I could fake etcd responses to simulate the situations necessary for the test.
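A minimal sketch of what such a fake can look like, assuming the etcd clientv3 response types; the struct and field names are illustrative, not the exact ones in this file:

```go
package fake

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// FakeEtcdClient returns canned responses so tests can simulate etcd rings in
// arbitrary states (for example, a ring with one member already removed).
type FakeEtcdClient struct {
	MemberListResponse *clientv3.MemberListResponse
	ErrorResponse      error
}

// MemberList returns the canned member list (or error) configured by the test.
func (f *FakeEtcdClient) MemberList(_ context.Context) (*clientv3.MemberListResponse, error) {
	return f.MemberListResponse, f.ErrorResponse
}
```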

@chuckha
Contributor Author

chuckha commented Mar 17, 2020

/test pull-cluster-api-make

o.o....compiles on my machine.....?

},
},
},
etcdClientGenerator: &fakeEtcdClientGenerator{
Contributor Author


This isn't what I want it to look like, but I'm working with the existing structs in order to finish this up.

@chuckha
Contributor Author

chuckha commented Mar 17, 2020

oh. issue with rebase on master, a bunch of stuff changed 🙃

@chuckha chuckha force-pushed the etcd-health-check branch from 2772355 to 85efa5c on March 17, 2020 21:32
@chuckha
Contributor Author

chuckha commented Mar 17, 2020

/retest

@sethp-nr
Contributor

/retest

o7

@sethp-nr
Contributor

sethp-nr test bot reports that with these two additional PRs:

I was able to successfully upgrade a KCP even without letting the reconciler add the "etcd member removed" annotation!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7157b85 into kubernetes-sigs:master Mar 18, 2020
@chuckha chuckha deleted the etcd-health-check branch March 18, 2020 13:21
Successfully merging this pull request may close these issues: KCP: upgrade gets stuck after removing etcd member (#2651)