🐛 KCP reconcileEtcdMembers should use its own NodeRefs #3964

vincepri · 2020-11-30T20:49:11Z

Signed-off-by: Vince Prignano [email protected]

What this PR does / why we need it:

These changes bring more safety when we reconcile the etcd members for
the given workload cluster.

To perform these changes, some modifications to the internal structs and
interfaces were needed. The etcd client generator now accepts node names
(as []string) instead of corev1.Node(s). This allows us to be more
flexible in how we pass in the list of nodes that we expect the etcd
member list to have.
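As a rough illustration of a generator that takes node names rather than node objects, here is a minimal sketch. The `nodeDialer` type, its `reachable` map, and the `forFirstAvailableNode` method are hypothetical stand-ins written for this example, not the actual KCP code, and no real etcd connection is made.

```go
package main

import (
	"errors"
	"fmt"
)

// etcdClient is a hypothetical stand-in for a connected etcd client.
type etcdClient struct {
	NodeName string
}

// nodeDialer is a hypothetical generator. It only needs node names,
// not full corev1.Node objects, to decide where to connect.
type nodeDialer struct {
	// reachable marks which node names accept connections in this sketch.
	reachable map[string]bool
}

// forFirstAvailableNode returns a client for the first reachable name,
// trying the names in the order the caller supplied them.
func (d *nodeDialer) forFirstAvailableNode(nodeNames []string) (*etcdClient, error) {
	for _, name := range nodeNames {
		if d.reachable[name] {
			return &etcdClient{NodeName: name}, nil
		}
	}
	return nil, errors.New("could not connect to any of the given nodes")
}

func main() {
	d := &nodeDialer{reachable: map[string]bool{"cp-1": true}}
	c, err := d.forFirstAvailableNode([]string{"cp-0", "cp-1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connected via", c.NodeName)
}
```

Because the input is a plain `[]string`, the caller can build the list from any source it trusts, such as Machine NodeRefs, without first listing Node objects from the workload cluster.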

The reconcileEtcdMembers method already waits for all machines to have
NodeRefs set before proceeding. While checking for that, we now also
collect all the node names in a slice and pass that slice to the inner
Workload struct method.
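The wait-and-collect step could look roughly like the sketch below. The `Machine`, `MachineStatus`, and `NodeRef` types are simplified local stand-ins for the real API types, and `nodeNamesFromMachines` is a hypothetical helper name chosen for this example.

```go
package main

import "fmt"

// NodeRef is a simplified stand-in for the object reference stored on a Machine.
type NodeRef struct {
	Name string
}

// MachineStatus and Machine are trimmed-down stand-ins for the real API types.
type MachineStatus struct {
	NodeRef *NodeRef
}

type Machine struct {
	Name   string
	Status MachineStatus
}

// nodeNamesFromMachines returns the node name of every machine.
// It reports false if any machine has no NodeRef yet, in which case
// the caller should skip reconciling etcd members for now.
func nodeNamesFromMachines(machines []Machine) ([]string, bool) {
	names := make([]string, 0, len(machines))
	for _, m := range machines {
		if m.Status.NodeRef == nil {
			return nil, false
		}
		names = append(names, m.Status.NodeRef.Name)
	}
	return names, true
}

func main() {
	machines := []Machine{
		{Name: "m0", Status: MachineStatus{NodeRef: &NodeRef{Name: "node-0"}}},
		{Name: "m1"}, // NodeRef not assigned yet
	}
	if _, ok := nodeNamesFromMachines(machines); !ok {
		fmt.Println("waiting for all machines to have a NodeRef")
	}
}
```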

A NodeRef is assigned by the Machine controller as soon as that
Machine's infrastructure provider exposes the ProviderID. The Machine
controller then compares the ProviderID to the list of nodes available
in the workload cluster, and finally assigns the NodeRef under the
Machine's Status field.
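The matching step described above can be sketched as follows. This is a simplified illustration, not the Machine controller's code: the types are flattened local stand-ins, `assignNodeRef` is a hypothetical helper, and the `example://` provider IDs are made up.

```go
package main

import "fmt"

// Node is a simplified stand-in for a workload cluster node.
type Node struct {
	Name       string
	ProviderID string
}

// NodeRef is a simplified stand-in for the reference stored on a Machine.
type NodeRef struct {
	Name string
}

// Machine is a flattened stand-in; the real type splits these fields
// between the Machine's spec and status.
type Machine struct {
	Name       string
	ProviderID string
	NodeRef    *NodeRef
}

// assignNodeRef sets the NodeRef when a node with the same ProviderID exists.
// It reports whether a match was found.
func assignNodeRef(m *Machine, nodes []Node) bool {
	if m.ProviderID == "" {
		// The infrastructure provider has not exposed the ProviderID yet.
		return false
	}
	for _, n := range nodes {
		if n.ProviderID == m.ProviderID {
			m.NodeRef = &NodeRef{Name: n.Name}
			return true
		}
	}
	return false
}

func main() {
	nodes := []Node{{Name: "node-0", ProviderID: "example://instance-0"}}
	m := &Machine{Name: "m0", ProviderID: "example://instance-0"}
	if assignNodeRef(m, nodes) {
		fmt.Println("NodeRef assigned:", m.NodeRef.Name)
	}
}
```

Note that the match depends only on the ProviderID, not on any label on the Node object.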

If a NodeRef is assigned to a Machine that KCP owns, we know it should
be a control plane machine even if kubeadm join hasn't set the label on
the Node object.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3961

/milestone v0.4.0

fabriziopandini

LGTM from me, with one open question that does not block this PR.

fabriziopandini · 2020-11-30T20:57:33Z

controlplane/kubeadm/internal/workload_cluster_etcd.go

 }

 // ReconcileEtcdMembers iterates over all etcd members and finds members that do not have corresponding nodes.
 // If there are any such members, it deletes them from etcd and removes their nodes from the kubeadm configmap so that kubeadm does not run etcd health checks on them.
-func (w *Workload) ReconcileEtcdMembers(ctx context.Context) ([]string, error) {
-	controlPlaneNodes, err := w.getControlPlaneNodes(ctx)


So, here we are moving away from getControlPlaneNodes because it can give false negatives while kubeadm join is still in progress.
I wonder whether we should also reconsider all the other places where getControlPlaneNodes is called.

We could, but I didn't anticipate any issues in getting the nodes from the cluster for those, given that we only use it to connect to etcd and not as an authoritative list of members that we compare against.

We really need to work on the etcd client generator though; the way it's structured today is really confusing.
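To make the "authoritative list" distinction concrete, here is a minimal sketch of the comparison that the ReconcileEtcdMembers doc comment describes: find etcd members that have no corresponding node. The `membersWithoutNodes` helper is hypothetical, and the sketch assumes that an etcd member's name equals its node's name.

```go
package main

import "fmt"

// membersWithoutNodes returns the etcd member names that do not appear
// in nodeNames. In this sketch nodeNames is the authoritative list, so
// a name missing from it marks the member as a removal candidate.
func membersWithoutNodes(memberNames, nodeNames []string) []string {
	known := make(map[string]struct{}, len(nodeNames))
	for _, n := range nodeNames {
		known[n] = struct{}{}
	}
	var orphaned []string
	for _, m := range memberNames {
		if _, ok := known[m]; !ok {
			orphaned = append(orphaned, m)
		}
	}
	return orphaned
}

func main() {
	members := []string{"cp-0", "cp-1", "cp-2"}
	nodes := []string{"cp-0", "cp-2"}
	fmt.Println("members without nodes:", membersWithoutNodes(members, nodes))
}
```

If `nodeNames` came from a label-based node query and a node had not been labeled yet, its healthy member would show up as orphaned here, which is the false negative raised in the question above.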

vincepri · 2020-11-30T21:11:22Z

/test pull-cluster-api-e2e-full-main

vincepri · 2020-11-30T21:15:24Z

/test pull-cluster-api-test-main

detiber · 2020-11-30T21:24:29Z

These changes lgtm

ncdc · 2020-11-30T21:57:58Z

/lgtm
Looks like we're gonna need to figure out `conversion webhook for cluster.x-k8s.io/v1alpha4, Kind=Cluster failed: the server could not find the requested resource` though.

vincepri · 2020-11-30T22:30:05Z

/test pull-cluster-api-e2e-full-main

vincepri · 2020-12-01T03:51:13Z

/test pull-cluster-api-e2e-full-main

k8s-ci-robot · 2020-12-01T04:08:13Z

@vincepri: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-e2e-full-main | `979197c` | link | `/test pull-cluster-api-e2e-full-main` |

vincepri · 2020-12-01T16:04:03Z

@ncdc Good to approve/merge?

ncdc · 2020-12-01T16:05:12Z

LGTM from me - given the criticality of this issue, let's get someone else to approve?

CecileRobertMichon

/lgtm
/approve

k8s-ci-robot · 2020-12-01T16:15:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added this to the v0.4.0 milestone Nov 30, 2020

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2020

k8s-ci-robot requested review from CecileRobertMichon and JoelSpeed November 30, 2020 20:49

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2020

fabriziopandini reviewed Nov 30, 2020

vincepri force-pushed the reconciler-etcd-members-safe branch from 9011258 to 979197c Compare November 30, 2020 21:08

k8s-ci-robot assigned ncdc Nov 30, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2020

CecileRobertMichon approved these changes Dec 1, 2020

k8s-ci-robot assigned CecileRobertMichon Dec 1, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2020

k8s-ci-robot merged commit 09d34e8 into kubernetes-sigs:master Dec 1, 2020
