🐛 KCP reconcileEtcdMembers should use its own NodeRefs #3964

Merged


vincepri
Member

@vincepri vincepri commented Nov 30, 2020

Signed-off-by: Vince Prignano [email protected]

What this PR does / why we need it:

These changes make it safer to reconcile the etcd members of a given
workload cluster.

To perform these changes, some modifications to the internal structs and
interfaces were needed. The etcd client generator now accepts node names
(as []string) instead of corev1.Node(s). This allows us to be more
flexible in how we pass in the list of nodes that we expect the etcd
member list to have.
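
As a rough illustration of the interface change described above, here is a minimal sketch of a client generator that takes node names as a `[]string`. Every identifier in it (`etcdClient`, `dialFunc`, `generator`, `clientForAnyNode`, `errNoNodes`) is hypothetical and chosen for this example only; it is not the actual KCP code.

```go
package main

import (
	"errors"
	"fmt"
)

// etcdClient is a hypothetical stand-in for a real etcd client.
type etcdClient struct {
	Endpoint string
}

// errNoNodes is returned when the caller passes an empty list of node names.
var errNoNodes = errors.New("no node names provided")

// dialFunc is a hypothetical hook that connects to etcd on the named node.
type dialFunc func(nodeName string) (*etcdClient, error)

// generator is a hypothetical client generator. It only needs node names,
// so callers can build the list from any source (for example, NodeRefs).
type generator struct {
	dial dialFunc
}

// clientForAnyNode returns a client for the first node that can be reached.
func (g generator) clientForAnyNode(nodeNames []string) (*etcdClient, error) {
	if len(nodeNames) == 0 {
		return nil, errNoNodes
	}
	var lastErr error
	for _, name := range nodeNames {
		client, err := g.dial(name)
		if err != nil {
			// Remember the failure and try the next node.
			lastErr = fmt.Errorf("node %q: %w", name, err)
			continue
		}
		return client, nil
	}
	return nil, fmt.Errorf("could not connect to any of %d nodes: %w", len(nodeNames), lastErr)
}
```

Because the generator depends only on strings, the caller decides which nodes are expected to host etcd members.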

The reconcileEtcdMembers method already waits for all machines to have
NodeRefs set before proceeding. While checking for that, we now also
collect all the node names in a slice and pass it to the inner
Workload struct method.
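
The check-and-collect step can be sketched roughly as follows. The types here are simplified, hypothetical stand-ins for the real Machine and object reference types, and `nodeNamesFromMachines` is a name made up for this example.

```go
package main

// objectReference is a hypothetical, trimmed-down stand-in for a
// Kubernetes object reference; only the name matters here.
type objectReference struct {
	Name string
}

// machine is a hypothetical stand-in for a Cluster API Machine.
type machine struct {
	Name    string
	NodeRef *objectReference // nil until the Machine controller sets it
}

// nodeNamesFromMachines returns the node names referenced by the machines.
// It reports false if any machine does not have a NodeRef yet, in which
// case the caller should skip etcd member reconciliation for now.
func nodeNamesFromMachines(machines []machine) ([]string, bool) {
	nodeNames := make([]string, 0, len(machines))
	for _, m := range machines {
		if m.NodeRef == nil {
			return nil, false
		}
		nodeNames = append(nodeNames, m.NodeRef.Name)
	}
	return nodeNames, true
}
```

Returning early on a missing NodeRef keeps the list all-or-nothing, so a partial list is never compared against the etcd member list.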

A NodeRef is assigned by the Machine controller as soon as that
Machine's infrastructure provider exposes the ProviderID. The Machine
controller then compares the ProviderID to the list of nodes available
in the workload cluster and assigns the NodeRef under the
Machine's Status field.
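
For readers unfamiliar with that flow, the ProviderID matching can be pictured with the simplified sketch below. The `node` and `nodeRef` types and the `nodeRefForProviderID` helper are hypothetical and only illustrate the comparison; the real Machine controller does more than this.

```go
package main

// node is a hypothetical stand-in for a workload cluster Node.
type node struct {
	Name       string
	ProviderID string
}

// nodeRef is a hypothetical stand-in for the reference stored in the
// Machine's Status.
type nodeRef struct {
	Name string
}

// nodeRefForProviderID looks for the node whose ProviderID matches the one
// exposed by the infrastructure provider. It reports false when the
// ProviderID is not set yet or no node matches.
func nodeRefForProviderID(providerID string, nodes []node) (*nodeRef, bool) {
	if providerID == "" {
		return nil, false
	}
	for _, n := range nodes {
		if n.ProviderID == providerID {
			return &nodeRef{Name: n.Name}, true
		}
	}
	return nil, false
}
```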

If a NodeRef is assigned to a Machine that KCP owns, we know it should
be a control plane machine even if kubeadm join hasn't set the label on
the Node object.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3961

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Nov 30, 2020
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2020
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2020
Member

@fabriziopandini fabriziopandini left a comment


lgtm for me with an open question not blocking for this PR

}

// ReconcileEtcdMembers iterates over all etcd members and finds members that do not have corresponding nodes.
// If there are any such members, it deletes them from etcd and removes their nodes from the kubeadm configmap so that kubeadm does not run etcd health checks on them.
func (w *Workload) ReconcileEtcdMembers(ctx context.Context) ([]string, error) {
	controlPlaneNodes, err := w.getControlPlaneNodes(ctx)
Member


So, here we are moving away from getControlPlaneNodes because it can give false negatives while kubeadm join is still in progress.
I wonder whether we should also reconsider all the other places where getControlPlaneNodes is called.

Member Author


We could, but I didn't anticipate any issues with getting the nodes from the cluster in those cases, given that we only use the result to connect to etcd and not as an authoritative list of members to compare against.

We really need to work on the etcd client generator, though; its current structure is really confusing.
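
To make the false-negative concern in this thread concrete, here is a small sketch of the comparison between etcd member names and an expected list of node names. `staleMembers` is a hypothetical helper written for this example, not the function from the PR.

```go
package main

// staleMembers returns the etcd member names that have no matching entry in
// nodeNames, preserving the order of memberNames. These are the members a
// reconciler would consider for removal.
func staleMembers(memberNames, nodeNames []string) []string {
	known := make(map[string]struct{}, len(nodeNames))
	for _, n := range nodeNames {
		known[n] = struct{}{}
	}
	var stale []string
	for _, m := range memberNames {
		if _, ok := known[m]; !ok {
			stale = append(stale, m)
		}
	}
	return stale
}
```

If the expected list comes from labeled Nodes and a third node is still in the middle of kubeadm join, that node's etcd member would be flagged as stale. With a list built from NodeRefs, the same member is matched and left alone.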

@vincepri vincepri force-pushed the reconciler-etcd-members-safe branch from 9011258 to 979197c Compare November 30, 2020 21:08
@vincepri
Member Author

/test pull-cluster-api-e2e-full-main

@vincepri
Member Author

/test pull-cluster-api-test-main

@detiber
Member

detiber commented Nov 30, 2020

These changes lgtm

@ncdc
Contributor

ncdc commented Nov 30, 2020

/lgtm
Looks like we're going to need to figure out the `conversion webhook for cluster.x-k8s.io/v1alpha4, Kind=Cluster failed: the server could not find the requested resource` error, though.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2020
@vincepri
Member Author

/test pull-cluster-api-e2e-full-main

@vincepri
Member Author

vincepri commented Dec 1, 2020

/test pull-cluster-api-e2e-full-main

@k8s-ci-robot
Contributor

@vincepri: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-e2e-full-main | 979197c | link | `/test pull-cluster-api-e2e-full-main` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vincepri
Member Author

vincepri commented Dec 1, 2020

@ncdc Good to approve/merge?

@ncdc
Contributor

ncdc commented Dec 1, 2020

LGTM from me - given the criticality of this issue, let's get someone else to approve?

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2020
@k8s-ci-robot k8s-ci-robot merged commit 09d34e8 into kubernetes-sigs:master Dec 1, 2020
Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- lgtm: "Looks good to me", indicates that a PR is ready to be merged.
- size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

KCP race condition between joining new control plane nodes and etcd reconciliation
6 participants