🐛 KCP reconcileEtcdMembers should use its own NodeRefs #3964

Conversation
LGTM from me, with an open question that's not blocking for this PR
```go
// ReconcileEtcdMembers iterates over all etcd members and finds members that do not have corresponding nodes.
// If there are any such members, it deletes them from etcd and removes their nodes from the kubeadm configmap so that kubeadm does not run etcd health checks on them.
func (w *Workload) ReconcileEtcdMembers(ctx context.Context) ([]string, error) {
	controlPlaneNodes, err := w.getControlPlaneNodes(ctx)
```
So, here we are moving away from getControlPlaneNodes because it can give false negatives while kubeadm join is still in progress.
Wondering if we should also reconsider all the other places where getControlPlaneNodes is called.
We could, but I didn't anticipate any issues in getting the nodes from the cluster for those, given that we only use it to connect to etcd and not as an authoritative list of members that we compare against.
We really need to work on the etcd client generator though; it's really confusing how it's structured today.
Force-pushed from 9011258 to 979197c
/test pull-cluster-api-e2e-full-main
/test pull-cluster-api-test-main
These changes lgtm
/lgtm
/test pull-cluster-api-e2e-full-main |
1 similar comment
/test pull-cluster-api-e2e-full-main |
@vincepri: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@ncdc Good to approve/merge?
LGTM from me; given the criticality of this issue, let's get someone else to approve?
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Vince Prignano [email protected]
What this PR does / why we need it:
These changes bring more safety when we reconcile the etcd members for
the given workload cluster.
To perform these changes, some modifications to the internal structs and
interfaces were needed. The etcd client generator now accepts node names
(as []string) instead of corev1.Node objects. This gives us more
flexibility in how we pass in the list of nodes that the etcd member
list is expected to match.
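
Roughly, the interface change has the following shape. This is a minimal sketch: the `Client` and `ClientGenerator` names and the `forNodes` method are illustrative stand-ins, not the exact identifiers in the KCP codebase.

```go
package etcd

import "context"

// Client stands in for the project's etcd client type.
type Client interface {
	Close() error
}

// ClientGenerator sketches the reworked generator: it dials etcd on the
// named control plane nodes, so callers can pass names derived from
// Machine NodeRefs instead of corev1.Node objects fetched from the
// workload cluster.
type ClientGenerator interface {
	// Before (roughly): forNodes(ctx context.Context, nodes []corev1.Node) (Client, error)
	// After: only the node names are needed.
	forNodes(ctx context.Context, nodeNames []string) (Client, error)
}
```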
The reconcileEtcdMembers method already waits for all machines to have
NodeRefs set before proceeding. In addition to that check, we now
collect all the NodeRef names into a slice before passing them to the
inner Workload struct method, as sketched below.
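
A hedged sketch of that pre-check, assuming the v1alpha3 Machine type; the helper name and error handling are illustrative, not the PR's exact code:

```go
package controllers

import (
	"errors"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// nodeRefNames (hypothetical helper) returns the NodeRef names for the
// given machines. If any machine does not have a NodeRef yet, it errors
// so the caller can requeue instead of reconciling etcd members against
// an incomplete list.
func nodeRefNames(machines []*clusterv1.Machine) ([]string, error) {
	names := make([]string, 0, len(machines))
	for _, m := range machines {
		if m.Status.NodeRef == nil {
			return nil, errors.New("waiting for all machines to have NodeRefs")
		}
		names = append(names, m.Status.NodeRef.Name)
	}
	return names, nil
}
```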
A NodeRef is assigned by the Machine controller as soon as that
Machine's infrastructure provider exposes a ProviderID. The Machine
controller then compares the ProviderID against the list of nodes
available in the workload cluster, and finally assigns the NodeRef
under the Machine's Status field.
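
For context, a simplified sketch of that ProviderID-to-Node matching; the function name is hypothetical, and the real Machine controller layers caching, retries, and status patching on top:

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findNodeRef (hypothetical helper) lists the workload cluster's nodes
// and returns an ObjectReference for the one whose spec.providerID
// matches the machine's ProviderID.
func findNodeRef(ctx context.Context, workloadClient client.Client, providerID string) (*corev1.ObjectReference, error) {
	nodes := &corev1.NodeList{}
	if err := workloadClient.List(ctx, nodes); err != nil {
		return nil, err
	}
	for _, node := range nodes.Items {
		if node.Spec.ProviderID == providerID {
			return &corev1.ObjectReference{
				Kind:       "Node",
				APIVersion: corev1.SchemeGroupVersion.String(),
				Name:       node.Name,
				UID:        node.UID,
			}, nil
		}
	}
	return nil, fmt.Errorf("no node found with providerID %q", providerID)
}
```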
If a NodeRef is assigned to a Machine that KCP owns, we know it should
be a control plane machine even if kubeadm join hasn't set the label on
the Node object.
Which issue(s) this PR fixes (will close the issue(s) when PR gets merged):
Fixes #3961
/milestone v0.4.0