🐛 Add node watcher to MachinePool controller #8443
Conversation
if err := r.Client.List(
	context.TODO(),
	machinePoolList,
	append(filters, client.MatchingFields{index.MachinePoolNodeNameField: node.Name})...); err != nil {
I'm not entirely sure how to get indexing to work for MachinePools given there are multiple nodes per MachinePool... The extract func here returns a list of all node names associated with the MachinePool. What should this filter be?
From the documentation it looks like using the filter this way should work, since MatchingFields also supports cache indices.
Also, indexing with multiple "keys" is apparently supported, although it would lack compatibility with the Kubernetes API server (I'm not sure what that means or what the impact would be).
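For readers following along, here is a minimal sketch of how such a multi-key field index could be registered with controller-runtime: the extract function returns one key per node referenced by the MachinePool, so `client.MatchingFields{index.MachinePoolNodeNameField: node.Name}` matches every MachinePool that references that node. The field value and the registration helper name below are assumptions for illustration, not the exact upstream code.

```go
// Sketch only: registering a MachinePool index keyed by node name with
// controller-runtime. The helper name and field value are assumptions.
package index

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// MachinePoolNodeNameField is the index key used with client.MatchingFields.
const MachinePoolNodeNameField = "status.nodeRefs.name"

// ByMachinePoolNode registers a multi-key index: the extract function returns
// one entry per node referenced by the MachinePool, so a lookup by node name
// returns every MachinePool that owns that node.
func ByMachinePoolNode(ctx context.Context, mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &expv1.MachinePool{}, MachinePoolNodeNameField,
		func(o client.Object) []string {
			mp, ok := o.(*expv1.MachinePool)
			if !ok {
				return nil
			}
			keys := make([]string, 0, len(mp.Status.NodeRefs))
			for _, ref := range mp.Status.NodeRefs {
				keys = append(keys, ref.Name)
			}
			return keys
		})
}
```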
Force-pushed from abf0b02 to 571d0d3.
/hold
/cherry-pick release-1.4
@ykakarap: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.
Force-pushed from 571d0d3 to f88d849.
/lgtm
LGTM label has been added. Git tree hash: 1b6f745788bf342e68b9fc5228b83282e81ce7a3
/lgtm
I think this solution makes sense for now, and we can eventually follow up with some improvements like an in-memory map if we figure out that this does not scale well with a growing number of nodes.
Also, with machine pool machines we can probably greatly simplify this and just rely on machine pools watching machines.
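As a reference for how the watch fits together, here is a hedged sketch of wiring a Node watch into the MachinePool controller and mapping Node events back to MachinePools through the node-name index discussed above. The builder call assumes a controller-runtime version that still takes source.Kind, and nodeToMachinePools is a hypothetical name; treat this as an illustration of the approach rather than the PR's exact code.

```go
// Sketch, not the PR's exact wiring: watch Nodes and enqueue the MachinePools
// that reference them, so reconciliation is event-driven instead of requeue-driven.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// MachinePoolReconciler is shown with only the field this sketch needs.
type MachinePoolReconciler struct {
	Client client.Client
}

func (r *MachinePoolReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&expv1.MachinePool{}).
		// Node events (e.g. providerID set, Node becomes Ready) trigger
		// reconciliation of the MachinePools that reference the Node.
		Watches(
			&source.Kind{Type: &corev1.Node{}},
			handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePools),
		).
		Complete(r)
}

// nodeToMachinePools looks up MachinePools via the node-name index and
// returns one reconcile request per match.
func (r *MachinePoolReconciler) nodeToMachinePools(o client.Object) []reconcile.Request {
	node, ok := o.(*corev1.Node)
	if !ok {
		return nil
	}
	machinePools := &expv1.MachinePoolList{}
	// "status.nodeRefs.name" stands in for the index.MachinePoolNodeNameField constant.
	if err := r.Client.List(context.TODO(), machinePools,
		client.MatchingFields{"status.nodeRefs.name": node.Name}); err != nil {
		return nil
	}
	requests := make([]reconcile.Request, 0, len(machinePools.Items))
	for i := range machinePools.Items {
		requests = append(requests, reconcile.Request{
			NamespacedName: client.ObjectKeyFromObject(&machinePools.Items[i]),
		})
	}
	return requests
}
```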
@ykakarap I got a repro of the issue while testing with this branch in the latest run on kubernetes-sigs/cluster-api-provider-azure#3378 so I'm trying to figure out what happened. Edit: I'm adding logs to my branch and going to re-run the tests with it to see what went wrong.
/hold
This is an optimization but doesn't actually fix the issue. See #8462 for the real fix.
/hold cancel
Let's get this in along with the other fix, as it does fix some potential delays (although in practice it doesn't actually fix the observed flake, since we're constantly requeuing when Node provider IDs aren't set: https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_noderef.go#L83).
@@ -80,7 +86,8 @@ func (r *MachinePoolReconciler) reconcileNodeRefs(ctx context.Context, cluster *
 	if err != nil {
 		if err == errNoAvailableNodes {
 			log.Info("Cannot assign NodeRefs to MachinePool, no matching Nodes")
-			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
+			// No need to requeue here. Nodes emit an event that triggers reconciliation.
+			return ctrl.Result{}, nil
Force-pushed from c0ac856 to d8b5fc1.
/lgtm
LGTM label has been added. Git tree hash: 8c9fa0343abfab7f982fab2ecb81f9db1b5be951
/hold
We're not going to include this in v1.4.1 as tests are passing without it; will take more time to validate this one.
/hold cancel
Got 4 passing tests in a row using kubernetes-sigs/cluster-api-provider-azure#3378.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: CecileRobertMichon
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@ykakarap: new pull request created: #8474
/area machinepool
What this PR does / why we need it: Since CAPI v1.4.0, the MachinePool controller drops the node.cluster.x-k8s.io/uninitialized taint from nodes as soon as the providerID gets added to the node by the cloud-controller-manager and CAPI is able to match the node with a MachinePool. However, it is currently not watching Nodes (unlike the Machine controller), which can cause delays in the taint being dropped on MachinePool nodes, which in turn causes the Node to not become schedulable until 10-15 minutes after the node is Ready.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #8442
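For context on the taint behavior described above, here is an illustrative Go sketch of removing the node.cluster.x-k8s.io/uninitialized taint once a Node has been matched to a MachinePool. The helper name and patch flow are assumptions for illustration, not the controller's actual implementation.

```go
// Illustrative sketch: dropping the uninitialized taint so the Node becomes
// schedulable. The taint key comes from the PR description; the helper and
// patch flow here are assumptions, not CAPI's exact code.
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const uninitializedTaintKey = "node.cluster.x-k8s.io/uninitialized"

func removeUninitializedTaint(ctx context.Context, c client.Client, node *corev1.Node) error {
	// Record the unmodified Node as the patch base before mutating it.
	patchBase := client.MergeFrom(node.DeepCopy())

	kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
	for _, t := range node.Spec.Taints {
		if t.Key != uninitializedTaintKey {
			kept = append(kept, t)
		}
	}
	if len(kept) == len(node.Spec.Taints) {
		return nil // taint not present, nothing to do
	}

	node.Spec.Taints = kept
	if err := c.Patch(ctx, node, patchBase); err != nil {
		return fmt.Errorf("failed to remove %q taint from Node %s: %w", uninitializedTaintKey, node.Name, err)
	}
	return nil
}
```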