🐛 Add node watcher to MachinePool controller #8443

CecileRobertMichon · 2023-03-31T22:19:42Z

What this PR does / why we need it: Since CAPI v1.4.0, the MachinePool controller drops the node.cluster.x-k8s.io/uninitialized taint from nodes as soon as the providerID gets added to the node by the cloud-controller-manager and CAPI is able to match the node with a MachinePool. However, it is currently not watching Nodes (unlike the Machine controller) which can cause delays in the taint being dropped on MachinePool nodes, which in turn causes the Node to not become schedulable until 10-15 minutes after the node is Ready.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #8442

CecileRobertMichon · 2023-03-31T22:22:00Z

exp/internal/controllers/machinepool_controller.go

+	if err := r.Client.List(
+		context.TODO(),
+		machinePoolList,
+		append(filters, client.MatchingFields{index.MachinePoolNodeNameField: node.Name})...); err != nil {


I'm not entirely sure how to get indexing to work for MachinePools given there are multiple nodes per MachinePool... The extract func here returns a list of all node names associated with the MachinePool. What should this filter be?

From the documentation looks like this way using the filter should work as MatchingFields also supports cache indices.

And, indexing with multiple "keys" is apparently supported however it would lack compatibility with Kubernetes API server (not sure what the means or what the impact would be).

CecileRobertMichon · 2023-03-31T23:58:22Z

/hold

testing in kubernetes-sigs/cluster-api-provider-azure#3378

exp/controllers/alias.go

ykakarap · 2023-04-01T01:32:21Z

/cherry-pick release-1.4

k8s-infra-cherrypick-robot · 2023-04-01T01:32:23Z

@ykakarap: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ykakarap · 2023-04-02T20:01:50Z

/lgtm

k8s-ci-robot · 2023-04-02T20:01:55Z

LGTM label has been added.

Git tree hash: 1b6f745788bf342e68b9fc5228b83282e81ce7a3

fabriziopandini

/lgtm

I think this solution makes sense for now, and we can eventually follow up with some improvements like an in-memory map if we figure out that this does not scale well with a growing number of nodes.

Also, with machine pool machines we can probably greatly simplify this and just rely on machine pools watching machines

CecileRobertMichon · 2023-04-03T15:59:26Z

@ykakarap I got a repro of the issue while testing with this branch in the latest run on kubernetes-sigs/cluster-api-provider-azure#3378 so I'm trying to figure out what happened.

edit: I'm adding logs to my branch and going to re-run the tests with it to see what went wrong

CecileRobertMichon · 2023-04-03T22:04:37Z

/hold

this is an optimization but doesn't actually fix the issue. See #8462 for the real fix.

CecileRobertMichon · 2023-04-03T22:39:22Z

/hold cancel

let's get this in along with the other fix as it does fix some potential delays (although in practice it doesn't actually fix the observed flake as we're constantly requeuing when Node provider IDs aren't set: https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_noderef.go#L83)

CecileRobertMichon · 2023-04-04T00:03:29Z

exp/internal/controllers/machinepool_controller_noderef.go

@@ -80,7 +86,8 @@ func (r *MachinePoolReconciler) reconcileNodeRefs(ctx context.Context, cluster *
 	if err != nil {
 		if err == errNoAvailableNodes {
 			log.Info("Cannot assign NodeRefs to MachinePool, no matching Nodes")
-			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
+			// No need to requeue here. Nodes emit an event that triggers reconciliation.
+			return ctrl.Result{}, nil


pushed this change to match https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_noderef.go#L81

fabriziopandini

/lgtm

k8s-ci-robot · 2023-04-04T09:43:20Z

LGTM label has been added.

Git tree hash: 8c9fa0343abfab7f982fab2ecb81f9db1b5be951

CecileRobertMichon · 2023-04-04T17:05:00Z

/hold

we're not going to include this in v1.4.1 as tests are passing without it, will take more time to validate this one

CecileRobertMichon · 2023-04-04T23:23:59Z

/hold cancel

Got 4 passing tests in a row using kubernetes-sigs/cluster-api-provider-azure#3378

CecileRobertMichon · 2023-04-04T23:24:17Z

/approve

k8s-ci-robot · 2023-04-04T23:24:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-infra-cherrypick-robot · 2023-04-04T23:36:18Z

@ykakarap: new pull request created: #8474

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

johannesfrey · 2023-06-16T07:50:19Z

/area machinepool

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 31, 2023

k8s-ci-robot requested review from jackfrancis and richardcase March 31, 2023 22:19

CecileRobertMichon commented Mar 31, 2023

View reviewed changes

CecileRobertMichon changed the title ~~[WIP] Add node watcher to MachinePool controller~~ [WIP] 🐛 Add node watcher to MachinePool controller Mar 31, 2023

CecileRobertMichon mentioned this pull request Mar 31, 2023

🐛 fix node label propagation #8427

Merged

CecileRobertMichon force-pushed the watch-mp-nodes branch 2 times, most recently from abf0b02 to 571d0d3 Compare March 31, 2023 23:54

CecileRobertMichon changed the title ~~[WIP] 🐛 Add node watcher to MachinePool controller~~ 🐛 Add node watcher to MachinePool controller Mar 31, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 31, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2023

ykakarap suggested changes Apr 1, 2023

View reviewed changes

exp/controllers/alias.go Show resolved Hide resolved

CecileRobertMichon force-pushed the watch-mp-nodes branch from 571d0d3 to f88d849 Compare April 1, 2023 02:06

k8s-ci-robot assigned ykakarap Apr 2, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2023

fabriziopandini reviewed Apr 3, 2023

View reviewed changes

k8s-ci-robot assigned fabriziopandini Apr 3, 2023

k8s-ci-robot removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 3, 2023

k8s-ci-robot requested a review from fabriziopandini April 4, 2023 00:02

k8s-ci-robot requested a review from ykakarap April 4, 2023 00:02

CecileRobertMichon commented Apr 4, 2023

View reviewed changes

Add node watcher to MachinePool controller

d8b5fc1

CecileRobertMichon force-pushed the watch-mp-nodes branch from c0ac856 to d8b5fc1 Compare April 4, 2023 01:26

fabriziopandini reviewed Apr 4, 2023

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 4, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 4, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2023

k8s-ci-robot merged commit 94f5468 into kubernetes-sigs:main Apr 4, 2023

k8s-ci-robot added this to the v1.5 milestone Apr 4, 2023

k8s-infra-cherrypick-robot mentioned this pull request Apr 4, 2023

[release-1.4] 🐛 Add node watcher to MachinePool controller #8474

Merged

k8s-ci-robot added the area/machinepool Issues or PRs related to machinepools label Jun 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Add node watcher to MachinePool controller #8443

🐛 Add node watcher to MachinePool controller #8443

CecileRobertMichon commented Mar 31, 2023 •

edited

Loading

CecileRobertMichon Mar 31, 2023

ykakarap Apr 1, 2023

CecileRobertMichon commented Mar 31, 2023

ykakarap commented Apr 1, 2023

k8s-infra-cherrypick-robot commented Apr 1, 2023

ykakarap commented Apr 2, 2023

k8s-ci-robot commented Apr 2, 2023

fabriziopandini left a comment

CecileRobertMichon commented Apr 3, 2023 •

edited

Loading

CecileRobertMichon commented Apr 3, 2023

CecileRobertMichon commented Apr 3, 2023

CecileRobertMichon Apr 4, 2023

fabriziopandini left a comment

k8s-ci-robot commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

k8s-ci-robot commented Apr 4, 2023

k8s-infra-cherrypick-robot commented Apr 4, 2023

johannesfrey commented Jun 16, 2023

🐛 Add node watcher to MachinePool controller #8443

🐛 Add node watcher to MachinePool controller #8443

Conversation

CecileRobertMichon commented Mar 31, 2023 • edited Loading

CecileRobertMichon Mar 31, 2023

Choose a reason for hiding this comment

ykakarap Apr 1, 2023

Choose a reason for hiding this comment

CecileRobertMichon commented Mar 31, 2023

ykakarap commented Apr 1, 2023

k8s-infra-cherrypick-robot commented Apr 1, 2023

ykakarap commented Apr 2, 2023

k8s-ci-robot commented Apr 2, 2023

fabriziopandini left a comment

Choose a reason for hiding this comment

CecileRobertMichon commented Apr 3, 2023 • edited Loading

CecileRobertMichon commented Apr 3, 2023

CecileRobertMichon commented Apr 3, 2023

CecileRobertMichon Apr 4, 2023

Choose a reason for hiding this comment

fabriziopandini left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

CecileRobertMichon commented Apr 4, 2023

k8s-ci-robot commented Apr 4, 2023

k8s-infra-cherrypick-robot commented Apr 4, 2023

johannesfrey commented Jun 16, 2023

CecileRobertMichon commented Mar 31, 2023 •

edited

Loading

CecileRobertMichon commented Apr 3, 2023 •

edited

Loading