
🐛 Fix cluster reconciliation predicates #6425

Merged
merged 1 commit into from
May 6, 2022

Conversation

Unix4ever
Contributor

The current implementation of ClusterUpdateUnpaused filters out all
Cluster update events except the case where spec.paused changes from
true to false.

Because no other Cluster update triggers Machines reconciliation,
setting ControlPlaneInitialized to True does not start the workload nodes
watch in the MachinesController.

That leaves cluster deployment stuck and hanging for 15 minutes.

Signed-off-by: Artem Chernyshev [email protected]
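For context, here is a minimal sketch of what a transition-only predicate along the lines of ClusterUpdateUnpaused looks like in controller-runtime terms. This is an illustration, not the upstream code: the function name `clusterUpdateUnpausedSketch` and the hypothetical `sketch` package are ours.

```go
package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// clusterUpdateUnpausedSketch lets a Cluster update event through only when
// spec.paused flips from true to false; every other Cluster update is
// filtered out, which is why setting a condition on the Cluster never
// reaches the Machine controller.
func clusterUpdateUnpausedSketch() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCluster, ok := e.ObjectOld.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			newCluster, ok := e.ObjectNew.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			// Only the paused -> unpaused transition triggers reconciliation.
			return oldCluster.Spec.Paused && !newCluster.Spec.Paused
		},
	}
}
```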

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 19, 2022
@k8s-ci-robot
Contributor

Hi @Unix4ever. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 19, 2022
@Unix4ever Unix4ever changed the title Fix cluster reconciliation predicates 🐛 Fix cluster reconciliation predicates Apr 19, 2022
@Unix4ever Unix4ever changed the title 🐛 Fix cluster reconciliation predicates 🐛 Fix cluster reconciliation predicates Apr 19, 2022
Contributor

@ykakarap ykakarap left a comment


/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2022
@ykakarap
Contributor

As all Cluster updates do not trigger Machines reconciliation,
setting ControlPlaneInitialized to True does not start the workload nodes
watch in MachinesController

ControlPlaneInitialized is no longer part of the Cluster type in the latest API version (v1beta1).
It looks like ControlPlaneInitialized is part of the v1alpha3 Cluster type, which is pretty old.

I am not sure this fix to the main (currently v1beta1) types would address your problem.

@Unix4ever
Contributor Author

Unix4ever commented Apr 21, 2022

I was testing this on v1beta1 and it does fix my problem.
And I was talking about the condition: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_consts.go#L65 , not about the field in status.
I don't see it being deprecated or anything.

And I still see it being set after the control plane resource's initialized flag changes to true.

@ykakarap
Contributor

I was testing this on v1beta1 and it does fix my problem. And I was talking about the condition: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_consts.go#L65 , not about the field in status. I don't see it being deprecated or anything.

And I still see it being set after the control plane resource's initialized flag changes to true.

You are right. My bad, I missed the condition. It is clear that the current behavior does not reconcile machines when changes are made to a cluster that is already unpaused. The changes LGTM; however, I want to get @fabriziopandini's opinion here on two things:

  1. Was there any historical reason for limiting the predicate to only cluster transitions from paused to unpaused, instead of a generic "check if the cluster is unpaused"? Asking because the comments on the function are pretty clear that they only want to cover the transition.

  2. This change alters the current behavior of the function. Since this is an exposed function and part of the code API, is it okay to change this behavior, or should we ideally create a new predicate that checks if the Cluster is unpaused and use that instead?

@sbueringer
Member

sbueringer commented Apr 27, 2022

I think we should not change ClusterUpdateUnpaused, as the intention seems pretty clear that it should only capture update events which explicitly change the paused field from true to false. One level above, in ClusterUnpaused, the intention seems to be similar.

What I'm absolutely not sure about is what the intention in the controllers is.

So, concretely: in the MachineController, do we always want to reconcile on Cluster events if the cluster is not paused, or do we explicitly only want to reconcile when an unpause occurred?

If we always want to reconcile on update events of an unpaused cluster, we should introduce a new predicate (e.g. ClusterNotPaused) and use that accordingly.

At first glance it seems reasonable to do that, but I'm not aware of the history, and as this triggers a lot of additional reconciles we should be really sure.

@vincepri @fabriziopandini WDYT?
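To make the alternative concrete, a hypothetical ClusterNotPaused-style predicate along these lines could look roughly as follows; the name `clusterNotPausedSketch` and the shape are assumptions for illustration in the same hypothetical `sketch` package as above, not an upstream API.

```go
package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// clusterNotPausedSketch accepts any Cluster update as long as the cluster is
// currently unpaused, rather than only the paused -> unpaused transition.
// This is the variant that would generate the additional reconciles
// mentioned above.
func clusterNotPausedSketch() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			c, ok := e.ObjectNew.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			// Any update to an unpaused cluster triggers reconciliation.
			return !c.Spec.Paused
		},
	}
}
```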

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 28, 2022
@Unix4ever
Contributor Author

Unix4ever commented Apr 28, 2022

OK, I've introduced a new predicate and injected it using an OR in the predicates list of machine_controller.go.

So, concretely: in the MachineController, do we always want to reconcile on Cluster events if the cluster is not paused, or do we explicitly only want to reconcile when an unpause occurred?

Quite a while ago I made another attempt to fix this issue in the machine controller, and back then I got the following answer:
#5884 (comment)

Judging from that answer, I think the intention was to react to any cluster updates.
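As a rough illustration of that wiring (the exact code in machine_controller.go may differ), OR-ing predicates with controller-runtime's predicate.Or could look like this, reusing the two hypothetical sketches from earlier in the thread:

```go
package sketch

import (
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// clusterWatchPredicateSketch combines the two hypothetical predicates from
// the sketches above: either the paused -> unpaused transition or any update
// to an unpaused cluster is enough to enqueue Machine reconciliation.
func clusterWatchPredicateSketch() predicate.Predicate {
	return predicate.Or(
		clusterUpdateUnpausedSketch(),
		clusterNotPausedSketch(),
	)
}
```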

@vincepri
Member

Hey folks, just reading through this thread; I was hoping to provide a bit more clarity on why the predicates were built this way.

The Machine controller watches all Cluster objects; when a Cluster has an event, it is fed (regardless of filter) through the watch informers to controller-runtime. A predicate can be used in this scenario to filter out some events and prevent reconciliation.

In the beginning of the project, the Machine controller was watching every event on the Cluster object, regardless of type; this caused lots of frequent reconciliations, which ultimately caused the system to reconcile Machines too often.

The current predicate ClusterUpdateUnpaused is there because we always want to make sure to reconcile each Machine right after the Cluster object's spec.paused field is changed from true to false. This is a common scenario during clusterctl move operations (or backup/restore); it can also happen when performing manual maintenance on a cluster.

Going back to the original problem reported, it sounds like ControlPlaneInitialized on the Cluster object isn't triggering reconciliation of Machines, which causes a delay. Usually though, with machine-based Clusters, ControlPlaneInitialized should be accompanied by events when the Machine itself gets created and/or joins the cluster and status.nodeRef is populated. Is this event not coming through?

@Unix4ever Can you also clarify what actually gets stuck for 15 minutes?
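To show how the pieces described above fit together, here is a rough sketch of a Machine controller registering a secondary watch on Cluster objects with a predicate. The type machineReconcilerSketch, the mapping function clusterToMachines, and the hypothetical `sketch` package are illustrative; the builder calls follow a recent controller-runtime style rather than the exact upstream wiring.

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// machineReconcilerSketch is a stand-in for the Machine controller.
type machineReconcilerSketch struct {
	client.Client
}

// SetupWithManager registers the primary Machine watch plus a secondary watch
// on Cluster objects; the predicate decides which Cluster events actually
// enqueue Machine reconcile requests.
func (r *machineReconcilerSketch) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		Watches(
			&clusterv1.Cluster{},
			handler.EnqueueRequestsFromMapFunc(r.clusterToMachines),
			builder.WithPredicates(clusterWatchPredicateSketch()),
		).
		Complete(r)
}

// Reconcile would reconcile a single Machine; omitted in this sketch.
func (r *machineReconcilerSketch) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// clusterToMachines maps a Cluster event to reconcile requests for the
// Machines belonging to that cluster; the real controller lists Machines by
// the cluster-name label. Left as a stub here.
func (r *machineReconcilerSketch) clusterToMachines(ctx context.Context, o client.Object) []reconcile.Request {
	return nil
}
```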

@smira

smira commented Apr 29, 2022

Going back to the original problem reported, it sounds like ControlPlaneInitialized on the Cluster object isn't triggering reconciliation of Machines, which causes a delay. Usually though, with machine-based Clusters, ControlPlaneInitialized should be accompanied by events when the Machine itself gets created and/or joins the cluster and status.nodeRef is populated. Is this event not coming through?

@vincepri I believe the actual problem is that the MachineController doesn't start watching workload cluster Node resources until the control plane is initialized. So a reconcile is triggered when the provider ID is set, and the Machine never becomes ready. This is a problem only on initial cluster creation: once the MachineController has been triggered at least once, it starts watching workload cluster Nodes and things are back to normal. But with the current predicate, if the MachineController reconciles a Machine before the control plane is initialized, it will be stuck "forever", since no event will trigger a reconciliation while Nodes are not watched. So the predicate could probably be updated to reconcile only on changes to the Cluster's control plane initialized field, to keep the number of reconciles low.

@vincepri
Member

That makes sense, thanks for digging into it more @smira! @Unix4ever Do you have some time to try to add another predicate that allows the Machine controller to reconcile when ControlPlaneInitialized is set to true?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2022
@Unix4ever
Contributor Author

That makes sense, thanks for digging into it more @smira! @Unix4ever Do you have some time to try to add another predicate that allows the Machine controller to reconcile when ControlPlaneInitialized is set to true?

Yeah, updated and tested that it still works.

@Unix4ever Unix4ever force-pushed the fix-cluster-watch branch from 895a44b to a909eec on May 2, 2022 at 19:25
Review threads:
config/default/manager_image_patch.yaml (outdated, resolved)
util/predicates/cluster_predicates.go (outdated, resolved)
util/predicates/cluster_predicates_test.go (resolved)
@Unix4ever Unix4ever force-pushed the fix-cluster-watch branch 2 times, most recently from fc7b9ea to c889680 on May 2, 2022 at 21:28
The current implementation of `ClusterUpdateUnpaused` filters out all
cluster updates except for the case when `spec.paused` is updated from
`true` to `false`.

As no other `Cluster` updates trigger `Machines` reconciliation,
setting `ControlPlaneInitialized` to `True` does not start the workload nodes
watch in `MachinesController`.

That leaves cluster deployment stuck and hanging until some other,
unrelated event triggers that reconciliation.

Introduce a new predicate that triggers reconciliation when the
`ControlPlaneInitialized` condition is set on a cluster.

Signed-off-by: Artem Chernyshev <[email protected]>
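A minimal sketch of a predicate matching this commit message, assuming the ControlPlaneInitializedCondition constant and the helpers from cluster-api's util/conditions package; the actual code added in util/predicates/cluster_predicates.go may differ in detail, and the function name is ours.

```go
package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// clusterControlPlaneInitializedSketch fires on a Cluster update when the
// ControlPlaneInitialized condition transitions to true, so the Machine
// controller gets exactly one extra reconcile per cluster and can start the
// workload cluster node watch.
func clusterControlPlaneInitializedSketch() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCluster, ok := e.ObjectOld.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			newCluster, ok := e.ObjectNew.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			// Only the not-yet-initialized -> initialized transition matters.
			return !conditions.IsTrue(oldCluster, clusterv1.ControlPlaneInitializedCondition) &&
				conditions.IsTrue(newCluster, clusterv1.ControlPlaneInitializedCondition)
		},
	}
}
```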
@Unix4ever Unix4ever force-pushed the fix-cluster-watch branch from c889680 to d882cde on May 2, 2022 at 21:40
Contributor

@ykakarap ykakarap left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2022
@smira

smira commented May 6, 2022

If/when this gets merged, can we please backport it to the 1.1.x branch?

Member

@vincepri vincepri left a comment


/approve
/cherry-pick release-1.1

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0d45662 into kubernetes-sigs:main May 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone May 6, 2022
@vincepri
Member

vincepri commented May 6, 2022

/cherry-pick release-1.1

@k8s-infra-cherrypick-robot

@vincepri: new pull request created: #6488

In response to this:

/cherry-pick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
