🐛 Fix cluster reconcilation predicates #6425
Hi @Unix4ever. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7706fd5 to 276430c.
/ok-to-test
I am not sure if this fix to main (currently on the v1beta1) types would address your problem.
I was testing that on And I still see it being set after controlplane resource
You are right. My bad, I missed the condition. It is clear that the current behavior does not reconcile machines when changes are made to a cluster that is already unpaused. I am LGTM for the changes; however, I want to get @fabriziopandini's opinion here on two things:
I think we should not change What I'm absolutely not sure about is what the intention in the controllers is. Concretely, in the MachineController, do we always want to reconcile on Cluster events if the cluster is not paused, or do we explicitly only want to reconcile if an unpause occurred? If we always want to reconcile on update events of an unpaused cluster, we should introduce a new predicate (e.g. At first glance it seems reasonable to do that, but I'm not aware of the history, and as this triggers a lot of additional reconciles we should be really sure. @vincepri @fabriziopandini WDYT?
Force-pushed from 276430c to d47d726.
Ok, I've introduced a new predicate and injected it using
Quite a long time ago I made another attempt to fix that issue in the machine controller, and back then I got the following answer: Judging from the answer, I think the intention was to react to any cluster update.
Hey folks, just reading through this thread and was hoping to provide a bit more clarity on why the predicates were built this way. The Machine controller watches all Cluster objects; when a Cluster has an event, that event (regardless of filter) is fed through the watch informers to controller-runtime. A predicate can be used in this scenario to filter out some events and prevent reconciliation. In the beginning of the project, the Machine controller was watching every event on the Cluster object, regardless of the type; this caused the system to reconcile Machines too often. The current predicate Going back to the original problem reported, it sounds like that @Unix4ever Can you also clarify what actually gets stuck for 15 minutes?
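To make the filtering described in the comment above concrete, here is a minimal, self-contained sketch in Go. It does not use the real controller-runtime or Cluster API packages: `Cluster`, `UpdateEvent`, `UpdatePredicate`, `updateUnpaused`, and `filter` are hypothetical stand-ins invented for illustration. The assumption, taken from the commit message in this thread, is that the existing predicate lets an update through only when `spec.paused` goes from `true` to `false`.

```go
package main

import "fmt"

// Cluster is a hypothetical, stripped-down stand-in for the real Cluster type.
type Cluster struct {
	Name   string
	Paused bool
}

// UpdateEvent models an update event carrying the old and new object state.
type UpdateEvent struct {
	Old, New Cluster
}

// UpdatePredicate reports whether an update event should trigger a reconcile.
type UpdatePredicate func(UpdateEvent) bool

// updateUnpaused passes only when Paused flips from true to false.
// This mirrors the behavior described in the thread, not the actual source.
func updateUnpaused(e UpdateEvent) bool {
	return e.Old.Paused && !e.New.Paused
}

// filter returns the events that the predicate lets through.
func filter(events []UpdateEvent, p UpdatePredicate) []UpdateEvent {
	var out []UpdateEvent
	for _, e := range events {
		if p(e) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []UpdateEvent{
		// Unpause: passes the predicate.
		{Old: Cluster{Name: "a", Paused: true}, New: Cluster{Name: "a", Paused: false}},
		// Update on an already unpaused cluster: filtered out.
		{Old: Cluster{Name: "a", Paused: false}, New: Cluster{Name: "a", Paused: false}},
		// Pause: filtered out.
		{Old: Cluster{Name: "a", Paused: false}, New: Cluster{Name: "a", Paused: true}},
	}
	passed := filter(events, updateUnpaused)
	fmt.Printf("%d of %d update events trigger a reconcile\n", len(passed), len(events))
}
```

In this model the second event is the interesting one: any change to an already unpaused cluster is dropped, which is the trade-off between fewer reconciles and missed updates that the thread is discussing.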
@vincepri I believe the actual problem is that
That makes sense, thanks for digging into it more @smira! @Unix4ever Do you have some time to try to add another predicate that allows the Machine controller to reconcile when ControlPlaneInitialized is set to true?
Force-pushed from d47d726 to 06eca9f.
Yeah, updated and tested that it still works.
Force-pushed from 06eca9f to 895a44b.
Force-pushed from 895a44b to a909eec.
Force-pushed from fc7b9ea to c889680.
The current implementation of `ClusterUpdateUnpaused` filters out all cluster updates except for the case when `spec.paused` is updated from `true` to `false`. Because no other `Cluster` update triggers `Machines` reconciliation, setting `ControlPlaneInitialized` to `True` does not start the workload nodes watch in `MachinesController`. That leaves the cluster deployment stuck and hanging until some other unrelated event triggers that reconciliation. Introduce a new predicate that triggers reconciliation when the `ControlPlaneInitialized` condition is set on a cluster. Signed-off-by: Artem Chernyshev <[email protected]>
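The commit message above can be illustrated with a small, self-contained Go sketch. It is a simplified model, not the code from this PR: the types, the `controlPlaneInitialized` and `anyOf` helpers, and the representation of conditions as a `map[string]bool` are all assumptions made for illustration. In particular, treating "condition is set" as a transition from not-true to true between the old and new object is a guess at the intent.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in; Conditions maps a condition type
// to whether its status is "True".
type Cluster struct {
	Paused     bool
	Conditions map[string]bool
}

// UpdateEvent models an update event carrying the old and new object state.
type UpdateEvent struct {
	Old, New Cluster
}

// UpdatePredicate reports whether an update event should trigger a reconcile.
type UpdatePredicate func(UpdateEvent) bool

// updateUnpaused passes only when Paused flips from true to false.
func updateUnpaused(e UpdateEvent) bool {
	return e.Old.Paused && !e.New.Paused
}

// controlPlaneInitialized passes when the ControlPlaneInitialized condition
// becomes true on this update (assumed semantics, see the lead-in).
func controlPlaneInitialized(e UpdateEvent) bool {
	const cond = "ControlPlaneInitialized"
	// Reading a missing key (or a nil map) yields false in Go.
	return !e.Old.Conditions[cond] && e.New.Conditions[cond]
}

// anyOf combines predicates: the event passes if at least one predicate passes.
func anyOf(ps ...UpdatePredicate) UpdatePredicate {
	return func(e UpdateEvent) bool {
		for _, p := range ps {
			if p(e) {
				return true
			}
		}
		return false
	}
}

func main() {
	p := anyOf(updateUnpaused, controlPlaneInitialized)

	initialized := UpdateEvent{
		Old: Cluster{},
		New: Cluster{Conditions: map[string]bool{"ControlPlaneInitialized": true}},
	}
	unrelated := UpdateEvent{Old: Cluster{}, New: Cluster{}}

	fmt.Println(p(initialized), p(unrelated)) // prints "true false"
}
```

Combining the two predicates with a logical OR keeps the original unpause behavior while letting the one additional event through, so unrelated Cluster updates are still filtered out.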
Force-pushed from c889680 to d882cde.
/lgtm
If/when this gets merged, can we please backport it to the 1.1.x branch?
/approve
/cherry-pick release-1.1
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vincepri The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/cherry-pick release-1.1
@vincepri: new pull request created: #6488 In response to this:
The current implementation of `ClusterUpdateUnpaused` filters out all cluster updates except for the case when `spec.paused` is updated from `true` to `false`. Because no other `Cluster` update triggers `Machines` reconciliation, setting `ControlPlaneInitialized` to `True` does not start the workload nodes watch in `MachinesController`. That leads to cluster deployment being stuck and hanging for 15 minutes.
Signed-off-by: Artem Chernyshev [email protected]