ROX-17712 prevent unnecessary central updates #1122

ludydoo · 2023-06-23T13:37:35Z

This PR adds logic to prevent fleetshard from updating Centrals when there are no changes, which prevents unnecessary operator reconciliation loops.

SimonBaeumer · 2023-06-26T10:37:09Z

fleetshard/pkg/central/reconciler/reconciler.go

+		// This will prevent unnecessary operator reconciliation loops.
+
+		desiredCentral := existingCentral.DeepCopy()
+		desiredCentral.Spec = *central.Spec.DeepCopy()


What happens if an annotation was updated? It looks to me like changes to the metadata are no longer applied.

The current logic ignores them completely, and just takes the labels and annotations from the existing central:

```go
existingCentral.Spec = central.Spec
if err := r.client.Update(ctx, &existingCentral); err != nil {
	return errors.Wrapf(err, "updating central %s/%s", central.GetNamespace(), central.GetName())
}
```
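On the metadata question, one way to keep label and annotation changes flowing while still starting from the live object is to overlay the desired metadata onto a copy of the existing Central. This is a minimal sketch under assumptions: `Central`, `CentralSpec`, `ObjectMeta` and `buildWouldBeCentral` are hypothetical, simplified stand-ins (not the real `v1alpha1` types), and it only illustrates a possible merge rule, not necessarily the one the PR settled on.

```go
package main

// ObjectMeta is a hypothetical stand-in for the metadata fields relevant here.
type ObjectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// CentralSpec is a hypothetical, heavily simplified Central spec.
type CentralSpec struct {
	Exposure string
	Replicas int
}

// Central is a hypothetical stand-in for the v1alpha1.Central CR.
type Central struct {
	ObjectMeta
	Spec CentralSpec
}

// copyMap returns an independent copy so the input maps are never mutated.
func copyMap(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// buildWouldBeCentral starts from the live object, overlays the desired
// labels and annotations (keys set by other actors are preserved), and
// replaces the spec with the desired one.
func buildWouldBeCentral(existing, desired Central) Central {
	result := Central{
		ObjectMeta: ObjectMeta{
			Labels:      copyMap(existing.Labels),
			Annotations: copyMap(existing.Annotations),
		},
		Spec: desired.Spec,
	}
	for k, v := range desired.Labels {
		result.Labels[k] = v
	}
	for k, v := range desired.Annotations {
		result.Annotations[k] = v
	}
	return result
}
```

With this rule a key removed from the desired state is not removed from the live object; whether that is acceptable is a design choice the sketch does not settle.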

SimonBaeumer · 2023-06-26T10:40:03Z

fleetshard/pkg/central/reconciler/reconciler.go

 		}
 	}

 	return nil
 }

+func printCentralDiff(desired, actual *v1alpha1.Central) {


Do you think the log verbosity could be a problem?
When ~50 Centrals are reconciled during an upgrade, this log could get very verbose, since it logs every Central, including those without any errors.
Could the diff be printed conditionally instead of always?

Hey, yes, I've added a feature flag for enabling the diffs.

SimonBaeumer · 2023-06-26T13:41:11Z

fleetshard/pkg/central/reconciler/reconciler.go

+		printCentralDiff(wouldBeCentral, &existingCentral)
+
+		updatedCentral := existingCentral.DeepCopy()
+		updatedCentral.Spec = *central.Spec.DeepCopy()


Would it be possible to build a hash over the labels/annotations + spec and use the computed hash to detect if an update is necessary?
Not useful for printing a diff though.
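For reference, the hash-based alternative could look roughly like this: a SHA-256 over the JSON encoding of labels, annotations and spec only. The types and `centralHash` are hypothetical stand-ins. It relies on `encoding/json` writing map keys in sorted order, so map insertion order does not affect the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// CentralSpec is a hypothetical, simplified Central spec.
type CentralSpec struct {
	Exposure string `json:"exposure"`
	Replicas int    `json:"replicas"`
}

// hashedFields groups exactly the data that should trigger an update;
// status and other server-managed fields are deliberately left out.
type hashedFields struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	Spec        CentralSpec       `json:"spec"`
}

// centralHash returns a hex-encoded SHA-256 over labels, annotations and spec.
func centralHash(labels, annotations map[string]string, spec CentralSpec) (string, error) {
	b, err := json.Marshal(hashedFields{Labels: labels, Annotations: annotations, Spec: spec})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```

Two caveats: a nil map encodes as `null` while an empty map encodes as `{}`, so they hash differently; and a hash computed only from the locally built object cannot see defaults or webhook mutations the API server would apply.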

It would be possible, but IMHO it's a less robust approach.

Why is it less robust? 🤔

My take on it is that the apiserver must be treated as a black box. It might have various webhooks and other defaulting strategies that are impossible to account for, so the existingCentral might differ from the central because of apiserver side effects. By using a dry run, those side effects are accounted for.
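To illustrate how a local comparison can report a change that isn't real: suppose the API server fills in defaults for fields fleetshard leaves empty. `applyServerDefaults` below is a hypothetical stand-in for that server-side behavior; the real defaults would come from the CRD schema and any mutating webhooks, and the field names and default values are invented for the example.

```go
package main

import "reflect"

// CentralSpec is a hypothetical, simplified Central spec.
type CentralSpec struct {
	Exposure string
	Replicas int
}

// applyServerDefaults is a hypothetical stand-in for what the API server does
// on write: CRD defaulting and mutating webhooks may fill in empty fields.
func applyServerDefaults(spec CentralSpec) CentralSpec {
	if spec.Exposure == "" {
		spec.Exposure = "none"
	}
	if spec.Replicas == 0 {
		spec.Replicas = 1
	}
	return spec
}

// naiveNeedsUpdate compares the locally rendered spec with the stored one.
func naiveNeedsUpdate(desired, stored CentralSpec) bool {
	return !reflect.DeepEqual(desired, stored)
}

// defaultedNeedsUpdate compares what the stored object would look like after
// the server processed the desired spec, i.e. what a dry run would return.
func defaultedNeedsUpdate(desired, stored CentralSpec) bool {
	return !reflect.DeepEqual(applyServerDefaults(desired), stored)
}
```

Re-implementing the defaults client-side like this would drift whenever the CRD or webhooks change, which is the robustness argument for asking the API server itself through a dry run.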

Yes, I got the point about dryRun and webhooks from the API server and agree.
With that being said, I don't see the reason for checking every property instead of building a hash over the same data in this function:
https://github.com/stackrox/acs-fleet-manager/pull/1122/files#diff-8b2c412ebc3fa158d46c2dfb0136cfef521abb4fb2dda0a77b403a32afd9c58eR466-R483

Are you suggesting that, instead of DeepCompare, we compare the hash of the would-be Central with the hash of the existing Central? What would be the advantage?

SimonBaeumer · 2023-06-26T13:42:58Z

fleetshard/pkg/central/reconciler/reconciler.go

-		glog.Infof("Update central %s/%s", central.GetNamespace(), central.GetName())
-		existingCentral.Spec = central.Spec
+		// perform a dry run to see if the update would change anything.
+		// This would apply the defaults and the mutating webhooks without actually updating the object.


Is this a problem currently with mutating webhooks?
I don't fully understand this change, even after a second pass. The logic feels redundant with the existing last-Central-hash implementation, and it seems too complicated just to save one unnecessary update.
Which problem is solved here by running a dry run? 🤔

The dry run sends the update through the API server, which sets all default fields and applies any mutating/validating webhooks, without persisting the update. The returned object looks as if the update had actually been applied. Only then is it safe to compare the desired state (after a "would-be" update) with the current state.

The problem with the Central hash is that it includes fields that don't affect the actual Central CR, such as the status. So fleetshard will still update the Central CR even when it doesn't actually need to change.

Also, every time fleetshard does so, it increments a "revision" annotation. It turns out that in 95% of cases, the revision is the only thing fleetshard updates, without modifying any other field on the CR. This creates a lot of traffic for the operator. It is especially visible in the first minutes after a Central is deployed, when as many as 20-40 unnecessary reconciliation loops (without actual changes to the CR) can be triggered this way.
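The churn described above can be modeled by counting writes over repeated reconcile loops with an unchanged desired spec. This is a toy model, not fleetshard code: `alwaysUpdate`, `updateIfChanged`, `countWrites` and the annotation key `example.com/revision` are all hypothetical. The point it shows is that the comparison has to happen before the revision is bumped, otherwise the bump itself guarantees a difference.

```go
package main

import (
	"reflect"
	"strconv"
)

// Central is a hypothetical, flattened stand-in for the CR.
type Central struct {
	Annotations map[string]string
	Spec        map[string]string
}

// revisionKey is a hypothetical annotation key.
const revisionKey = "example.com/revision"

func clone(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// alwaysUpdate models the old behavior: every reconcile bumps the revision
// and writes, so the operator sees a change every time.
func alwaysUpdate(stored Central, desiredSpec map[string]string) (Central, bool) {
	next := Central{Annotations: clone(stored.Annotations), Spec: clone(desiredSpec)}
	rev, _ := strconv.Atoi(stored.Annotations[revisionKey])
	next.Annotations[revisionKey] = strconv.Itoa(rev + 1)
	return next, true
}

// updateIfChanged compares first and only bumps the revision on a real change.
func updateIfChanged(stored Central, desiredSpec map[string]string) (Central, bool) {
	if reflect.DeepEqual(stored.Spec, desiredSpec) {
		return stored, false
	}
	return alwaysUpdate(stored, desiredSpec)
}

// countWrites runs n reconcile loops and counts how many produced a write
// (each write would trigger one operator reconciliation).
func countWrites(n int, step func(Central, map[string]string) (Central, bool), start Central, desiredSpec map[string]string) int {
	writes := 0
	cur := start
	for i := 0; i < n; i++ {
		next, wrote := step(cur, desiredSpec)
		if wrote {
			writes++
		}
		cur = next
	}
	return writes
}
```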

So basically, this logic checks whether the Central will change as a result of the update. The dry run is there to account for any side effects applied either by the apiserver or by webhooks.
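The final check then only needs to look at the fields fleetshard owns. A minimal sketch, assuming (as the review discussion suggests) that labels, annotations and spec are what matter, while status and resource version are ignored. `Central`, `equalMaps` and `centralChanged` are hypothetical. `equalMaps` treats nil and empty maps as equal; `reflect.DeepEqual` does not, which could produce a spurious "changed" result depending on how the objects were decoded.

```go
package main

// Central is a hypothetical, flattened stand-in for the CR.
type Central struct {
	ResourceVersion string
	Labels          map[string]string
	Annotations     map[string]string
	Spec            map[string]string
	Status          map[string]string
}

// equalMaps reports whether two maps hold the same key/value pairs,
// treating nil and empty maps as equal.
func equalMaps(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

// centralChanged reports whether the would-be object (e.g. from a dry run)
// differs from the live object in any field fleetshard manages. Status and
// ResourceVersion are owned by the operator/server and are ignored.
func centralChanged(wouldBe, existing Central) bool {
	return !equalMaps(wouldBe.Labels, existing.Labels) ||
		!equalMaps(wouldBe.Annotations, existing.Annotations) ||
		!equalMaps(wouldBe.Spec, existing.Spec)
}
```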

https://kubernetes.io/blog/2019/01/14/apiserver-dry-run-and-kubectl-diff/#kubectl-diff

This is also how the helm operator detects if resources are changed or not, and if an upgrade needs to be performed or not. https://github.com/operator-framework/helm-operator-plugins/blob/c16a400954f7e43bb987b196e8ecf2f4d2d4ab0f/pkg/reconciler/reconciler.go#L720

openshift-ci · 2023-06-28T10:27:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ludydoo, SimonBaeumer


Needs approval from an approver in each of these files:

~~OWNERS~~ [SimonBaeumer]


ROX-17712 prevent unnecessary central updates

79ca07d

ludydoo requested a review from SimonBaeumer June 23, 2023 13:37

ludydoo temporarily deployed to development June 23, 2023 13:37 — with GitHub Actions Inactive

ROX-17712 add some warn logging

ff7c4a6

ludydoo temporarily deployed to development June 23, 2023 13:42 — with GitHub Actions Inactive

ludydoo requested a review from kovayur June 23, 2023 13:46

SimonBaeumer requested changes Jun 26, 2023


openshift-ci bot assigned SimonBaeumer Jun 26, 2023

ROX-17712 PR Comments

37dc57b

ludydoo temporarily deployed to development June 26, 2023 13:03 — with GitHub Actions Inactive

ludydoo requested a review from SimonBaeumer June 26, 2023 13:24

SimonBaeumer requested changes Jun 26, 2023


ludydoo requested a review from SimonBaeumer June 27, 2023 06:18

SimonBaeumer approved these changes Jun 27, 2023


openshift-ci bot added lgtm approved labels Jun 27, 2023

ROX-17712 Add tests and annotation/label handling

0594056

openshift-ci bot removed the lgtm label Jun 27, 2023

ludydoo temporarily deployed to development June 27, 2023 08:55 — with GitHub Actions Inactive

ROX-17712 Remove todo

9cc3f35

ludydoo temporarily deployed to development June 27, 2023 09:10 — with GitHub Actions Inactive

SimonBaeumer approved these changes Jun 28, 2023


openshift-ci bot added the lgtm label Jun 28, 2023

ludydoo merged commit 6b0bc74 into main Jun 29, 2023

ludydoo deleted the ROX-17712-prevent-unnecessary-central-updates branch June 29, 2023 14:01
