ROX-17712 prevent unnecessary central updates #1122

Merged
merged 5 commits into from
Jun 29, 2023

Conversation

ludydoo
Collaborator

@ludydoo ludydoo commented Jun 23, 2023

This PR adds logic to prevent fleetshard from updating Centrals when there are no changes, which prevents unnecessary operator reconciliation loops.

@ludydoo ludydoo requested a review from SimonBaeumer June 23, 2023 13:37
@ludydoo ludydoo temporarily deployed to development June 23, 2023 13:37 — with GitHub Actions Inactive
@ludydoo ludydoo temporarily deployed to development June 23, 2023 13:42 — with GitHub Actions Inactive
@ludydoo ludydoo requested a review from kovayur June 23, 2023 13:46
// This will prevent unnecessary operator reconciliation loops.

desiredCentral := existingCentral.DeepCopy()
desiredCentral.Spec = *central.Spec.DeepCopy()
Member

What happens if an annotation was updated? It looks to me like changes to the metadata are no longer applied.

Collaborator Author

The current logic ignores them completely, and just takes the labels and annotations from the existing central:

existingCentral.Spec = central.Spec
if err := r.client.Update(ctx, &existingCentral); err != nil {
	return errors.Wrapf(err, "updating central %s/%s", central.GetNamespace(), central.GetName())
}
}

return nil
}

func printCentralDiff(desired, actual *v1alpha1.Central) {
Member

Do you think the log verbosity could be a problem?
When ~50 Centrals are reconciled during an upgrade, this log could get very verbose, because it logs every Central that has no errors.
Could the diff be printed conditionally rather than always?

Collaborator Author

@ludydoo ludydoo Jun 26, 2023

Hey, yes, I've added a feature flag for enabling the diffs.
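For readers following along, gating the diff behind a flag could look roughly like the sketch below. It is not the code from this PR: the environment variable name `PRINT_CENTRAL_DIFF` and the helpers `diffEnabled` and `maybeDiff` are hypothetical, and the "spec" is reduced to a plain string to keep the example self-contained.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// diffEnabled reports whether diff printing is switched on.
// PRINT_CENTRAL_DIFF is a hypothetical variable name used for illustration only.
func diffEnabled(getenv func(string) string) bool {
	enabled, err := strconv.ParseBool(getenv("PRINT_CENTRAL_DIFF"))
	// An unset or malformed value leaves the diff disabled.
	return err == nil && enabled
}

// maybeDiff returns a diff line only when the flag is enabled and the values differ.
// It returns an empty string otherwise, so quiet reconciles log nothing.
func maybeDiff(enabled bool, name, desired, actual string) string {
	if !enabled || desired == actual {
		return ""
	}
	return fmt.Sprintf("central %s: spec would change from %q to %q", name, actual, desired)
}

func main() {
	enabled := diffEnabled(os.Getenv)
	if line := maybeDiff(enabled, "rhacs/central-a", "v2", "v1"); line != "" {
		fmt.Println(line)
	}
}
```

With the flag off by default, reconciling many Centrals during an upgrade stays quiet, and the diff can be switched on for a single debugging session.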

@ludydoo ludydoo temporarily deployed to development June 26, 2023 13:03 — with GitHub Actions Inactive
@ludydoo ludydoo requested a review from SimonBaeumer June 26, 2023 13:24
printCentralDiff(wouldBeCentral, &existingCentral)

updatedCentral := existingCentral.DeepCopy()
updatedCentral.Spec = *central.Spec.DeepCopy()
Member

Would it be possible to build a hash over the labels/annotations + spec and use the computed hash to detect if an update is necessary?
Not useful for printing a diff though.

Collaborator Author

It would be possible, but IMHO it's a less robust approach.

Member

Why is it less robust? 🤔

Collaborator Author

My take on it is that the apiserver must be thought of as some sort of black box. It might have various webhooks and other defaulting strategies that are impossible to account for. So the existingCentral might differ from the central because of apiserver side effects. A dry run accounts for those side effects.

Member

Yes, I got the point about dryRun and webhooks from the API server and agree.
With that being said, I don't see the reason for checking every property instead of building a hash over the same data in this function:
https://github.com/stackrox/acs-fleet-manager/pull/1122/files#diff-8b2c412ebc3fa158d46c2dfb0136cfef521abb4fb2dda0a77b403a32afd9c58eR466-R483

Collaborator Author

Are you suggesting that instead of DeepCompare, we would compare the hash of the would-be Central and the existing Central? What would be the advantage?
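To make the two options in this thread concrete, here is a minimal sketch that compares a would-be object with the existing one both ways. The `centralView` type and `hashCentral` helper are hypothetical stand-ins, not types from this repository. The sketch relies on `encoding/json` emitting map keys in sorted order, which keeps the hash stable across runs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"reflect"
)

// centralView is a hypothetical stand-in for the parts of a Central CR
// that matter for the comparison: labels, annotations and spec.
type centralView struct {
	Labels      map[string]string      `json:"labels,omitempty"`
	Annotations map[string]string      `json:"annotations,omitempty"`
	Spec        map[string]interface{} `json:"spec,omitempty"`
}

// hashCentral returns a hex-encoded SHA-256 over the JSON encoding of c.
func hashCentral(c centralView) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	existing := centralView{
		Labels:      map[string]string{"app": "central"},
		Annotations: map[string]string{"owner": "fleetshard"},
		Spec:        map[string]interface{}{"exposeRoute": true},
	}
	// The would-be object carries one extra annotation.
	wouldBe := existing
	wouldBe.Annotations = map[string]string{"owner": "fleetshard", "tier": "prod"}

	h1, _ := hashCentral(existing)
	h2, _ := hashCentral(wouldBe)
	fmt.Println("hashes differ:", h1 != h2)
	fmt.Println("deep-equal:", reflect.DeepEqual(existing, wouldBe))
}
```

Both checks answer the same yes/no question when fed the same data. The hash is cheaper to store and compare later, while the field-by-field comparison keeps enough information to say which field changed.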

glog.Infof("Update central %s/%s", central.GetNamespace(), central.GetName())
existingCentral.Spec = central.Spec
// perform a dry run to see if the update would change anything.
// This would apply the defaults and the mutating webhooks without actually updating the object.
Member

Is this a problem currently with mutating webhooks?
I still don't fully understand this change after a second pass. The logic feels redundant with the existing last-Central-hash implementation, and it seems too complicated just to save one unnecessary update.
Which problem is solved here by running a dry run? 🤔

Collaborator Author

The dry run will run the update through the api server, setting all default fields, applying any mutating/validating webhooks, etc., without storing the update. The object returned will be as if it had actually been updated. Only then is it safe to compare the desired state (after a "would-be" update) with the current state.

The problem with the central hash is that it contains fields that don't affect the actual Central CR, such as the status. So the fleetshard will still update the Central CR even if it doesn't actually need to change.

Also, every time fleetshard does so, it increments a "revision" annotation. It turns out that in 95% of cases the "revision" is the only thing fleetshard updates, with no other field on the CR modified. This creates a lot of traffic for the operator. It is especially visible in the first minutes of deploying a Central, where as many as 20-40 unnecessary reconciliation loops (without actual changes to the CR) can be triggered.

So basically this logic checks whether the Central will change as a result of the update. The dry run mitigates any side effects that could be applied either by the apiserver or through webhooks.
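The flow described in this comment can be sketched with a fake API server, as below. This is an illustration under stated assumptions, not the PR's implementation: the `central` type, the `apiServer` interface, `fakeServer`, its `logLevel` defaulting rule and the plain `revision` annotation key are all hypothetical. A real client would send the dry-run request to the Kubernetes API instead.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// central is a hypothetical, simplified stand-in for the Central CR.
type central struct {
	Annotations map[string]string
	Spec        map[string]string
}

func copyMap(m map[string]string) map[string]string {
	if m == nil {
		return nil
	}
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func (c central) deepCopy() central {
	return central{Annotations: copyMap(c.Annotations), Spec: copyMap(c.Spec)}
}

// apiServer is a hypothetical abstraction of the client. Update returns the
// object as the server would store it; with dryRun set, nothing is persisted.
type apiServer interface {
	Update(desired central, dryRun bool) central
}

// fakeServer simulates server-side defaulting (an assumed rule: logLevel defaults to "info").
type fakeServer struct {
	stored central
	writes int
}

func (s *fakeServer) Update(desired central, dryRun bool) central {
	result := desired.deepCopy()
	if result.Spec == nil {
		result.Spec = map[string]string{}
	}
	if result.Spec["logLevel"] == "" {
		result.Spec["logLevel"] = "info"
	}
	if !dryRun {
		s.stored = result.deepCopy()
		s.writes++
	}
	return result
}

// reconcile performs a dry run first and only writes (and bumps the revision)
// when the would-be object differs from the existing one.
func reconcile(api apiServer, existing central, desiredSpec map[string]string) bool {
	desired := existing.deepCopy() // keep existing labels/annotations
	desired.Spec = copyMap(desiredSpec)

	wouldBe := api.Update(desired, true)
	if reflect.DeepEqual(wouldBe, existing) {
		return false // nothing would change: skip the update entirely
	}

	rev, _ := strconv.Atoi(existing.Annotations["revision"])
	if wouldBe.Annotations == nil {
		wouldBe.Annotations = map[string]string{}
	}
	wouldBe.Annotations["revision"] = strconv.Itoa(rev + 1)
	api.Update(wouldBe, false)
	return true
}

func main() {
	existing := central{
		Annotations: map[string]string{"revision": "3"},
		Spec:        map[string]string{"version": "4.0", "logLevel": "info"},
	}
	api := &fakeServer{stored: existing}

	// The desired spec omits logLevel, which the server defaults: no real change.
	fmt.Println(reconcile(api, existing, map[string]string{"version": "4.0"})) // false
	// A version bump is a real change, so the update and revision bump happen.
	fmt.Println(reconcile(api, existing, map[string]string{"version": "4.1"})) // true
}
```

Comparing the raw desired spec with the existing one would report a change in the first call, because the defaulted field is missing from the desired spec. Running the desired object through the server first removes that false positive.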

Collaborator Author

This is also how the helm operator detects if resources are changed or not, and if an upgrade needs to be performed or not. https://github.com/operator-framework/helm-operator-plugins/blob/c16a400954f7e43bb987b196e8ecf2f4d2d4ab0f/pkg/reconciler/reconciler.go#L720

@ludydoo ludydoo requested a review from SimonBaeumer June 27, 2023 06:18
@openshift-ci openshift-ci bot removed the lgtm label Jun 27, 2023
@ludydoo ludydoo temporarily deployed to development June 27, 2023 08:55 — with GitHub Actions Inactive
@ludydoo ludydoo temporarily deployed to development June 27, 2023 09:10 — with GitHub Actions Inactive
@openshift-ci openshift-ci bot added the lgtm label Jun 28, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ludydoo, SimonBaeumer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ludydoo ludydoo merged commit 6b0bc74 into main Jun 29, 2023
@ludydoo ludydoo deleted the ROX-17712-prevent-unnecessary-central-updates branch June 29, 2023 14:01