Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to avoid cache races #837

Merged

Conversation

jpeeler
Copy link

@jpeeler jpeeler commented May 1, 2019

This is essentially a fix for unit tests that I noticed locally. It involves making sure a newly created CSV is not attempted to be retrieved using a lister within the same sync loop.

Obsoletes #814.

As much as I'd like to say this fixes every flake, this might only fix one of them. But I haven't seen any locally with this fix.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 1, 2019
Copy link
Member

@ecordell ecordell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you've been hard at work finding and fixing these flakes but I think we need to rethink how we're writing the tests or some different approach. I'm concerned that this will make our performance problems return and we'll start dealing with flakes in the e2e tests instead

}
// Ensure operator has access to targetnamespaces with cluster RBAC
// (roles/rolebindings are checked for each target namespace in syncCopyCSV)
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure moving this here fixes the unit test flakes, but the reason we needed to move this into the syncCopyCSV in the first place was because of performance issues with large numbers of namespaces (well, not that large - 50 or so).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has been changed to only handle cluster roles and bindings. Do you still think this would be a problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks - I think that makes this fine :)

a.Log.Errorf("operatorgroup '%v' should have non-nil status", operatorGroup.GetName())
return nil
}
for _, ns := range targetNamespaces {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does moving this here fix anything in particular? Don't we call:

  • ensureCSVsInNamespaces
  • ensureTenantRBAC

so this change is basically folding them into one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureTenantRBAC has been modified to not use the lister for retrieving the CSV, which is made possible by copyToNamespace returning the CSV.
Also, related to your other point, this keeps RBAC operations for anything other than all namespaces to stay in the syncCopyCSV loop.

Copy link
Author

@jpeeler jpeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely not want to put this code in before feature freeze, but I still think what I've written is ok. I might still be able to move ensureRBACInTargetNamespace back to syncCopyCSV as I was really just focused on the necessary coupling of ensureCSVsInNamespaces and ensureTenantRBAC (as you pointed out).

}
// Ensure operator has access to targetnamespaces with cluster RBAC
// (roles/rolebindings are checked for each target namespace in syncCopyCSV)
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has been changed to only handle cluster roles and bindings. Do you still think this would be a problem?

a.Log.Errorf("operatorgroup '%v' should have non-nil status", operatorGroup.GetName())
return nil
}
for _, ns := range targetNamespaces {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureTenantRBAC has been modified to not use the lister for retrieving the CSV, which is made possible by copyToNamespace returning the CSV.
Also, related to your other point, this keeps RBAC operations for anything other than all namespaces to stay in the syncCopyCSV loop.

@jpeeler
Copy link
Author

jpeeler commented May 1, 2019

/retest
I didn't see any e2e failures locally, but I'll be slightly concerned if they don't pass this time. (And maybe that speaks for itself...)

@jpeeler
Copy link
Author

jpeeler commented May 3, 2019

/retest
now that #838 has merged, e2e should pass.

@jpeeler
Copy link
Author

jpeeler commented May 3, 2019

/retest
Cluster didn't come up.

@@ -24,7 +25,7 @@ all: test build
test: clean cover.out

unit:
go test $(MOD_FLAGS) -v -race -tags=json1 -count=1 ./pkg/...
go test $(MOD_FLAGS) $(SPECIFIC_UNIT_TEST) -v -race -tags=json1 -count=1 ./pkg/...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

}
// Ensure operator has access to targetnamespaces with cluster RBAC
// (roles/rolebindings are checked for each target namespace in syncCopyCSV)
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks - I think that makes this fine :)

@ecordell
Copy link
Member

ecordell commented May 9, 2019

@jpeeler this needs a rebase, and thanks for explaining these changes to me :)

I'd like to see the tests pass and then play with it in a real cluster just to check some of the performance characteristics, but otherwise it looks good to me.

@ecordell
Copy link
Member

ecordell commented May 9, 2019

/hold

holding until master is open for 4.2 😢

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 9, 2019
@ecordell
Copy link
Member

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2019
@@ -468,7 +468,7 @@ func (a *Operator) ensureTenantRBAC(operatorNamespace, targetNamespace string, c
if !ownerutil.IsOwnedBy(ownedRoleBinding, csv) {
continue
}
_, ok := targetRolesByName[ownedRoleBinding.GetName()]
_, ok := targetRoleBindingsByName[ownedRoleBinding.GetName()]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this change, but it looks right.

@@ -583,7 +591,7 @@ func (a *Operator) copyToNamespace(csv *v1alpha1.ClusterServiceVersion, namespac
fetchedCSV.SetLabels(utillabels.AddLabel(fetchedCSV.GetLabels(), v1alpha1.CopiedLabelKey, csv.GetNamespace()))
// CRs don't support strategic merge patching, but in the future if they do this should be updated to patch
logger.Debug("updating target CSV")
if _, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(namespace).Update(fetchedCSV); err != nil {
if fetchedCSV, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(namespace).Update(fetchedCSV); err != nil {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change and the one below is just getting a more refreshed copy of the CSV. But the way it's written to overwrite the copy from the lister may not be acceptable?

Jeff Peeler added 3 commits May 16, 2019 11:04
This mirrors what's currently done for e2e tests. Although unit test
execution is much faster, being able to specify a single test for
repeated refinement still saves time.
This refactors the implementation to no longer race itself with reading
from the cache, which sometimes would fail. Essentially, the code has
been modified to not create a CSV and then later within the same sync
loop try to look up the same CSV in the lister.

Changes:
-ensureRBACInTargetNamespace was moved from syncCopyCSV to syncClusterServiceVersion
-ensureRBACInTargetNamespace changed to no longer create role/rolebindings, just cluster level RBAC
-ensureTenantRBAC was moved into ensureCSVsInNamespaces
-ensureTenantRBAC was changed to utilize newly created CSV of target namespace instead of retrieving from the cache
This forces TestSyncOperatorGroups to allow the cache to catch up as the
number of resources that are checked here is high.

The additional ignoreCopyError parameter in the tests have been added to
handle the scenario where syncCopyCSV would never be executed in a
production workflow, but for testing it's being called explicitly.
While polling if the CSV is not found in the lister, make sure to return
no error so that the polling continues.
@jpeeler
Copy link
Author

jpeeler commented May 16, 2019

/test e2e-aws-console-olm

3 similar comments
@jpeeler
Copy link
Author

jpeeler commented May 16, 2019

/test e2e-aws-console-olm

@jpeeler
Copy link
Author

jpeeler commented May 17, 2019

/test e2e-aws-console-olm

@jpeeler
Copy link
Author

jpeeler commented May 20, 2019

/test e2e-aws-console-olm

@jpeeler
Copy link
Author

jpeeler commented May 20, 2019

Dear reviewer, please verify ensureTenantRBAC is right. Otherwise, I think everything else is good.

@ecordell
Copy link
Member

/lgtm

🎉

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 20, 2019
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, jpeeler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit f144c0f into operator-framework:master May 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants