-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to avoid cache races #837
Refactor to avoid cache races #837
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you've been hard at work finding and fixing these flakes but I think we need to rethink how we're writing the tests or some different approach. I'm concerned that this will make our performance problems return and we'll start dealing with flakes in the e2e tests instead
} | ||
// Ensure operator has access to targetnamespaces with cluster RBAC | ||
// (roles/rolebindings are checked for each target namespace in syncCopyCSV) | ||
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure moving this here fixes the unit test flakes, but the reason we needed to move this into the syncCopyCSV in the first place was because of performance issues with large numbers of namespaces (well, not that large - 50 or so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has been changed to only handle cluster roles and bindings. Do you still think this would be a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks - I think that makes this fine :)
a.Log.Errorf("operatorgroup '%v' should have non-nil status", operatorGroup.GetName()) | ||
return nil | ||
} | ||
for _, ns := range targetNamespaces { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does moving this here fix anything in particular? Don't we call:
- ensureCSVsInNamespaces
- ensureTenantRBAC
so this change is basically folding them into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensureTenantRBAC has been modified to not use the lister for retrieving the CSV, which is made possible by copyToNamespace returning the CSV.
Also, related to your other point, this keeps RBAC operations for anything other than all namespaces to stay in the syncCopyCSV loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would definitely not want to put this code in before feature freeze, but I still think what I've written is ok. I might still be able to move ensureRBACInTargetNamespace back to syncCopyCSV as I was really just focused on the necessary coupling of ensureCSVsInNamespaces and ensureTenantRBAC (as you pointed out).
} | ||
// Ensure operator has access to targetnamespaces with cluster RBAC | ||
// (roles/rolebindings are checked for each target namespace in syncCopyCSV) | ||
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has been changed to only handle cluster roles and bindings. Do you still think this would be a problem?
a.Log.Errorf("operatorgroup '%v' should have non-nil status", operatorGroup.GetName()) | ||
return nil | ||
} | ||
for _, ns := range targetNamespaces { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensureTenantRBAC has been modified to not use the lister for retrieving the CSV, which is made possible by copyToNamespace returning the CSV.
Also, related to your other point, this keeps RBAC operations for anything other than all namespaces to stay in the syncCopyCSV loop.
/retest |
/retest |
/retest |
@@ -24,7 +25,7 @@ all: test build | |||
test: clean cover.out | |||
|
|||
unit: | |||
go test $(MOD_FLAGS) -v -race -tags=json1 -count=1 ./pkg/... | |||
go test $(MOD_FLAGS) $(SPECIFIC_UNIT_TEST) -v -race -tags=json1 -count=1 ./pkg/... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks!
} | ||
// Ensure operator has access to targetnamespaces with cluster RBAC | ||
// (roles/rolebindings are checked for each target namespace in syncCopyCSV) | ||
if err := a.ensureRBACInTargetNamespace(clusterServiceVersion, operatorGroup); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks - I think that makes this fine :)
@jpeeler this needs a rebase, and thanks for explaining these changes to me :) I'd like to see the tests pass and then play with it in a real cluster just to check some of the performance characteristics, but otherwise it looks good to me. |
/hold holding until master is open for 4.2 😢 |
/hold cancel |
@@ -468,7 +468,7 @@ func (a *Operator) ensureTenantRBAC(operatorNamespace, targetNamespace string, c | |||
if !ownerutil.IsOwnedBy(ownedRoleBinding, csv) { | |||
continue | |||
} | |||
_, ok := targetRolesByName[ownedRoleBinding.GetName()] | |||
_, ok := targetRoleBindingsByName[ownedRoleBinding.GetName()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this change, but it looks right.
@@ -583,7 +591,7 @@ func (a *Operator) copyToNamespace(csv *v1alpha1.ClusterServiceVersion, namespac | |||
fetchedCSV.SetLabels(utillabels.AddLabel(fetchedCSV.GetLabels(), v1alpha1.CopiedLabelKey, csv.GetNamespace())) | |||
// CRs don't support strategic merge patching, but in the future if they do this should be updated to patch | |||
logger.Debug("updating target CSV") | |||
if _, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(namespace).Update(fetchedCSV); err != nil { | |||
if fetchedCSV, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(namespace).Update(fetchedCSV); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change and the one below is just getting a more refreshed copy of the CSV. But the way it's written to overwrite the copy from the lister may not be acceptable?
This mirrors what's currently done for e2e tests. Although unit test execution is much faster, being able to specify a single test for repeated refinement still saves time.
This refactors the implementation to no longer race itself with reading from the cache, which sometimes would fail. Essentially, the code has been modified to not create a CSV and then later within the same sync loop try to look up the same CSV in the lister. Changes: -ensureRBACInTargetNamespace was moved from syncCopyCSV to syncClusterServiceVersion -ensureRBACInTargetNamespace changed to no longer create role/rolebindings, just cluster level RBAC -ensureTenantRBAC was moved into ensureCSVsInNamespaces -ensureTenantRBAC was changed to utilize newly created CSV of target namespace instead of retrieving from the cache
This forces TestSyncOperatorGroups to allow the cache to catch up as the number of resources that are checked here is high. The additional ignoreCopyError parameter in the tests have been added to handle the scenario where syncCopyCSV would never be executed in a production workflow, but for testing it's being called explicitly.
While polling if the CSV is not found in the lister, make sure to return no error so that the polling continues.
/test e2e-aws-console-olm |
3 similar comments
/test e2e-aws-console-olm |
/test e2e-aws-console-olm |
/test e2e-aws-console-olm |
Dear reviewer, please verify ensureTenantRBAC is right. Otherwise, I think everything else is good. |
/lgtm 🎉 |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ecordell, jpeeler The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is essentially a fix for unit tests that I noticed locally. It involves making sure a newly created CSV is not attempted to be retrieved using a lister within the same sync loop.
Obsoletes #814.
As much as I'd like to say this fixes every flake, this might only fix one of them. But I haven't seen any locally with this fix.