-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(bugfix): reduce frequency of update requests for copied CSVs #3411
base: master
Are you sure you want to change the base?
(bugfix): reduce frequency of update requests for copied CSVs #3411
Conversation
by adding annotations to copied CSVs that are populated with hashes of the non-status fields and the status fields. This seems to be how this was intended to work, but was not actually working this way because the annotations never actually existed on the copied CSV. This resulted in a hot loop of update requests being made on all copied CSVs. Signed-off-by: everettraven <[email protected]>
Signed-off-by: everettraven <[email protected]>
Those annotations were set on the CSVs in the informer's cache when the hash comparison was introduced (cf618d3). The observation was that CSV copies needed to be written if and only if the copy was out of sync with the original, so caching a pruned object with a hash was much cheaper than caching complete objects while still being able to stomp CSV copies at the appropriate times. This seems to have regressed as part of #3003. Edit: Breaking the hash comparison would have masked this with the unnecessary update requests, but using a partialobjectmeta informer for CSV copies means that olm-operator doesn't have enough information to determine whether or not a copy is in sync with the original. Even if a hash of the original is included in an annotation when olm-operator writes a copy, if someone else makes a change to a copy's spec, that won't be observable by olm-operator. |
Signed-off-by: everettraven <[email protected]>
@benluddy I think we want to push ahead with this PR for a couple reasons:
|
Signed-off-by: Brett Tofel <[email protected]>
You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original. Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR? |
@benluddy I think what @bentito is trying to say is that we have received no complaints from users about the bug you called out specifically related to writes to a copied CSV, without changing the annotation, would not be reverted. My understand of the current plan, and I may be a tad out of the loop due to hurricane preparations, is:
I agree that reverting #3003 is the right approach to fix both the critical bug and the bug you called out that was introduced by the use of the PartialObjectMetadata informer. There are folks with concerns about reverting that change because we needed the performance gain to enable folks running on the edge to be able to use OLM. Reverting #3003 is probably still the best thing to do and identify a way to still achieve enough performance gain for the edge use cases, but the priority is being placed on resolving the update 100% of the time ASAP and the consensus of the current maintainers seems to be that this PR is what allows us to achieve that and give us some breathing room to re-assess. I know it isn't a great answer, but hopefully this gives some insight as to the decisions the team has made re: continuing with this PR. |
@everettraven I don't think Ben is suggesting that we revert #3003, I think he is suggesting that we don't introduce another regression on top of the previous one as a fix. With this change, I think we are putting ourselves in a situation where we are going to ignore spec changes in the copied csvs, which I'm pretty sure are considered in resolution. |
@kevinrizza The regression is technically caused by #3003 . We just would realize it in this PR because we add the annotations back that prevent updating 100% of the time. That being said, maybe I'm missing that there is a different way to do this? |
Thinking about this a bit more, maybe we can remove the hash annotation and always calculate the hash of the copied CSV and compare that? EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information EDIT EDIT: This could be possible, likely with significant refactoring. I haven't had time to dig into this more. @bentito is taking over this work since I'll be impacted by Hurricane Milton. |
Seems likely that some of the observed memory reductions over caching pruned CSVs might have come from reducing the high watermark transient allocations from decoding a big list response (unlike CSVs, or any CRD-backed resource, a PartialObjectMetaList can be encoded as Protobuf and not only JSON).
Right, without the full object, the copy loop can't see all the fields that might have diverged from the original. The metadata.generation field is automatically incremented for all CRs on updates that change non-metadata fields (https://github.com/kubernetes/kubernetes/blob/d9c46d8ecb1ede9be30545c9803e17682fcc4b50/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/strategy.go#L185) so you might be able to keep track of the last observed generation per copy with only metadata available. The only problem would be that the status subresource is enabled on the CSV CRD, so status updates don't cause metadata.generation to be incremented. |
Description of the change:
Reduces the frequency of update requests for copied CSVs by:
olm.operatorframework.io/nonStatusCopyHash
annotation to the copied CSVs, populated with a hash of all the non-status fields of the original CSV.olm.operatorframework.io/statusCopyHash
annotation to the copied CSVs, populated with a hash of all the status fields of the original CSV.This appears to have been the desired behavior of this function, as evidenced by:
operator-lifecycle-manager/pkg/controller/operators/olm/operatorgroup.go
Lines 829 to 830 in 461018e
The problem with the previous implementation was that these annotations were never set on the copied CSVs.
Motivation for the change:
UPDATE
requests made to the Kubernetes API Server for copied CSVsArchitectural changes:
Testing remarks:
Reviewer Checklist
/doc
[FLAKE]
are truly flaky and have an issue