pkg/cache: optimize peak memory usage during cache build #1281

Conversation

@joelanford (Member) commented Apr 22, 2024

Description of the change:
I extracted a commit from #1278, which can be implemented on its own with our existing caching algorithm. As I mentioned there:

[This commit] changes the way the cache is built. It writes meta objects to a temporary file and records the location of each meta in the file, grouped by package. That way we can later read just the metas for a particular package into memory.

Then we go package by package, building a model, converting it to the package index, and writing API bundles to the cache. The beauty is that only a single package's model is loaded in memory at any given time.
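
As an illustration of that offset-index idea, here is a minimal, self-contained Go sketch. All identifiers here (`metaIndex`, `section`, `add`, `packageMetas`) are invented for this example and are not the PR's actual API; the real change lives in pkg/cache and alpha/declcfg.

```go
package main

import (
	"fmt"
	"os"
)

// section records where one serialized meta blob lives in the temp file.
type section struct {
	offset, length int64
}

// metaIndex appends meta blobs to a single temp file and indexes their
// file sections by package, so one package's metas can be re-read later
// without holding any other package in memory.
type metaIndex struct {
	f       *os.File
	byPkg   map[string][]section
	nextOff int64
}

func newMetaIndex() (*metaIndex, error) {
	f, err := os.CreateTemp("", "cache-metas-")
	if err != nil {
		return nil, err
	}
	return &metaIndex{f: f, byPkg: map[string][]section{}}, nil
}

// add writes one blob and records its location under the owning package.
func (m *metaIndex) add(pkg string, blob []byte) error {
	n, err := m.f.WriteAt(blob, m.nextOff)
	if err != nil {
		return err
	}
	m.byPkg[pkg] = append(m.byPkg[pkg], section{offset: m.nextOff, length: int64(n)})
	m.nextOff += int64(n)
	return nil
}

// packageMetas reads back only the blobs for one package; peak memory is
// bounded by the largest single package rather than the whole catalog.
func (m *metaIndex) packageMetas(pkg string) ([][]byte, error) {
	var out [][]byte
	for _, s := range m.byPkg[pkg] {
		buf := make([]byte, s.length)
		if _, err := m.f.ReadAt(buf, s.offset); err != nil {
			return nil, err
		}
		out = append(out, buf)
	}
	return out, nil
}

func main() {
	idx, err := newMetaIndex()
	if err != nil {
		panic(err)
	}
	defer os.Remove(idx.f.Name())
	if err := idx.add("etcd", []byte(`{"schema":"olm.bundle","package":"etcd"}`)); err != nil {
		panic(err)
	}
	metas, err := idx.packageMetas("etcd")
	if err != nil {
		panic(err)
	}
	fmt.Printf("etcd: %d meta blob(s)\n", len(metas))
}
```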

This may mean that we can stop storing caches in the catalog image and can go back to building them on-the-fly when the container starts!

I noticed that when using an FBC with olm.csv.metadata, startup peak memory and time were basically inconsequential when building a cache on the fly.

In order to maintain cache-build performance, we need to ensure that WalkMetasFS can make use of concurrency in the same way that LoadFS (which the cache builder currently uses) already can. Therefore, the first commit in this PR includes those changes; a sketch of the general pattern follows.
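
As a rough illustration of that pattern, here is a minimal producer/consumer walk over an fs.FS using golang.org/x/sync/errgroup, with one worker per CPU. The function name `walkFiles` and the callback shape are invented for this sketch; they are not the actual WalkMetasFS signature.

```go
package main

import (
	"context"
	"fmt"
	"io/fs"
	"runtime"
	"testing/fstest"

	"golang.org/x/sync/errgroup"
)

// walkFiles fans a directory walk out to runtime.NumCPU() workers, each
// invoking fn on one file at a time. This mirrors the general shape LoadFS
// uses for concurrency; the PR's actual WalkMetasFS changes may differ.
func walkFiles(ctx context.Context, root fs.FS, fn func(path string, data []byte) error) error {
	paths := make(chan string)
	g, ctx := errgroup.WithContext(ctx)

	// Producer: walk the tree and emit regular-file paths.
	g.Go(func() error {
		defer close(paths)
		return fs.WalkDir(root, ".", func(p string, d fs.DirEntry, err error) error {
			if err != nil || d.IsDir() {
				return err
			}
			select {
			case paths <- p:
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	})

	// Consumers: a bounded pool, so at most NumCPU files are in flight.
	for i := 0; i < runtime.NumCPU(); i++ {
		g.Go(func() error {
			for p := range paths {
				data, err := fs.ReadFile(root, p)
				if err != nil {
					return err
				}
				if err := fn(p, data); err != nil {
					return err
				}
			}
			return nil
		})
	}
	return g.Wait()
}

func main() {
	root := fstest.MapFS{
		"catalog.json": {Data: []byte(`{"schema":"olm.package"}`)},
	}
	err := walkFiles(context.Background(), root, func(p string, b []byte) error {
		fmt.Printf("%s: %d bytes\n", p, len(b))
		return nil
	})
	if err != nil {
		fmt.Println("walk failed:", err)
	}
}
```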

Motivation for the change:
There have been numerous issues reported about how finicky pre-built caches are. There are cases where a catalog image with a pre-built cache works correctly on one node, but not another. There are other cases where caches built outside the image and then injected in are mangled enough to throw off the digest calculation. While these cases are likely problems with the specific digest algorithm we use, this could all be avoided if we were able to build the cache on-the-fly.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

openshift-ci bot (Contributor) commented Apr 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
codecov bot commented Apr 22, 2024

Codecov Report

Attention: Patch coverage is 72.72727%, with 45 lines in your changes missing coverage. Please review.

Project coverage is 54.03%. Comparing base (aa0777c) to head (092a36d).

Files                  Patch %   Lines
pkg/cache/json.go      67.90%    18 missing and 8 partials ⚠️
alpha/declcfg/load.go  77.38%    13 missing and 6 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1281   +/-   ##
=======================================
  Coverage   54.02%   54.03%           
=======================================
  Files         108      108           
  Lines       11266    11314   +48     
=======================================
+ Hits         6087     6113   +26     
- Misses       4190     4207   +17     
- Partials      989      994    +5     


@joelanford (Member Author) commented Apr 22, 2024

/hold
I'm going to make this faster by adding concurrency support to WalkMetasFS.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2024
@joelanford joelanford force-pushed the cache-peak-memory-improvement branch from 77c0299 to 092a36d Compare April 22, 2024 15:22
@joelanford (Member Author) commented:

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2024
})

var mapMu sync.Mutex
for i := 0; i < runtime.NumCPU(); i++ {
@joelanford (Member Author) commented on this diff:

This means that data for up to runtime.NumCPU() packages will be in memory at once. I noticed the following when building a cache for the operatorhub catalog on my 10-core Mac M1 Pro (a sketch of the worker pattern in this diff follows the table):

Branch   Catalog      GOMAXPROCS   Peak memory   Duration
master   unmigrated   1            778 MB        15 s
master   unmigrated   unset        731 MB        10 s
master   migrated     1            166 MB        2.7 s
master   migrated     unset        141 MB        2.0 s
PR       unmigrated   1            234 MB        16 s
PR       unmigrated   unset        290 MB        6.6 s
PR       migrated     1            115 MB        3 s
PR       migrated     unset        117 MB        1.39 s
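
For readers unfamiliar with the shape the diff fragment comes from: results of concurrent per-package work are accumulated into a shared map guarded by mapMu, with one worker per CPU. Below is a minimal, self-contained sketch of that pattern; everything except mapMu and the runtime.NumCPU() fan-out is invented for illustration, and the real per-package work (loading metas, building the model, writing API bundles) is reduced to a stub.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	pkgs := []string{"etcd", "prometheus", "strimzi", "argocd"}
	work := make(chan string)

	// results holds one entry per package; guarded by mapMu because all
	// workers write to it concurrently.
	results := map[string]int{}
	var mapMu sync.Mutex

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pkg := range work {
				// Stand-in for "read this package's metas, build its
				// model, and write its cache entries": here we just
				// record the name length.
				size := len(pkg)
				mapMu.Lock()
				results[pkg] = size
				mapMu.Unlock()
			}
		}()
	}
	for _, p := range pkgs {
		work <- p
	}
	close(work)
	wg.Wait()
	fmt.Println(results)
}
```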

@joelanford (Member Author) commented:

I'm going to close this one. I think we should focus on #1278.

@joelanford joelanford closed this Apr 24, 2024
@joelanford joelanford deleted the cache-peak-memory-improvement branch April 24, 2024 13:52
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.