Allow to use a "compressed" (one file) repository format for performance and sustainability purpose #5648

Open
kit-ty-kate opened this issue Aug 30, 2023 · 9 comments


@kit-ty-kate
Member

opam repositories currently use a "one file per package/version" layout, but as the number of packages grows this creates a sustainability problem for people whose filesystems have a low number of inodes (e.g. see #5484) and a performance problem (every file has to be opened on every opam update).

I'm not set on a particular format for that file but it could be the format that opam switch export already uses.
@mseri also suggested using SQLite

@rjbou
Collaborator

rjbou commented Sep 1, 2023

Some data: as of today, the repo concatenated gives a file of 39M and 1 235 506 lines.

@kit-ty-kate
Member Author

Some more data: an xz-compressed repository would be only 2M and takes 0.2s to uncompress on my local machine.
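
For anyone who wants to reproduce this kind of measurement on their own data, here is a minimal Python sketch of building an aggregated file and xz-compressing it. The `aggregate` helper, the `# file:` separator line and the sample package data are assumptions made for illustration; none of them is an existing opam format.

```python
import lzma

def aggregate(opam_files: dict[str, str]) -> bytes:
    """Concatenate opam files into one blob, each preceded by a path marker line."""
    parts = []
    for path in sorted(opam_files):
        # Hypothetical separator so that each file can be located again later.
        parts.append(f"# file: {path}\n{opam_files[path].rstrip()}\n")
    return "".join(parts).encode("utf-8")

# Hypothetical sample data standing in for packages/<name>/<name>.<version>/opam.
sample = {
    f"packages/foo/foo.1.{i}/opam": 'opam-version: "2.0"\nsynopsis: "Example"\n'
    for i in range(200)
}

blob = aggregate(sample)
compressed = lzma.compress(blob, preset=9)
restored = lzma.decompress(compressed)

# Sizes in bytes of the aggregated and the xz-compressed representation.
print(len(blob), len(compressed))
```

The real ratio depends on the actual repository contents, so the sizes printed here say nothing about the 39M and 2M figures above.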

@avsm
Member

avsm commented Sep 4, 2023

One mechanism would be to use normal tar.xz files and use the OCaml libraries to parse them directly without unpacking. That has the benefit of making them easy to create, and there are performance improvements from not having a lot of small files.
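
The idea of parsing an archive directly, without unpacking it to disk, can be sketched as follows. This is an illustration in Python using only the standard `tarfile` module, not the OCaml libraries mentioned above; `build_archive`, `read_opams` and the sample path are hypothetical names.

```python
import io
import tarfile

def build_archive(files: dict[str, bytes]) -> bytes:
    """Build an in-memory .tar.xz archive (stand-in for a downloaded repository)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def read_opams(archive: bytes) -> dict[str, str]:
    """Read every regular file from the archive without writing anything to disk."""
    opams = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:xz") as tar:
        for member in tar:
            if member.isfile():
                opams[member.name] = tar.extractfile(member).read().decode("utf-8")
    return opams

# Hypothetical one-package repository.
archive = build_archive({"packages/foo/foo.1.0/opam": b'opam-version: "2.0"\n'})
opams = read_opams(archive)
print(sorted(opams))
```

No inode is consumed per package here: the only object on the filesystem would be the archive itself.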

@rjbou
Collaborator

rjbou commented Sep 19, 2023

From dev meeting

We can have several formats in the opam repository itself:

  • the historical one: directories and opam files per package,
  • an aggregated file: concatenation of all opam files, human readable,
  • a compressed file: xz-compressed version of the aggregated file.

The repo file can mention what the format of the repo is, but the formats can coexist in a single repo. Opam can understand all those formats (backward compatibility). For API users the change is imperceptible, as the repo-fetching functions remain the same. Opam can also try to retrieve the compressed format first, then fall back to the aggregated format, then fall back to the plain directory one.
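
The fallback order described above can be sketched as follows. This is a Python illustration only: the file names `repo.opam.xz` and `repo.opam`, the `fetch` callback and `load_repository` are assumptions, since nothing in this discussion fixes names or an API.

```python
import lzma
from typing import Callable, Optional

# Hypothetical file names, tried in order of preference.
CANDIDATES = [
    ("compressed", "repo.opam.xz"),
    ("aggregated", "repo.opam"),
]

def load_repository(
    fetch: Callable[[str], Optional[bytes]],
) -> tuple[str, Optional[bytes]]:
    """Try the compressed file, then the aggregated one, else use the directory layout."""
    for fmt, name in CANDIDATES:
        data = fetch(name)
        if data is None:
            continue  # this server does not provide that format
        if fmt == "compressed":
            data = lzma.decompress(data)
        return fmt, data
    # Nothing found: the caller reads the historical one-file-per-package tree.
    return "directory", None

# Fake servers modelled as dicts; dict.get returns None for a missing file.
aggregated = b'opam-version: "2.0"\n'
new_server = {"repo.opam.xz": lzma.compress(aggregated), "repo.opam": aggregated}
old_server = {}

print(load_repository(new_server.get)[0])
print(load_repository(old_server.get)[0])
```

Because callers only see the decoded content, the choice of format stays hidden behind the fetching function, which matches the point about API users.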

These new formats can be served via a webserver, for example by having opam2web generate them. For the main opam repository on GitHub, one solution is to have an alternate branch that serves the aggregated file. It would be automatically updated on each merge.

@c-cube
Contributor

c-cube commented Dec 8, 2023

I have another suggestion: a zip file. It doesn't compress as well as .zst or .xz, but it has the big advantage of being randomly addressable, so you never need to actually unzip the whole thing. If you want foo/foo.1.2/opam you can directly get the corresponding entry without decompressing the rest of the archive.
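
As a small illustration of that random access, here is a Python sketch using the standard `zipfile` module (not camlzip). The archive contents are made up; only the entry name foo/foo.1.2/opam comes from the comment above.

```python
import io
import zipfile

# Build a small in-memory zip with three hypothetical versions of package foo.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for version in ("1.0", "1.1", "1.2"):
        zf.writestr(
            f"foo/foo.{version}/opam",
            f'opam-version: "2.0"\nversion: "{version}"\n',
        )

# The central directory maps names to offsets, so only this entry is inflated.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    entry = zf.read("foo/foo.1.2/opam").decode("utf-8")

print(entry)
```

Each entry is compressed on its own, which is what makes the lookup cheap and also why the overall ratio is worse than compressing the whole archive as one stream.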

@kit-ty-kate
Member Author

xz is also randomly addressable. I use this feature in https://github.com/kit-ty-kate/opam-health-check-ng using the pixz external tool.

@c-cube
Contributor

c-cube commented Dec 8, 2023

Oh I didn't know that!! Zip has the upside of having very mature bindings (camlzip) but xz does compress a lot better. In any case I think it'd accelerate some things a lot.

Another performance issue I've seen is that opam tends to check the state of various switches many, many times in a row.

@dinosaure

dinosaure commented Dec 19, 2023

I'd like to point out that we currently have an opam-mirror implementation as a unikernel that uses tar (as well as zlib/decompress) and allows randomly addressable contents. With your proposal, it would unfortunately be difficult for us to support *.xz in the immediate future (our approach would require a re-implementation of this format in OCaml).

I know this probably implies a regression in the compression ratio but, as @c-cube points out, zip (or even tar) has the advantage of mature support in the OCaml ecosystem (in contrast to xz).

@kit-ty-kate
Member Author

I had a deeper look and I think we can keep the current .tar.gz format and hijack OpamRepositoryConfig.repo_tarring to implement this feature in the simplest way I know (this is still not trivial though).

I implemented a proof-of-concept reader of opam-repository's index.tar.gz using ocaml-tar. A fold over every file in the whole archive takes between 0.5 and 1.5 seconds depending on which checkseum backend you use (1.5 for the OCaml backend and 0.5 for the C backend). Here is the code for the curious: kit-ty-kate/ocaml-tar-playground@0f3b315

The major pain point I could see in the opam code when switching to this is that we currently diff the previous state of the repository against the new one, so if we want to keep doing that we'd need to reimplement diffing between two archives manually.
However, I'm not sure this is useful, so we could take the opportunity to simplify the repository backend code (as in src/repository) and avoid this diff+patch overlay.
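
To give an idea of what diffing two archives manually could involve, here is a rough Python sketch that compares per-file digests of two .tar.gz archives. It is not how opam computes its diff today; `make_tar_gz`, `digests` and `diff_archives` are hypothetical helpers, and the choice of SHA-256 is an assumption.

```python
import hashlib
import io
import tarfile

def make_tar_gz(files: dict[str, bytes]) -> bytes:
    """Build an in-memory .tar.gz archive from a name -> content mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def digests(archive: bytes) -> dict[str, str]:
    """Map each regular file in the archive to the SHA-256 of its content."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return {
            m.name: hashlib.sha256(tar.extractfile(m).read()).hexdigest()
            for m in tar
            if m.isfile()
        }

def diff_archives(old: bytes, new: bytes) -> tuple[list[str], list[str], list[str]]:
    """Return the (added, removed, changed) file names between two archives."""
    before, after = digests(old), digests(new)
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(n for n in before.keys() & after.keys() if before[n] != after[n])
    return added, removed, changed

# Hypothetical repository states before and after an update.
old = make_tar_gz({"a/a.1/opam": b"one", "b/b.1/opam": b"two"})
new = make_tar_gz({"a/a.1/opam": b"one", "b/b.1/opam": b"TWO", "c/c.1/opam": b"three"})
result = diff_archives(old, new)
print(result)
```

This reads both archives in full on every update, which is part of the reason dropping the diff step altogether may be the simpler route.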

Following every use of OpamRepositoryPath.tar and OpamRepositoryConfig.repo_tarring should give a full enough picture to know where to change things. Reading is done in OpamRepositoryState.load_opams_from_dir so this should be the function to change to use the ocaml-tar PoC above.

There is a chance that this issue is required to fix #5741, which is currently slotted for 2.2.0~rc1, so I might bite the bullet and take the time to implement it if no one does so beforehand (if you do, please ping me so we can synchronize).

5 participants