Allow to use a "compressed" (one file) repository format for performance and sustainability purpose #5648

kit-ty-kate · 2023-08-30T17:11:15Z

opam repositories currently use a "one file per package version" layout, but as the number of packages grows this creates a sustainability problem for people whose filesystems have a low number of inodes (e.g. see #5484) and a performance problem (every file has to be opened on every opam update).

I'm not set on a particular format for that file but it could be the format that opam switch export already uses.
@mseri also suggested using SQLite


rjbou · 2023-09-01T12:53:31Z

Some data: today, the concatenated repo gives a file of 39M and 1 235 506 lines.

kit-ty-kate · 2023-09-04T13:05:03Z

Some more data: an xz-compressed repository would be only 2M and takes 0.2s to decompress on my local machine.

avsm · 2023-09-04T14:15:16Z

One mechanism would be to use normal tar.xz files and use the OCaml libraries to parse them directly without unpacking. That has the benefit of making them easy to create, and there are performance improvements from not having a lot of small files.

rjbou · 2023-09-19T16:44:38Z

From dev meeting

We can have several formats in the opam repository itself:

the historical one: directories and opam files per package,
an aggregated file: the concatenation of all opam files, human readable,
a compressed file: an xz-compressed version of the aggregated file.

The repo file can mention what the format of the repo is, but the formats can coexist in a single repo. Opam can understand all of these formats (backward compatibility); for API users the change is imperceptible, as the repo-fetching functions remain the same. Opam can also try to retrieve the compressed format first, then fall back to the aggregated format, then to the plain directory one.

These new formats can be served via a webserver, for example by having opam2web generate them. For the main opam repository on GitHub, one solution is to have an alternate branch that serves the aggregated file. It would be automatically updated on each merge.
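
The fallback order described above could be sketched roughly as follows. This is only an illustration: `repo_format`, `preference`, and `fetch_first` are hypothetical names, not part of opam's API, and `fetch` stands in for whatever download function a backend would provide.

```ocaml
(* Hypothetical sketch: none of these names exist in opam itself. *)
type repo_format = Compressed | Aggregated | Directory

(* Preference order from the meeting notes: compressed first,
   then aggregated, then the historical directory layout. *)
let preference = [ Compressed; Aggregated; Directory ]

(* [fetch fmt] is assumed to return [Some data] when the remote
   serves [fmt] and [None] otherwise. The first format that is
   available wins; [None] means no format could be retrieved. *)
let fetch_first ~fetch =
  List.find_map
    (fun fmt -> Option.map (fun data -> (fmt, data)) (fetch fmt))
    preference
```

A remote that only serves the historical layout would simply answer `None` for the first two formats, so callers would not need to know which format was actually used.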

c-cube · 2023-12-08T15:46:54Z

I have another suggestion: a zip file. It doesn't compress as well as .zst or .xz but it has the big advantage that it's randomly addressable, so you never need to actually unzip it. If you want foo/foo.1.2/opam you can directly get the corresponding entry without decompressing the rest of the archive.

kit-ty-kate · 2023-12-08T19:55:53Z

xz is also randomly addressable. I use this feature in https://github.com/kit-ty-kate/opam-health-check-ng using the pixz external tool.

c-cube · 2023-12-08T20:25:13Z

Oh I didn't know that!! Zip has the upside of having very mature bindings (camlzip) but xz does compress a lot better. In any case I think it'd accelerate some things a lot.

Another performance issue I've seen is that opam tends to check the state of various switches many, many times in a row.

dinosaure · 2023-12-19T12:45:12Z

I'd like to point out that we currently have an opam-mirror implementation as a unikernel that uses tar (as well as zlib/decompress) and allows randomly addressable contents. With your proposal, it would unfortunately be difficult for us to support *.xz in the immediate future (our approach would require a re-implementation of this format in OCaml).

I know this probably implies a regression in the compression ratio but, as @c-cube points out, zip (or even tar) has the advantage of a mature existence in the OCaml ecosystem (in contrast to xz).

kit-ty-kate · 2024-01-09T22:39:02Z

I had a deeper look and I think we can keep the current .tar.gz format and hijack OpamRepositoryConfig.repo_tarring to implement this feature in the simplest way I know (this is still not trivial though).

I implemented a proof-of-concept reader of opam-repository's index.tar.gz using ocaml-tar. A fold over every file in the whole archive takes between 0.5 and 1.5 seconds depending on which checkseum backend you use (1.5 for the OCaml backend and 0.5 for the C backend). Here is the code for the curious: kit-ty-kate/ocaml-tar-playground@0f3b315
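
To give an idea of what folding over a tar archive without unpacking it involves, here is a minimal stdlib-only sketch. It is not the linked proof of concept and does not use ocaml-tar's API. It assumes an already decompressed, plain ustar archive held in memory, with entry names shorter than 100 bytes and no PAX or GNU extension headers. `fold_tar` and its helpers are hypothetical names.

```ocaml
(* Read a NUL-terminated field of [len] bytes at [off] in a tar header. *)
let field buf off len =
  let s = String.sub buf off len in
  match String.index_opt s '\000' with
  | Some i -> String.sub s 0 i
  | None -> s

(* Sizes are stored as octal ASCII digits, possibly space-padded. *)
let octal s =
  let s = String.trim s in
  if s = "" then 0 else int_of_string ("0o" ^ s)

(* Fold [f] over every entry of an uncompressed tar archive.
   Each entry is a 512-byte header (name at offset 0, size at
   offset 124) followed by the data padded to a 512-byte boundary.
   Directory entries show up with empty data. *)
let fold_tar f acc archive =
  let len = String.length archive in
  let rec go acc pos =
    if pos + 512 > len then acc
    else
      let name = field archive pos 100 in
      if name = "" then acc (* zero block: end of archive *)
      else
        let size = octal (field archive (pos + 124) 12) in
        let data = String.sub archive (pos + 512) size in
        let padded = (size + 511) / 512 * 512 in
        go (f acc name data) (pos + 512 + padded)
  in
  go acc 0
```

Nothing is written to disk here, so no inodes are consumed: each opam file is handed to `f` as a string straight out of the archive buffer.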

The major pain point I could see in the opam code when switching to that is that we currently diff the previous state of the repository against the new one, so if we want to keep doing that we'd need to reimplement diffing between two archives manually.
However, I'm not sure this is useful so we could take the opportunity to simplify the repository backends (as in src/repository) code to avoid using this overlay of diff+patch.
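
For reference, if diffing were kept, comparing two archives could be reduced to comparing two maps from entry path to file contents. The sketch below assumes both archives have already been loaded into such maps; `change` and `diff_snapshots` are hypothetical names and this is not how opam's current diff+patch overlay works.

```ocaml
module SMap = Map.Make (String)

type change = Added | Removed | Modified

(* Compare two snapshots mapping an entry path (e.g.
   "packages/foo/foo.1.2/opam") to that file's contents.
   Unchanged entries are left out of the result. *)
let diff_snapshots old_repo new_repo =
  SMap.merge
    (fun _path before after ->
      match before, after with
      | None, Some _ -> Some Added
      | Some _, None -> Some Removed
      | Some a, Some b when not (String.equal a b) -> Some Modified
      | _ -> None)
    old_repo new_repo
```

The result lists only the paths that differ, which is the information an update step would need in order to reload just the affected packages.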

Following every use of OpamRepositoryPath.tar and OpamRepositoryConfig.repo_tarring should give a full enough picture to know where to change things. Reading is done in OpamRepositoryState.load_opams_from_dir so this should be the function to change to use the ocaml-tar PoC above.

There is a chance this current issue is required to fix #5741 which is currently slotted for 2.2.0~rc1 so I might bite the bullet and take the time to implement if no-one does it beforehand (if you do please ping me so we can synchronize)

kit-ty-kate added AREA: REPOSITORY AREA: PERFORMANCE labels Aug 30, 2023

kit-ty-kate mentioned this issue Dec 7, 2023

Requests for comments: how does opam-repository scale? ocaml/opam-repository#23789

Open

kit-ty-kate mentioned this issue Feb 1, 2024

remove the cstruct dependency mirage/ocaml-tar#137

Merged

dra27 mentioned this issue Feb 6, 2024

opam update completely reloads the repository #5824

Open

kit-ty-kate mentioned this issue May 13, 2024

opam.ocaml.org repository takes 10+ minutes to download on Windows #5741

Open

kit-ty-kate mentioned this issue May 21, 2024

Improve the performance of opam update/init on Windows #5966

Draft