git tree object as alternative to NAR #1006

Open
Ericson2314 opened this issue Jul 30, 2016 · 21 comments

@Ericson2314
Member

NAR has gotten us a long way, but one limitation is that it cannot support deduplication, because only the outermost directory gets a hash.

The Git tree object is perhaps not the greatest format, but because git is so widely used, it is likely to have better support among external tools. I think it makes a fine de facto standard.
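To make the contrast with NAR concrete, here is a minimal sketch of how git assigns a hash to every blob and every subtree, not just the root. It assumes SHA-1 object ids and only regular, non-executable files; the nested-dict input format and the names `git_object_id`, `hash_tree`, `pkg_a` and `pkg_b` are made up for illustration and are not part of Nix or git.

```python
import hashlib

def git_object_id(kind: bytes, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<kind> <size>\\0<body>'."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def hash_tree(tree: dict, seen: dict) -> str:
    """Hash a nested dict {name: bytes | dict}, recording every object id in `seen`."""
    entries = []
    for name, node in tree.items():
        raw_name = name.encode()
        if isinstance(node, dict):
            mode, oid = b"40000", hash_tree(node, seen)
            sort_key = raw_name + b"/"  # git sorts directories as if they ended in '/'
        else:
            mode, oid = b"100644", git_object_id(b"blob", node)
            seen[oid] = "blob"
            sort_key = raw_name
        # A tree entry is '<mode> <name>\0' followed by the raw 20-byte object id.
        entries.append((sort_key, mode + b" " + raw_name + b"\0" + bytes.fromhex(oid)))
    body = b"".join(entry for _, entry in sorted(entries))
    oid = git_object_id(b"tree", body)
    seen[oid] = "tree"
    return oid

# Two hypothetical store objects that differ only in one binary.
pkg_a = {"bin": {"tool": b"v1"}, "share": {"doc.txt": b"docs"}}
pkg_b = {"bin": {"tool": b"v2"}, "share": {"doc.txt": b"docs"}}

seen_a, seen_b = {}, {}
root_a = hash_tree(pkg_a, seen_a)
root_b = hash_tree(pkg_b, seen_b)
shared = set(seen_a) & set(seen_b)
print("root ids differ:", root_a != root_b)
print("objects shared between the two packages:", len(shared))
```

The root ids differ, yet the `share` subtree and its blob have the same ids in both packages, so a store or transport keyed on these ids could reuse them. With NAR, only the two differing outermost hashes would be visible.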

@domenkozar domenkozar added the feature Feature request or proposal label Aug 2, 2016
@vcunat
Member

vcunat commented Aug 4, 2016

Well, if you store NARs as files in a (true) deduplicating storage/FS, e.g. in IPFS, you will get the insides deduplicated. AFAIK git packfiles aren't that efficient for binary files, which are our main focus.

@Ericson2314
Member Author

@vcunat Yeah I am more interested in the hashing scheme than the exact representation for exchange. I kinda also figured git had enough critical mass that if IPFS or anything else wanted to do transport for it it would want to special-case its hashing schema.

On the other hand, git uses SHA1, which is dubiously secure, and last I checked there was no worked-out plan for migration. This makes me less sure whether this is a good idea.

@Ericson2314
Member Author

ipfs/specs#130: IPFS may soon support git.

@spacekitteh

What about something like SquashFS?

@Ericson2314
Member Author

@spacekitteh that would be, uh, squashed? That means no space/bandwidth saving on identical files.

@rht
Member

rht commented Jan 28, 2017

  • On git's sha1: there has been an ongoing effort to replace the hardcoded sha1's with object_id; here is the latest in the series https://public-inbox.org/git/[email protected]/.
  • Git tree + packfile mapped to IPLD + IPFS: one of the reasons git is git is because its implementation is fast. But currently, this is as technically possible as booting a kernel on js.
  • Git packfile, while not efficient for binaries, is still sufficient for file-level deduplication and allows switching build inputs/deps (of several nix-env's one at a time) through git checkout derivation-hash (git checkout-index could be used for custom path).

If NAR files would be preferably stored with block-level deduplication, I wonder if /nix/store requires only up to file-level deduplication?

NAR: I haven't benchmarked to see the order-of-magnitude range, but btrfs/zfs likely have the fastest block-level deduplication, while IPFS (whether accessed via the POSIX API by being mounted with FUSE, or through the POSIX-like interface of ipfs files) is still much slower. But since NAR files are used mainly for archival purposes (only occasionally accessed), IPFS-for-dedup can almost be used right away as it is now.

/nix/store: (unpacked NAR's?) I think Git packfile could potentially be used here, but only for switching nix-env's. IPFS-for-dedup would be too slow.

(... #859 is too crowded >_<)

@Ericson2314
Member Author

@rht, this is good stuff, but I'd encourage you to be careful about data model vs concrete representations (on disk, wire protocols, or otherwise).

I'm interested in git trees+blobs because the data type is exactly what we use (e.g. yes executable bit, no setuid) and it's widely used. Remember there is no way to convert hashes without access to the hashed data, so it's nice to request hashes many computers in principle already have.
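As a rough illustration of that data type, the sketch below maps a POSIX `st_mode` onto the small set of entry modes a git tree can record. The function name `git_mode` is hypothetical, and gitlinks (mode `160000`, used for submodules) are deliberately left out.

```python
import stat

def git_mode(st_mode: int) -> str:
    """Map a POSIX st_mode to the mode string stored in a git tree entry.

    Only the file type and the owner-executable bit survive; setuid, setgid,
    sticky and group/other permissions cannot be represented.
    """
    if stat.S_ISDIR(st_mode):
        return "40000"
    if stat.S_ISLNK(st_mode):
        return "120000"
    if stat.S_ISREG(st_mode):
        return "100755" if st_mode & stat.S_IXUSR else "100644"
    raise ValueError("git trees cannot represent this file type")

# A setuid executable collapses to a plain executable.
print(git_mode(stat.S_IFREG | 0o4755))
print(git_mode(stat.S_IFREG | 0o600))
```

Devices, FIFOs and sockets raise an error here. NAR has no representation for them either, which is part of why the two data models line up.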

Yes, go-IPFS is probably slow as hell, and git supposedly is bad with large binaries (though I wouldn't be surprised if this is really down to the lack of block-level deduplication in conjunction with keeping history; we wouldn't have that problem, as there is no mandatory history to keep). But we need not use either implementation long term.

@arianvp
Member

arianvp commented Apr 25, 2018

Another alternative could be the catar file format from casync which is a content addressable storage system by Lennart Poettering: https://github.com/systemd/casync

It seems very similar in its goals to NAR (i.e., a reproducible version of tar).

casync itself can then be used as a deduplication mechanism. That's what the project is for (storing chunked versions of catar files).

@ebkalderon

ebkalderon commented Sep 22, 2018

@arianvp catar looks pretty great on paper, but it seems to have some Linux-specific bits in it (systemd/casync#147), making it not quite viable on Darwin. desync is a project which attempts to reimplement as much of upstream casync and catar as possible to be compatible with Darwin, but the catar archives it produces could have slight incompatibilities on Linux and vice versa, according to the README.

@Ericson2314
Member Author

I also don't like the way it doesn't chunk across logical boundaries. The proper solution to more reuse is finer logical boundaries. That heuristic would yield results which are harder to predict and therefore rely on.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/optimise-store-while-building-downloading-from-the-cache/11022/4

@wmertens
Contributor

@Ericson2314 I missed this when you mentioned it, and I had the same thought and worked it out in https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

The formatting isn't great yet and I decided halfway that it can not only work with $cas but also with $out, so that needs some changes.

Summary: by stripping store references from files, git tree objects can achieve maximum deduplication, more than would be achievable by runtime binding wrapper tricks. It would also be amazing to download updates.
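A minimal sketch of the general idea, not of the scheme in the gist: replace the hash part of each embedded store path with a fixed placeholder and keep the (offset, hash) pairs on the side, so files that differ only in their references become byte-identical. The regex, the all-zero placeholder and the function names are assumptions made for illustration, and the whole file is held in memory rather than streamed.

```python
import re

# Assumption: references look like /nix/store/<32 lowercase alphanumerics>-<name>.
STORE_REF = re.compile(rb"/nix/store/([0-9a-z]{32})-")
PLACEHOLDER = b"0" * 32

def strip_refs(data: bytes):
    """Return (stripped bytes, [(offset, original hash), ...])."""
    refs = []

    def replace(match):
        refs.append((match.start(1), match.group(1)))
        return b"/nix/store/" + PLACEHOLDER + b"-"

    # The placeholder has the same length as a hash, so offsets stay valid.
    return STORE_REF.sub(replace, data), refs

def restore_refs(stripped: bytes, refs) -> bytes:
    """Inverse of strip_refs: write the recorded hashes back in place."""
    buf = bytearray(stripped)
    for offset, original in refs:
        buf[offset:offset + len(original)] = original
    return bytes(buf)

# Two hypothetical scripts that differ only in which bash they point at.
script_a = b"#!/nix/store/" + b"a" * 32 + b"-bash/bin/bash\necho hi\n"
script_b = b"#!/nix/store/" + b"b" * 32 + b"-bash/bin/bash\necho hi\n"

stripped_a, refs_a = strip_refs(script_a)
stripped_b, refs_b = strip_refs(script_b)
print("identical after stripping:", stripped_a == stripped_b)
print("round-trips:", restore_refs(stripped_a, refs_a) == script_a)
```

After stripping, both scripts would hash to the same git blob, and the small reference lists are all that needs to be stored per variant.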

@masaeedu
Contributor

@Ericson2314 It's a bit of a tradeoff I think. Logical boundaries will work only as well as your logic does, and your logic must work given an incomplete picture of the world.

You can of course do something very simple where there is no deduplication across slight modifications/patches of a file, but it's worth keeping in mind that something is being lost there.

@wmertens
Contributor

wmertens commented Jul 19, 2021

I had the same idea and wrote it up at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

One difference is that I propose patching out store references so that there's more deduplication.

I once started a script to try it out but got a little stuck on streaming reference recognition.

EDIT: Argh, re-reading the thread I see I already commented on this 🙈. Shouldn't comment on phones in the morning.

@stale

stale bot commented Apr 17, 2022

I marked this as stale due to inactivity.

@stale stale bot added the stale label Apr 17, 2022
@Ericson2314
Member Author

Still interested.

@stale stale bot removed the stale label Apr 17, 2022
@wmertens
Contributor

@Ericson2314 thoughts on using bup instead of git? https://stackoverflow.com/a/19494211

@Ericson2314
Member Author

@wmertens Well, two things:

  1. I want to un-hard-code NAR, so we can be less idiosyncratic and better integrate with other tools and communities.
  2. I want to support Git as part of that un-hardcoding because it is in wide use (for source code) today, regardless of the technical merits.

From a brief look, bup has some interesting qualities, but I don't want to "choose a winner" -- a single best NAR successor.

@wmertens
Contributor

@Ericson2314 is the backing store based on git that I describe at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871 what you have in mind or something different?

Reading that SO answer I do worry that git might be a poor backing store on embedded systems, at least when adding things to the store.
There are also some derivations that put 300MB ISO files in the store, I wonder how much memory git needs for that.

Furthermore, I wonder if I'm not prematurely optimizing the store path patching in my proposal, since bup for sure will find tons of unchanged chunks in binaries that only differ in embedded store paths, thanks to the rolling hash. Git OTOH needs to find a previous file to start from.
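For intuition on why a rolling hash helps here, below is a deliberately simplified content-defined chunker. It is not bup's actual rollsum: the window size, the boundary mask, the plain byte-sum "hash" and the `pseudo_binary` test data are all assumptions chosen to keep the sketch short.

```python
import hashlib

WINDOW = 16   # bytes covered by the rolling sum (illustrative value)
MASK = 0x3F   # cut when the low 6 bits are all ones, roughly every 64 bytes

def chunk(data: bytes) -> list:
    """Split data wherever the sum of the last WINDOW bytes matches MASK."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if i + 1 >= WINDOW and (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def pseudo_binary(blocks: int) -> bytes:
    """Deterministic stand-in for binary content."""
    return b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(blocks))

body = pseudo_binary(256)
# Two builds of the "same" binary that differ only in one embedded store path.
build_a = body[:4096] + b"/nix/store/" + b"a" * 32 + b"-glibc" + body[4096:]
build_b = body[:4096] + b"/nix/store/" + b"b" * 32 + b"-glibc" + body[4096:]

chunks_a, chunks_b = chunk(build_a), chunk(build_b)
shared = set(chunks_a) & set(chunks_b)
print(len(chunks_a), "chunks in build A,", len(shared), "also present in build B")
```

Because each cut depends only on the preceding `WINDOW` bytes, boundaries re-synchronise shortly after the differing store path, so only the chunks around it change. A real implementation would use a stronger rolling hash and enforce minimum and maximum chunk sizes.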

Given that bup uses git packfiles, I wonder if they can be fetched using git, limiting to only what's needed for a certain closure.

As a quick test I'll throw my store through bup to see what ratios I am getting (but not at my computer now)

@wmertens
Contributor

I asked a question on their mailing list https://groups.google.com/g/bup-list/c/WSROvfjwz3M

@Ericson2314
Member Author

@wmertens What I have is just #3635. There are no packfiles or other implementation changes. It just does the normative part of allowing a git tree/blob hash for a content address. It also works just on the level of individual "store objects" (the things store paths point to), rather than being an entire-store design.

This I think is the right "beachhead", after which further work improving the implementation can be done transparently. It will at least allow us to start improving the way we ingest git sources right away.

@stale stale bot added the stale label Oct 30, 2022