git tree object as alternative to NAR #1006

Ericson2314 · 2016-07-30T17:58:07Z

NAR has gotten us a long way, but one limitation is that it cannot support deduplication, because only the outermost directory gets a hash.

The git tree object is perhaps not the greatest format, but because git is so widely used, it is likely to have better support among external tools. I think it makes a fine de facto standard.
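To make the contrast with NAR concrete, here is a minimal sketch of how git derives object IDs for blobs and trees, so that every subdirectory gets its own hash. The nested-dict input and the function names are assumptions made for illustration; every file is treated as a non-executable regular file (mode 100644), and symlinks and submodules are left out.

```python
import hashlib


def git_object_id(kind: bytes, payload: bytes) -> bytes:
    # A git object ID is the SHA-1 of "<kind> <size>" + NUL + payload.
    header = kind + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def hash_blob(data: bytes) -> bytes:
    return git_object_id(b"blob", data)


def hash_tree(node: dict) -> bytes:
    # Illustrative input shape: bytes values are files, dict values are subdirectories.
    entries = []
    for name, child in node.items():
        raw = name.encode()
        if isinstance(child, dict):
            # git sorts directory entries as if their name ended in "/".
            entries.append((raw + b"/", b"40000", raw, hash_tree(child)))
        else:
            entries.append((raw, b"100644", raw, hash_blob(child)))
    entries.sort(key=lambda e: e[0])
    # Each entry is "<mode> <name>" + NUL + the raw 20-byte ID of the child.
    payload = b"".join(mode + b" " + raw + b"\0" + oid for _, mode, raw, oid in entries)
    return git_object_id(b"tree", payload)


# Two trees that differ in one file still share the hash of the unchanged subdirectory.
shared = {"lib.so": b"same bytes"}
a = {"bin": {"tool": b"v1"}, "lib": shared}
b = {"bin": {"tool": b"v2"}, "lib": shared}
assert hash_tree(a) != hash_tree(b)
assert hash_tree(a["lib"]) == hash_tree(b["lib"])
```

Because the hash of `lib` is identical in both trees, a store or transport keyed by these IDs would only need to hold or fetch that subtree once.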

vcunat · 2016-08-04T14:58:35Z

Well, if you store NARs as files in a (true) deduplicating storage/FS, e.g. in IPFS, you will get the insides deduplicated. AFAIK git packfiles aren't that efficient for binary files, which are our main focus.

Ericson2314 · 2016-08-05T23:45:08Z

@vcunat Yeah, I am more interested in the hashing scheme than the exact representation for exchange. I kinda also figured git had enough critical mass that if IPFS or anything else wanted to do transport for it, it would want to special-case its hashing scheme.

On the other hand, git uses SHA1, which is dubiously secure, and last I checked there is no worked-out plan for migration. This makes me less sure whether this is a good idea.

Ericson2314 · 2016-08-23T17:01:21Z

ipfs/specs#130 IPFS may soon support git.

spacekitteh · 2016-11-27T08:02:41Z

What about something like SquashFS?

Ericson2314 · 2016-11-27T21:05:51Z

@spacekitteh that would be, uh, squashed? That means no space/bandwidth saving on identical files.

rht · 2017-01-28T14:00:51Z

On git's sha1: there has been an ongoing effort to replace the hardcoded sha1's with object_id; here is the latest in the series https://public-inbox.org/git/[email protected]/.
Git tree + packfile mapped to IPLD + IPFS: one of the reasons git is git is because its implementation is fast. But currently, this is as technically possible as booting a kernel on js.
Git packfile, while not efficient for binaries, is still sufficient for file-level deduplication and allows switching build inputs/deps (of several nix-env's one at a time) through git checkout derivation-hash (git checkout-index could be used for custom path).

If NAR files would be preferably stored with block-level deduplication, I wonder if /nix/store requires only up to file-level deduplication?
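One way to get a feel for that question is to measure how much file-level deduplication alone would save on a given directory tree. The following is only a rough sketch: the function name is made up for illustration, it compares whole regular files by SHA-256, skips symlinks, and ignores metadata such as the executable bit.

```python
import hashlib
import os


def file_level_savings(root: str) -> tuple[int, int]:
    """Return (total_bytes, unique_bytes) over the regular files under root."""
    size_by_digest = {}
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # only regular files are compared
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB blocks so large files are not loaded whole.
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            size = os.path.getsize(path)
            total += size
            size_by_digest[h.digest()] = size
    return total, sum(size_by_digest.values())
```

Calling `file_level_savings("/nix/store")` would give the upper bound on what whole-file deduplication can reclaim; anything beyond that needs block-level or chunk-level techniques.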

NAR: I haven't benchmarked to see the order-of-magnitude range, but btrfs/zfs likely have the fastest block-level deduplication, while IPFS (whether accessed via the POSIX API by being mounted with FUSE, or through the POSIX-like interface of ipfs files) is still much slower. But since NAR files are used mainly for archival purposes (only occasionally accessed), IPFS-for-dedup can almost be used right away as of now.

/nix/store: (unpacked NAR's?) I think Git packfile could potentially be used here, but only for switching nix-env's. IPFS-for-dedup would be too slow.

(... #859 is too crowded >_<)

Ericson2314 · 2017-01-28T18:36:48Z

@rht, this is good stuff, but I'd encourage you to be careful about data model vs concrete representations (on disk, wire protocols, or otherwise).

I'm interested in git trees+blobs because the data type is exactly what we use (e.g. yes executable bit, no setuid) and it's widely used. Remember there is no way to convert hashes without access to the hashed data, so it's nice to request hashes many computers in principle already have.

Yes, go-IPFS is probably slow as hell, and git supposedly is bad with large binaries (though I wouldn't be surprised if that is really down to the lack of block-level deduplication in conjunction with keeping history; we wouldn't have that problem, as there is no mandatory history to keep). But we need not use either implementation long term.
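The data-model point (executable bit yes, setuid no) can be illustrated by how a POSIX file mode collapses into the handful of modes a git tree entry can record. This is a small sketch with a hypothetical helper name; it leaves out submodule entries, which git trees also support.

```python
import stat


def git_mode(st_mode: int) -> str:
    """Map a POSIX st_mode to the mode string git records in a tree entry."""
    if stat.S_ISDIR(st_mode):
        return "40000"
    if stat.S_ISLNK(st_mode):
        return "120000"
    if stat.S_ISREG(st_mode):
        # Only the owner-executable bit survives; setuid, setgid and the
        # remaining permission bits are not part of the data model.
        return "100755" if st_mode & stat.S_IXUSR else "100644"
    raise ValueError("not representable in a git tree: %o" % st_mode)


# A setuid executable and a plain executable are indistinguishable in a tree.
assert git_mode(stat.S_IFREG | 0o4755) == git_mode(stat.S_IFREG | 0o755)
```

Devices, FIFOs and sockets have no representation at all, which is why the helper raises for them.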

arianvp · 2018-04-25T14:36:17Z

Another alternative could be the catar file format from casync which is a content addressable storage system by Lennart Poettering: https://github.com/systemd/casync

It seems very similar in its goals to NAR (i.e., a reproducible version of tar).

casync itself can then be used as a deduplication mechanism. That's what the project is for (storing chunked versions of catar files).

ebkalderon · 2018-09-22T13:16:54Z

@arianvp catar looks pretty great on paper, but it seems to have some Linux-specific bits in it (systemd/casync#147), making it not quite viable on Darwin. desync is a project which attempts to reimplement as much of upstream casync and catar as possible to be compatible with Darwin, but the catar archives it produces could have slight incompatibilities on Linux and vice versa, according to the README.

Ericson2314 · 2018-09-22T18:09:29Z

I also don't like the way it doesn't chunk across logical boundaries. The proper solution to more reuse is finer logical boundaries. That heuristic would yield results which are harder to predict and therefore rely on.

nixos-discourse · 2021-01-17T21:41:45Z

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/optimise-store-while-building-downloading-from-the-cache/11022/4

wmertens · 2021-01-24T20:30:31Z

@Ericson2314 I missed this when you mentioned it, and I had the same thought and worked it out in https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

The formatting isn't great yet and I decided halfway that it can not only work with $cas but also with $out, so that needs some changes.

Summary: by stripping store references from files, git tree objects can achieve maximum deduplication, more than would be achievable by runtime binding wrapper tricks. It would also be amazing to download updates.
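As a rough illustration of the stripping idea (not the design from the gist), the sketch below zeroes out the hash part of every store reference in a byte string and records where each one was, so the original can be restored later. The assumptions are that references have the form `/nix/store/<32 characters of Nix's base-32 alphabet>-<name>`, that the whole file fits in memory, and that the helper names are purely illustrative.

```python
import re

# Assumption: references look like /nix/store/<32-char base-32 hash>-<name>.
STORE_REF = re.compile(rb"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32})-")
PLACEHOLDER = b"0" * 32


def strip_references(data: bytes) -> tuple[bytes, list[tuple[int, bytes]]]:
    """Zero out each store hash; return the stripped bytes and (offset, hash) pairs."""
    refs = [(m.start(1), m.group(1)) for m in STORE_REF.finditer(data)]
    # The placeholder has the same length as a hash, so offsets stay valid.
    stripped = STORE_REF.sub(b"/nix/store/" + PLACEHOLDER + b"-", data)
    return stripped, refs


def restore_references(stripped: bytes, refs: list[tuple[int, bytes]]) -> bytes:
    out = bytearray(stripped)
    for offset, digest in refs:
        out[offset:offset + len(digest)] = digest
    return bytes(out)
```

Two builds of the same binary that differ only in the store paths they embed would then strip to identical bytes and hash to the same blob, with the small list of `(offset, hash)` pairs kept alongside. A streaming variant would additionally have to handle references that straddle read-buffer boundaries.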

masaeedu · 2021-07-19T05:19:08Z

@Ericson2314 It's a bit of a tradeoff I think. Logical boundaries will work only as well as your logic does, and your logic must work given an incomplete picture of the world.

You can of course do something very simple where there is no deduplication across slight modifications/patches of a file, but it's worth keeping in mind that something is being lost there.

wmertens · 2021-07-19T05:36:48Z

I had the same idea and wrote it up at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

One difference is that I propose patching out store references so that there's more deduplication.

I once started a script to try it out but got a little stuck on streaming reference recognition.

EDIT: Argh, re-reading the thread I see I already commented on this 🙈. Shouldn't comment on phones in the morning.

stale · 2022-04-17T13:26:35Z

I marked this as stale due to inactivity.

Ericson2314 · 2022-04-17T16:49:14Z

Still interested.

wmertens · 2022-04-17T20:24:09Z

@Ericson2314 thoughts on using bup instead of git? https://stackoverflow.com/a/19494211

Ericson2314 · 2022-04-18T00:42:00Z

@wmertens Well, two things:

I want to un-hard-code NAR, so we can be less idiosyncratic and better integrate with other tools and communities.
I want to support Git as part of that un-hardcoding because it is in wide use (for source code) today, regardless of the technical merits.

From a brief look, bup has some interesting qualities, but I don't want to "choose a winner" -- a single best NAR successor.

wmertens · 2022-04-18T09:15:12Z

@Ericson2314 is the backing store based on git that I describe at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871 what you have in mind or something different?

Reading that SO answer I do worry that git might be a poor backing store on embedded systems, at least when adding things to the store.
There are also some derivations that put 300MB ISO files in the store, I wonder how much memory git needs for that.

Furthermore, I wonder if I'm not prematurely optimizing the store path patching in my proposal, since bup for sure will find tons of unchanged chunks in binaries that only differ in embedded store paths, thanks to the rolling hash. Git OTOH needs to find a previous file to start from.

Given that bup uses git packfiles, I wonder if they can be fetched using git, limiting to only what's needed for a certain closure.

As a quick test I'll throw my store through bup to see what ratios I am getting (but not at my computer now)
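For readers unfamiliar with the rolling-hash idea mentioned above, here is a minimal content-defined chunking sketch. It is not bup's actual algorithm: it uses a simple "gear" style hash with a randomly generated table, and the table seed, mask and minimum chunk size are arbitrary illustrative choices.

```python
import random

_rng = random.Random(0)  # fixed seed so chunk boundaries are reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK = (1 << 11) - 1  # a boundary roughly every 2 KiB on average


def chunks(data: bytes, min_size: int = 256) -> list[bytes]:
    """Split data at positions chosen by the content, not by fixed offsets."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the low bits, so the boundary test
        # depends only on the most recent few bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if i + 1 - start >= min_size and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out


def shared_chunks(old: bytes, new: bytes) -> int:
    """Count chunks of `new` that already exist among the chunks of `old`."""
    known = set(chunks(old))
    return sum(1 for c in chunks(new) if c in known)
```

Because boundaries depend on nearby content, changing or inserting a few bytes in the middle of a file typically disturbs only the chunks around the change, and the rest can be reused without needing a previous version of the file to diff against.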

wmertens · 2022-04-18T10:23:45Z

I asked a question on their mailing list https://groups.google.com/g/bup-list/c/WSROvfjwz3M

Ericson2314 · 2022-04-18T17:01:03Z

@wmertens What I have is just #3635. There are no packfiles or other implementation changes. It just does the normative part of allowing a git tree/blob hash for a content address. It also works just on the level of individual "store objects" (the things store paths point to), rather than being an entire-store design.

This I think is the right "beachhead", after which further work improving the implementation can be done transparently. It will at least allow us to start improving the way we ingest git sources right away.

Ericson2314 mentioned this issue Jul 31, 2016

Nix and IPFS #859

Open

domenkozar added the feature Feature request or proposal label Aug 2, 2016

This was referenced Jan 29, 2017

RFC: Add IPFS to Nix #1167

Closed

Build Input Caching NixIPFS/notes#1

Open

CMCDragonkai mentioned this issue Mar 6, 2017

Integrating IPFS (Haskell Implementation) MatrixAI/Forge-Package-Archiving#1

Closed

shlevy added the backlog label Apr 1, 2018

shlevy assigned copumpkin Apr 1, 2018

Ericson2314 mentioned this issue Mar 24, 2020

[RFC 0017] Intensional Store NixOS/rfcs#17

Draft

domenkozar removed the backlog label Apr 30, 2020

stale bot added the stale label Apr 17, 2022

stale bot removed the stale label Apr 17, 2022

stale bot added the stale label Oct 30, 2022
