
FR: Support large binary files better than Git #2865

Open
PhilipMetzger opened this issue Jan 22, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@PhilipMetzger
Collaborator

Is your feature request related to a problem? Please describe.

In certain industries, such as games and audio production, it would be useful if jj could store audio and art assets efficiently. Currently we'd need to hack this into the Git backend, but if we ever have a native backend it should be simpler. Git LFS is not an alternative.

Describe the solution you'd like

We should support binary files with some kind of content-defined chunking (CDC) mechanism and a special object tag.
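
To make "CDC mechanism" concrete, here is a minimal illustrative sketch of gear-hash chunking in the spirit of FastCDC. The constants, names, and the gear-table derivation are placeholders, not a concrete design for a jj backend:

```rust
// Illustrative content-defined chunking (CDC) with a rolling "gear" hash,
// in the spirit of FastCDC. All constants and names are placeholders.

use std::collections::HashSet;
use std::ops::Range;

const MIN_CHUNK: usize = 2 * 1024; // never cut before 2 KiB
const MAX_CHUNK: usize = 64 * 1024; // always cut by 64 KiB
const AVG_MASK: u64 = (1 << 13) - 1; // ~8 KiB average chunk size

/// Deterministic pseudo-random table: one 64-bit value per possible byte.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = 0u64;
    for entry in table.iter_mut() {
        // splitmix64 step
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into variable-length chunks whose boundaries depend only on
/// the bytes near the boundary, so an insertion only disturbs nearby chunks.
fn chunk_boundaries(data: &[u8]) -> Vec<Range<usize>> {
    let gear = gear_table();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut end = data.len().min(start + MAX_CHUNK);
        let mut hash: u64 = 0;
        for (i, &byte) in data[start..end].iter().enumerate() {
            hash = (hash << 1).wrapping_add(gear[byte as usize]);
            // Cut where the rolling hash hits the mask, but respect MIN_CHUNK.
            if i + 1 >= MIN_CHUNK && hash & AVG_MASK == 0 {
                end = start + i + 1;
                break;
            }
        }
        chunks.push(start..end);
        start = end;
    }
    chunks
}

fn main() {
    // ~1 MiB of deterministic pseudo-random data (xorshift64 stream).
    let mut state = 42u64;
    let mut a = Vec::with_capacity(1 << 20);
    while a.len() < 1 << 20 {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        a.extend_from_slice(&state.to_le_bytes());
    }
    // The "edited" file: 7 bytes inserted at the very front.
    let mut b = vec![0u8; 7];
    b.extend_from_slice(&a);

    let chunk_set = |data: &[u8]| -> HashSet<Vec<u8>> {
        chunk_boundaries(data).into_iter().map(|r| data[r].to_vec()).collect()
    };
    let (ca, cb) = (chunk_set(&a), chunk_set(&b));
    println!(
        "a: {} chunks, b: {} chunks, identical chunks: {}",
        ca.len(),
        cb.len(),
        ca.intersection(&cb).count()
    );
}
```

Because the cut points depend only on the bytes near the boundary, inserting a few bytes near the start of a large asset only disturbs the chunks around the edit instead of shifting every fixed-size block.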

Additional context
Discord conversations over the last 1.5 years, and the FastCDC paper for those who are interested.

I found only one conversation here.

@PhilipMetzger PhilipMetzger added the enhancement New feature or request label Jan 22, 2024
@ilyagr
Collaborator

ilyagr commented Jan 22, 2024

Naively, the data model of Restic (https://restic.net/) seems like it would work for our use case. In fact, we could just have a restic repository as a backend, which would make binary file support work immediately with many storage backends (e.g. AWS, Backblaze B2, anything supported by rclone). Each commit could correspond to a Restic snapshot.

One possible limitation of Restic is that I'm not sure it's very good at pushing/pulling snapshots between the repositories. It has this functionality, but I'm not sure whether it's polished and optimized for our use-case.

This assumes that Restic's deduplication strategy is good enough and works well for pulling up files based on their hashes. I am not an expert on it, but I also haven't heard many complaints. I'm not sure whether it uses the same CDC mechanism as Philip mentioned.

Another equally popular backup tool is https://borgbackup.readthedocs.io/en/stable/. I don't have a strong opinion between the two; I'm highlighting Restic because I've tried it and its community seems pleasant.

Restic is written in Go, but there also exists https://github.com/rustic-rs/rustic.

@necauqua
Collaborator

necauqua commented Jan 26, 2024

Reading up on both of those, Borg seems more interesting.

In particular, this part stood out to me (I mean, it's right there almost immediately on the main page, heh):

Deduplication based on content-defined chunking is used to reduce the number of bytes stored: each file is split into a number of variable length chunks and only chunks that have never been seen before are added to the repository.

A chunk is considered duplicate if its id_hash value is identical. A cryptographically strong hash or MAC function is used as id_hash, e.g. (hmac-)sha256.

To deduplicate, all the chunks in the same repository are considered, no matter whether they come from different machines, from previous backups, from the same backup or even from the same single file.

Compared to other deduplication approaches, this method does NOT depend on:

  • file/directory names staying the same: So you can move your stuff around without killing the deduplication, even between machines sharing a repo.
  • complete files or time stamps staying the same: If a big file changes a little, only a few new chunks need to be stored - this is great for VMs or raw disks.
  • The absolute position of a data chunk inside a file: Stuff may get shifted and will still be found by the deduplication algorithm.
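
To make the quoted scheme concrete, here is a minimal sketch of deduplication keyed by a chunk digest. This is not Borg's (or Restic's) actual code; it assumes the `sha2` crate and uses fixed-size chunks for brevity where a real store would reuse content-defined boundaries:

```rust
// Minimal chunk deduplication keyed by a cryptographic hash, in the spirit of
// the Borg description above. Assumes the `sha2` crate; fixed-size chunks for
// brevity.

use sha2::{Digest, Sha256};
use std::collections::HashMap;

/// A chunk's id is the SHA-256 digest of its contents.
type ChunkId = [u8; 32];

#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<ChunkId, Vec<u8>>,
}

impl ChunkStore {
    /// Store the chunk only if its id has never been seen; return the id.
    fn put(&mut self, chunk: &[u8]) -> ChunkId {
        let id: ChunkId = Sha256::digest(chunk).into();
        self.chunks.entry(id).or_insert_with(|| chunk.to_vec());
        id
    }
}

fn main() {
    let mut store = ChunkStore::default();

    // Two 64 KiB "files" that differ by a single byte.
    let file_a: Vec<u8> = (0..64 * 1024).map(|i| (i % 251) as u8).collect();
    let mut file_b = file_a.clone();
    file_b[40_000] ^= 0xFF;

    // 4 KiB fixed chunks for the sake of the example.
    let ids_a: Vec<ChunkId> = file_a.chunks(4096).map(|c| store.put(c)).collect();
    let ids_b: Vec<ChunkId> = file_b.chunks(4096).map(|c| store.put(c)).collect();

    // Both files reference 16 chunks each, but only the chunk containing the
    // edited byte gets stored a second time.
    println!(
        "chunks referenced: {}, unique chunks stored: {}",
        ids_a.len() + ids_b.len(),
        store.chunks.len()
    );
}
```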

In fact, thanks for showing that these exist; it's about time I back up a few of the more important gigs of stuff so I'm not a single drive failure away from a total disaster, lol.

@ilyagr
Collaborator

ilyagr commented Jan 26, 2024

For Restic, there's a lot of info here: https://restic.readthedocs.io/en/latest/100_references.html#design, though the snippet about deduplication is short. I'm hoping to look at it more carefully one day.

@PhilipMetzger
Collaborator Author

First of all, sorry for the late response here.

Naively, the data model of Restic (https://restic.net/) seems like it would work for our use case. In fact, we could just have a restic repository as a backend, which would make binary file support work immediately with many storage backends (e.g. AWS, Backblaze B2, anything supported by rclone). Each commit could correspond to a Restic snapshot.

One possible limitation of Restic is that I'm not sure it's very good at pushing/pulling snapshots between the repositories. It has this functionality, but I'm not sure whether it's polished and optimized for our use-case.

A Restic backend could solve the use case, but it doesn't actually achieve what I want. I want better support for large files natively, which probably involves some hacks on the Git model (the object tag/storage strategy I hinted at).

For the native repo we should just use a CDC mechanism anyway.
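
Purely to illustrate what that object tag could mean (none of these types exist in jj today; the names, fields, and the toy in-memory store are invented), a large file could be recorded as an ordered list of chunk ids behind a distinct object tag, instead of a single giant blob:

```rust
// Hypothetical sketch only: one possible shape for a "chunked large file"
// object in a native backend. Nothing here is jj's actual data model.

use std::collections::HashMap;

/// Content address of a single chunk (e.g. a 32-byte digest).
type ChunkId = [u8; 32];

/// Reference to one chunk plus its length, so byte ranges can be located
/// without fetching every chunk (useful for partial or streaming reads).
struct ChunkRef {
    id: ChunkId,
    len: u32,
}

/// Instead of one giant blob, a large file is recorded as an ordered list of
/// chunk ids. A tree entry would point at this object with a distinct tag
/// ("chunked file" rather than "file"), so small files keep the cheap
/// single-blob path.
struct ChunkedFile {
    total_size: u64,
    chunks: Vec<ChunkRef>,
}

impl ChunkedFile {
    /// Reassemble the file by fetching each chunk from some store.
    fn materialize(&self, fetch: impl Fn(&ChunkId) -> Vec<u8>) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.total_size as usize);
        for chunk in &self.chunks {
            let bytes = fetch(&chunk.id);
            debug_assert_eq!(bytes.len(), chunk.len as usize);
            out.extend_from_slice(&bytes);
        }
        out
    }
}

fn main() {
    // Toy chunk store with two chunks.
    let mut store: HashMap<ChunkId, Vec<u8>> = HashMap::new();
    store.insert([1; 32], b"hello ".to_vec());
    store.insert([2; 32], b"world".to_vec());

    let file = ChunkedFile {
        total_size: 11,
        chunks: vec![
            ChunkRef { id: [1; 32], len: 6 },
            ChunkRef { id: [2; 32], len: 5 },
        ],
    };
    let bytes = file.materialize(|id| store[id].clone());
    assert_eq!(bytes, b"hello world".to_vec());
    println!("reassembled {} bytes", bytes.len());
}
```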

This assumes that Restic's deduplication strategy is good enough and works well for pulling up files based on their hashes. I am not an expert on it, but I also haven't heard many complaints. I'm not sure whether it uses the same CDC mechanism as Philip mentioned.

It probably won't solve the use case for this issue, but it should be a good trial run for an alternative backend.
