FR: Support large binary files better than Git #2865

Open
PhilipMetzger opened this issue Jan 22, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@PhilipMetzger
Contributor

Is your feature request related to a problem? Please describe.

In certain industries, such as games and audio, it would be useful if jj could store audio and art assets efficiently. Currently we'd need to hack this into the Git backend, but if we ever have a native backend it should be simpler. Git LFS is not an alternative.

Describe the solution you'd like

We should support binary files with some kind of content-defined chunking (CDC) mechanism and a special object tag.
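
To make the CDC idea concrete, here is a minimal sketch of a gear-hash chunker in Python. It is a simplified illustration in the spirit of FastCDC, not the algorithm from the paper and not anything jj implements. The names `chunk_boundaries` and `split`, the size limits, and the seeded gear table are all assumptions made up for this example.

```python
import random

# Hypothetical parameters, chosen only for illustration.
MIN_SIZE = 2 * 1024       # never cut a chunk shorter than this
MAX_SIZE = 64 * 1024      # always cut once a chunk reaches this size
AVG_BITS = 13             # cut when the low 13 hash bits are all zero
MASK = (1 << AVG_BITS) - 1

# Gear table: one fixed pseudo-random 64-bit value per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offset of every chunk in `data`."""
    cuts = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling gear hash: old bytes shift out of the low bits over time,
        # so the cut decision depends only on the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i + 1 - start
        if size < MIN_SIZE:
            continue
        if (h & MASK) == 0 or size >= MAX_SIZE:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # final, possibly short, chunk
    return cuts


def split(data: bytes) -> list[bytes]:
    """Split `data` into content-defined chunks."""
    chunks = []
    prev = 0
    for cut in chunk_boundaries(data):
        chunks.append(data[prev:cut])
        prev = cut
    return chunks
```

Because a cut depends on the bytes just before it rather than on the absolute offset, a small edit to a large asset tends to change only the chunks near the edit.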

Additional context
Discord conversations throughout the last 1.5 years. And the FastCDC paper for those who are interested.

I found only one conversation here.

@PhilipMetzger PhilipMetzger added the enhancement New feature or request label Jan 22, 2024
@ilyagr
Contributor

ilyagr commented Jan 22, 2024

Naively, the data model of Restic (https://restic.net/) seems like it would work for our use case. In fact, we could just have a Restic repository as a backend, which would make binary file support work immediately with many storage backends (e.g. AWS, Backblaze B2, anything supported by rclone). Each commit could correspond to a Restic snapshot.

One possible limitation of Restic is that I'm not sure it's very good at pushing/pulling snapshots between the repositories. It has this functionality, but I'm not sure whether it's polished and optimized for our use-case.

This assumes that Restic's deduplication strategy is good enough and works well for pulling up files based on their hashes. I am not an expert on it, but I also haven't heard many complaints. I'm not sure whether it uses the same CDC mechanism as Philip mentioned.

Another equally popular backup tool is Borg (https://borgbackup.readthedocs.io/en/stable/). I don't have a strong opinion between the two; I'm highlighting Restic because I've tried it and its community seems pleasant.

Restic is written in Go, but there also exists https://github.com/rustic-rs/rustic.

@necauqua
Contributor

necauqua commented Jan 26, 2024

Reading up on both of those, Borg seems more interesting.

Particularly, this part stood out for me (I mean, it's out there almost immediately on the main page heh)

Deduplication based on content-defined chunking is used to reduce the number of bytes stored: each file is split into a number of variable length chunks and only chunks that have never been seen before are added to the repository.

A chunk is considered duplicate if its id_hash value is identical. A cryptographically strong hash or MAC function is used as id_hash, e.g. (hmac-)sha256.

To deduplicate, all the chunks in the same repository are considered, no matter whether they come from different machines, from previous backups, from the same backup or even from the same single file.

Compared to other deduplication approaches, this method does NOT depend on:

  • file/directory names staying the same: So you can move your stuff around without killing the deduplication, even between machines sharing a repo.
  • complete files or time stamps staying the same: If a big file changes a little, only a few new chunks need to be stored - this is great for VMs or raw disks.
  • The absolute position of a data chunk inside a file: Stuff may get shifted and will still be found by the deduplication algorithm.
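
The properties quoted above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Borg's implementation: the chunker is a toy gear-hash splitter with made-up parameters, `ChunkStore` is a hypothetical in-memory store, and plain SHA-256 stands in for the `id_hash` (Borg may use a keyed MAC instead).

```python
import hashlib
import random

# Toy content-defined chunker (hypothetical parameters, no maximum size).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK = (1 << 11) - 1   # roughly 2 KiB between cut points on random data
MIN_SIZE = 256


def split(data: bytes) -> list[bytes]:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if i + 1 - start >= MIN_SIZE and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """In-memory store keyed by chunk id; a chunk already seen is not stored again."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def add_file(self, data: bytes) -> list[str]:
        """Store a file and return the ordered list of its chunk ids."""
        ids = []
        for chunk in split(data):
            chunk_id = hashlib.sha256(chunk).hexdigest()  # assumed id_hash
            self.chunks.setdefault(chunk_id, chunk)
            ids.append(chunk_id)
        return ids

    def read_file(self, ids: list[str]) -> bytes:
        """Reassemble a file from its chunk ids."""
        return b"".join(self.chunks[i] for i in ids)


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(100_000))
    edited = b"a few inserted bytes" + original  # shifts every later byte

    store = ChunkStore()
    store.add_file(original)
    before = len(store.chunks)
    ids = store.add_file(edited)
    print("chunks in edited file:", len(ids))
    print("new chunks stored:", len(store.chunks) - before)
```

File names never enter the picture, and inserting bytes at the front shifts all later content, yet most chunk ids are expected to stay the same because the cut points follow the content.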

In fact, thanks for showing those exist, it's about time I back up a few of the more important gigs of things to not be a single drive failure away from a total disaster lol

@ilyagr
Contributor

ilyagr commented Jan 26, 2024

For Restic, there's a lot of info here: https://restic.readthedocs.io/en/latest/100_references.html#design, though the snippet about deduplication is short. I'm hoping to look at it more carefully one day.

@PhilipMetzger
Contributor Author

First off, sorry for the late response here.

> Naively, the data model of Restic (https://restic.net/) seems like it would work for our use case. In fact, we could just have a Restic repository as a backend, which would make binary file support work immediately with many storage backends (e.g. AWS, Backblaze B2, anything supported by rclone). Each commit could correspond to a Restic snapshot.
>
> One possible limitation of Restic is that I'm not sure it's very good at pushing/pulling snapshots between the repositories. It has this functionality, but I'm not sure whether it's polished and optimized for our use-case.

A Restic backend could solve the use case, but it doesn't actually achieve what I want. I want better native support for large files, which probably involves some hacks on the Git model (the object tag/storage strategy I hinted at).

For the native repo we should just use a CDC mechanism anyway.

> This assumes that Restic's deduplication strategy is good enough and works well for pulling up files based on their hashes. I am not an expert on it, but I also haven't heard many complaints. I'm not sure whether it uses the same CDC mechanism as Philip mentioned.

It probably won't solve the use case for this bug, but it should be a good trial run for an alternative backend.
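
For anyone wondering what an "object tag" could look like, here is a rough Python sketch. Only the Git-style hashing of `<type> <size>\0<payload>` with SHA-1 reflects how Git ids its objects; the `chunked-blob` tag, the manifest layout (one chunk id per line), and the function names are purely hypothetical and not a proposal anyone in this thread has specified.

```python
import hashlib


def object_id(tag: bytes, payload: bytes) -> str:
    """Hash an object Git-style: SHA-1 over '<tag> <size>' + NUL + payload."""
    header = tag + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content: bytes) -> str:
    """Id of an ordinary Git blob."""
    return object_id(b"blob", content)


# Hypothetical tag for large binaries; this is NOT an existing Git object type.
CHUNKED_TAG = b"chunked-blob"


def chunked_blob(chunks: list[bytes]) -> tuple[str, bytes]:
    """Build a manifest object that lists the id of every chunk, one per line.

    The chunks themselves would be stored as separate objects; only the
    manifest carries the special tag.
    """
    payload = b"".join(blob_id(c).encode() + b"\n" for c in chunks)
    return object_id(CHUNKED_TAG, payload), payload


if __name__ == "__main__":
    manifest_id, manifest = chunked_blob([b"first chunk", b"second chunk"])
    print(manifest_id)
    print(manifest.decode(), end="")
```

In this sketch the chunk boundaries would come from whatever CDC mechanism is chosen, and the tag lets a backend tell a chunk manifest apart from a regular blob.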


3 participants