Content-dependent chunker #183
Have a look at ipfs-inactive/archives#142
Came across this while looking for an answer to: is there any way the chunking mechanism can be customized by developers (when using js-ipfs, libp2p, etc.)?
This apparently is being objected to by the … My personal opinion is that:
I am a go-ipfs dev now?
Please do not state what another person thinks in their place without quoting them and linking to the full picture. I'm actually planning to experiment with some content-based chunking myself soonish. What I was saying in the conversation we had on the Discord is that content-based chunking is very inefficient at reducing bundle sizes and saving space. Content-based chunking is only really effective for file formats that contain other files, where you want to store both the container and the contained files, for example … Note that it does have a positive impact on latency if you know a file's access layout and the access seeks to a precise place, but that effect is really tiny given the current block size.
@Jorropo the reason I tagged you is so you could correct the record of what you personally said (as you did). Also, I did provide a link to the content on Discord. But enough about that; please presume good intentions until proven otherwise.

I think compression serves only one of the goals (data size, storage/retrieval speed) but not the other (content lineage, large number of variations). These are basically different use cases. When we know nothing about the data itself, Rabin-Karp is probably best, followed by some general-purpose compression (ZSTD is great). I posted some Windows Server deduplication stats (it uses Rabin-Karp) elsewhere, and they look pretty good on data containing VMs, for instance, but also on other non-specific data.

The more important point I was trying to make is that we need some dials to select codec(s), chunk size, etc. to accommodate different use cases. That also presumes a pluggable codec architecture in all the main stacks, so we can avoid debates over what is and isn't included in the stack itself. I would like to choose the codecs and their parameters to pre-load on my IPFS node. The drawback is that pluggable non-default codecs partition the data space into those who can read it and those who can't, but this can be remedied by providing codec addresses (which should also be in IPFS, under some trust hierarchies) and a default binary decoder for cases where someone doesn't trust the particular codec or it isn't available.
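To make the "dials" concrete, here is a minimal sketch (not from this thread) of a content-defined chunker driven by a Rabin-Karp style rolling hash. The window size, boundary mask, and min/max limits are illustrative values I picked for the example, not defaults of any IPFS implementation; they are exactly the kind of knobs being discussed: the mask sets the average chunk size, the window sets boundary sensitivity, and min/max bound the extremes.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Tunable "dials" (illustrative values, not defaults of any IPFS implementation).
const (
	windowSize   = 48            // rolling-hash window in bytes
	minChunk     = 16 << 10      // 16 KiB lower bound
	maxChunk     = 1 << 20       // 1 MiB upper bound
	boundaryMask = (1 << 18) - 1 // ~256 KiB average chunk size
	prime        = 1099511628211 // multiplier for the polynomial hash
)

// pow is prime^(windowSize-1) mod 2^64, used to drop the outgoing byte.
var pow = func() uint64 {
	p := uint64(1)
	for i := 0; i < windowSize-1; i++ {
		p *= prime
	}
	return p
}()

// chunkSizes reads r and returns the sizes of content-defined chunks.
func chunkSizes(r io.Reader) ([]int, error) {
	br := bufio.NewReader(r)
	var (
		sizes  []int
		window [windowSize]byte
		hash   uint64
		n      int // bytes accumulated in the current chunk
	)
	for {
		b, err := br.ReadByte()
		if err == io.EOF {
			if n > 0 {
				sizes = append(sizes, n)
			}
			return sizes, nil
		}
		if err != nil {
			return nil, err
		}
		// Rolling update: remove the byte leaving the window, add the new one.
		out := window[n%windowSize]
		window[n%windowSize] = b
		hash = (hash-uint64(out)*pow)*prime + uint64(b)
		n++
		// Cut a chunk when the hash hits the mask, subject to the min/max dials.
		if (n >= minChunk && hash&boundaryMask == boundaryMask) || n >= maxChunk {
			sizes = append(sizes, n)
			n, hash = 0, 0
			window = [windowSize]byte{}
		}
	}
}

func main() {
	data := strings.Repeat("some input that is long enough to produce a few chunks ", 50000)
	sizes, err := chunkSizes(strings.NewReader(data))
	if err != nil {
		panic(err)
	}
	fmt.Println("chunk sizes:", sizes)
}
```

Whether the rolling hash is a true Rabin fingerprint over GF(2) polynomials or a simpler multiplicative hash as above, the shape stays the same; exposing these parameters is what would turn the chunker into a usable "dial".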
@andrey-savov from reading your post and the channel you seem to be conflating multiple different types of extensibility here:
Taking a look, there are generally three different extensibility points you could use here. I think they were largely covered in the Matrix thread, but for posterity/future discussion it's likely easier to track them here. Note that, in general, developers I've encountered within the IPFS ecosystem try to make things as extensible as they can without making UX/DX miserable. If you find an area that is insufficiently extensible and have a concrete proposal for how to make things better, feel free to open an issue focused on your particular proposal.

Using a custom (UnixFS) chunker

UnixFS is widely supported all over the IPFS ecosystem. You can write a custom chunker that takes a file and chunks it up in a way that existing IPFS implementations can easily deal with. For example, it's very doable to do … If you're writing in Go you can even make your chunker fulfill the interfaces from https://github.com/ipfs/go-ipfs-chunker and then try to upstream your changes into projects like go-ipfs (a sketch of such a splitter follows this comment). In the meanwhile, even while nothing is upstreamed, your changes are easily usable within the ecosystem.

Using a custom IPLD representation

Utilizing a common existing IPLD codec to represent your file type

Like the above, this is very doable and you can do … Advantages compared to using UnixFS: you can use IPLD to query things like the headers and other format particulars of your file type. People definitely use non-UnixFS IPLD regularly. Historically, people have gotten the most benefit doing this for data types that aren't well expressed as large single flat files (e.g. things with hash links in them, encrypted/signed pieces, etc.). Note: there was an early attempt at such a thing with the …

Utilizing a new IPLD codec to represent your file type

Like the above, you can do … The options here are then to: …
Historically, new codecs have mostly been used for compatibility with existing hash-linked formats (e.g. Git), but there are certainly other use cases as well (e.g. DAG-JOSE). Advantages compared to using a supported codec: you can save some bytes by making more things implied by the codec identifier rather than explicit as fields in some format like DAG-CBOR or DAG-JSON.
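To illustrate the go-ipfs-chunker route mentioned above, here is a minimal sketch of a custom splitter. It assumes the Splitter interface is still the Reader()/NextBytes() pair from that repository (check the current source before relying on it), and the newline-based cutting policy is only a stand-in for a real content-dependent one.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Splitter mirrors the interface published by github.com/ipfs/go-ipfs-chunker
// at the time of writing (Reader + NextBytes); check that repository for the
// current definition before depending on it.
type Splitter interface {
	Reader() io.Reader
	NextBytes() ([]byte, error)
}

// lineSplitter is a toy content-dependent splitter: it cuts chunks at newline
// boundaries, capped at maxSize bytes. A real chunker would use a rolling hash
// or format-aware parsing instead.
type lineSplitter struct {
	r       io.Reader
	br      *bufio.Reader
	maxSize int
}

func NewLineSplitter(r io.Reader, maxSize int) Splitter {
	return &lineSplitter{r: r, br: bufio.NewReader(r), maxSize: maxSize}
}

func (s *lineSplitter) Reader() io.Reader { return s.r }

// NextBytes returns the next chunk, or io.EOF once the input is exhausted.
func (s *lineSplitter) NextBytes() ([]byte, error) {
	buf := make([]byte, 0, s.maxSize)
	for len(buf) < s.maxSize {
		b, err := s.br.ReadByte()
		if err == io.EOF {
			if len(buf) > 0 {
				return buf, nil
			}
			return nil, io.EOF
		}
		if err != nil {
			return nil, err
		}
		buf = append(buf, b)
		if b == '\n' {
			break
		}
	}
	return buf, nil
}

func main() {
	s := NewLineSplitter(strings.NewReader("line one\nline two\nline three\n"), 1<<20)
	for {
		chunk, err := s.NextBytes()
		if err == io.EOF {
			return
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk: %q\n", chunk)
	}
}
```

The idea described above is that chunks produced this way feed into the normal UnixFS DAG building, so the resulting DAG stays readable by any existing IPFS implementation regardless of how the boundaries were chosen.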
@aschmahmann Appreciate the thoughtful comment. While I am studying it in more detail to form concrete proposals, I wanted to make some high-level comments:
In general, as we move toward decentralization of the Internet, we should rely less and less on centralized decisions, especially ones made by a group of people. In that respect, maintainers of the IPFS reference implementations should be rule-setters, not arbiters of merit.
There is a need for a content-dependent chunker.
Therefore I welcome objective, non-opinionated discussion, proofs of concept, benchmarks, and other work around this subject.
The open questions are which technique, which polynomial (if applicable), and which parameters to use for a given chunker, so that it is performant across the board and effectively helps everyone.
There is also the question of which files/data to chunk with such a chunker, given that compressed data would likely not benefit at all from content-dependent chunking (unless the file is an archive with non-solid compression, or similar). Should this be decided automatically by some heuristic, e.g. use the content-dependent chunker for text files that are not minified js/css and fall back to the regular chunker otherwise? By file headers? (A sketch of a header-based heuristic follows the references below.)
This could have a great impact on (distributed) archival of knowledge (think Archive.org, except with dedup, better compression, and easy distribution). It also raises the question of whether chunks should be stored compressed, but that is partially side-tracking this issue.
One reference implementation with a focus on storage savings (faster convergence of chunk boundaries):
https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.h
Other references:
https://en.wikipedia.org/wiki/MinHash
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
https://en.wikipedia.org/wiki/Rolling_hash
https://moinakg.github.io/pcompress/
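As a strawman for the "decide by file headers" question above, here is a hedged sketch of such a heuristic: detect already-compressed formats by their magic bytes and give them a plain fixed-size chunker, while everything else gets a content-defined one. The returned strings mimic the go-ipfs --chunker syntax, but the concrete size values are only examples.

```go
package main

import (
	"bytes"
	"fmt"
)

// Magic-byte prefixes of formats that are already compressed, where a
// content-defined chunker is unlikely to find reusable boundaries.
var compressedMagic = [][]byte{
	{0x1f, 0x8b},             // gzip
	{'P', 'K', 0x03, 0x04},   // zip (non-empty archive)
	{0x28, 0xb5, 0x2f, 0xfd}, // zstd frame
	{0x89, 'P', 'N', 'G'},    // png
	{0xff, 0xd8, 0xff},       // jpeg
}

// pickChunker returns an illustrative chunker spec based on the file header.
// The strings mimic the go-ipfs --chunker syntax; the values are examples.
func pickChunker(header []byte) string {
	for _, magic := range compressedMagic {
		if bytes.HasPrefix(header, magic) {
			// Already-compressed data: content-defined boundaries buy little,
			// so a plain fixed-size splitter is good enough.
			return "size-262144"
		}
	}
	// Everything else: try a content-defined (rolling-hash) chunker.
	return "rabin-16384-262144-1048576"
}

func main() {
	fmt.Println(pickChunker([]byte{0x1f, 0x8b, 0x08}))       // gzip header
	fmt.Println(pickChunker([]byte("plain text goes here"))) // uncompressed text
}
```

A real implementation would also need to handle archives with non-solid compression, as noted above, where content-defined chunking can still pay off despite the compressed container.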