When chunking sparse files, the chunker converges on an "idle tone" for runs of zeroes roughly >= 2 chunks long, i.e. the same all-zero chunk is referenced again and again.
When extracting, these chunks are fetched over and over, and also decrypted, checked etc., making extraction slower than it has to be.
Suggestions:
An LRUCache mapping (chunk-id,) -> (length,) whose express purpose is to store all-zero chunks when --sparse is used. This needs a bit of work in extract_file and in the DownloadPipeline. As usual, preload_ids may make this harder to implement (hence this issue, so the idea doesn't get buried in my stack of notes). If we figure out "borg extract: add --continue flag" #1665, this shouldn't be hard, since that is basically the same problem regarding preload.
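A minimal sketch of that cache idea, assuming a plain dict-backed LRU and illustrative names (ZeroChunkCache, zero_cache and pipeline.fetch are not Borg's actual API):

```python
# Minimal sketch (not Borg's actual API): remember all-zero chunks by ID so that
# repeated occurrences can be written as holes without re-fetching/decrypting.
from collections import OrderedDict

class ZeroChunkCache:
    """LRU map of chunk-id -> length, only for chunks known to be all zeroes."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, chunk_id):
        length = self._data.get(chunk_id)
        if length is not None:
            self._data.move_to_end(chunk_id)   # mark as recently used
        return length

    def put(self, chunk_id, length):
        self._data[chunk_id] = length
        self._data.move_to_end(chunk_id)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict least recently used

# Hypothetical use inside extract_file when --sparse is active:
#     length = zero_cache.get(chunk_id)
#     if length is not None:
#         fd.seek(length, os.SEEK_CUR)         # punch a hole instead of writing
#     else:
#         data = pipeline.fetch(chunk_id)      # fetch + decrypt + verify once
#         if data == bytes(len(data)):
#             zero_cache.put(chunk_id, len(data))
```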
An entirely different way would be to make this work transparently in the DownloadPipeline, by collapsing runs of the same chunk ID and noting the number of repetitions (i.e. run-length coding), then yielding the repeated chunks locally. On second thought, this may be the better implementation path.
Preload still has to be considered, but on the plus side this works for any kind of repetition, not just zeroes or sparse files, and the DownloadPipeline generally feels like a more apt abstraction layer for this optimization.
Preload may be solvable differently than in "borg extract: add --continue flag" #1665, by doing the same RLE already in fetch_many, so the preload is never submitted for repeated chunks in the first place. A sketch of the idea follows.
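A rough sketch of that RLE idea; fetch_many_rle and fetch_one are made-up names standing in for the DownloadPipeline interface, not the real one:

```python
# Sketch: collapse runs of identical chunk IDs so each run costs one remote
# fetch + decrypt + verify, while repetitions are served locally.
from itertools import groupby

def fetch_many_rle(ids, fetch_one):
    """Yield chunk data for every id in `ids`, fetching each run of equal ids once."""
    for chunk_id, run in groupby(ids):
        count = sum(1 for _ in run)   # run length of this chunk ID
        data = fetch_one(chunk_id)    # one round trip per run
        for _ in range(count):
            yield data                # repetitions come from memory

# Applying the same grouping before preloading would avoid queueing
# duplicate ids in the first place, e.g.:
#     preload_ids = [chunk_id for chunk_id, _ in groupby(ids)]
```

groupby only collapses adjacent identical IDs, which is exactly the run-length case described above and keeps the output order unchanged.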
enkore changed the title from "cache sparse (zero) chunks" to "cache sparse (zero) chunks to speed up extracting sparse files" on Oct 3, 2016.
enkore changed the title from "cache sparse (zero) chunks to speed up extracting sparse files" to "better handling of repeated chunks to speed up extracting sparse files" on Oct 3, 2016.
It should. We still do the ID check though, which limits speed to something like 300-400 MB/s instead of being effectively unbounded. With other repeated chunks, compression may be a different story, but I'm not sure whether that case is significant at all.
Note: borg create now uses such an LRUCache, mapping (hashalgo, size) -> hashvalue.
I decided to add the hashalgo to play it a bit safer, although currently it is likely always the same for a normal borg CLI invocation (but maybe not for unit tests or future use).
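A toy illustration of that (hashalgo, size) -> hashvalue mapping for all-zero buffers; Borg's real chunk ID hash is keyed, so plain hashlib and the function names here are only stand-ins:

```python
# Sketch: hash `size` zero bytes at most once per (hashalgo, size) pair,
# so long runs of zero chunks in borg create are not rehashed repeatedly.
import hashlib
from functools import lru_cache

@lru_cache(maxsize=8)
def zero_hash(hashalgo, size):
    """Digest of `size` zero bytes for the given hash algorithm."""
    return hashlib.new(hashalgo, bytes(size)).digest()

def id_hash_sparse_aware(hashalgo, data):
    # Only consult the cache for all-zero data; everything else is hashed normally.
    if data.count(0) == len(data):            # cheap all-zero test for bytes
        return zero_hash(hashalgo, len(data))
    return hashlib.new(hashalgo, data).digest()
```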