
better handling of repeated chunks to speed up extracting sparse files #1678

Open
enkore opened this issue Oct 3, 2016 · 4 comments

@enkore
Contributor

enkore commented Oct 3, 2016

When chunking sparse files the chunker will converge on an "idle tone" for runs of zeroes ~>= 2 chunks.

When extracting, these chunks are fetched over and over again, and also decrypted, checked, etc., making extraction slower than it has to be.

Suggestions:

  1. An LRUCache mapping (chunk-id,) -> (length,) whose express purpose is to store all-zero chunks when --sparse is used. This needs a bit of work in extract_file and in the DownloadPipeline. As usual, preload_ids may make this harder to implement (hence creating this issue, so it doesn't get buried in my stack of notes). If we figure out #1665 (borg extract: add --continue flag), this shouldn't be hard - it's basically the same problem description regarding preload.

  2. An entirely different way to do this would be to make it work transparently in DownloadPipeline, by collapsing runs of the same chunk ID and noting the number of repetitions (i.e. run-length coding), then yielding the repeated chunks locally. On second thought this may be a much better implementation path.

    Preload still has to be considered, but on the plus side this works for any kind of repetition, not just zeroes or sparse files, and generally feels like DownloadPipeline is a more apt abstraction layer for this optimization.

    Preload may be solvable differently than in #1665 (borg extract: add --continue flag), by doing the same RLE already in fetch_many, so the preload for repeated chunks is not submitted in the first place.
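The run-length-coding idea in suggestion 2 can be sketched as follows. This is a hypothetical illustration, not borg's actual code: `fetch_chunk` and `fetch_many_rle` are invented stand-ins for the real fetch/decrypt/verify path in DownloadPipeline.

```python
from itertools import groupby

calls = []  # records which chunk IDs were actually fetched (for illustration)

def fetch_chunk(chunk_id):
    # stand-in for the real fetch + decrypt + verify path
    calls.append(chunk_id)
    return b"\x00" * 4  # pretend every chunk holds 4 zero bytes

def fetch_many_rle(chunk_ids):
    """Yield chunk data for chunk_ids, fetching each run of equal IDs once."""
    for chunk_id, run in groupby(chunk_ids):
        data = fetch_chunk(chunk_id)  # one remote fetch per run
        for _ in run:
            yield data                # replay the data locally (RLE expansion)

# Seven requested chunks, but only four distinct runs to fetch:
ids = ["a", "z", "z", "z", "b", "z", "z"]
chunks = list(fetch_many_rle(ids))
```

Note that this collapses any repetition, not only the all-zero "idle tone" chunks, which is exactly why the DownloadPipeline layer seems like the right place for it.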

@enkore enkore changed the title cache sparse (zero) chunks cache sparse (zero) chunks to speed up extracting sparse files Oct 3, 2016
@enkore enkore changed the title cache sparse (zero) chunks to speed up extracting sparse files better handling of repeated chunks to speed up extracting sparse files Oct 3, 2016
@ThomasWaldmann
Member

Yes, that is not very efficient - but is it really a problem? With compression, that all-zero chunk should be tiny.

@enkore
Contributor Author

enkore commented Oct 3, 2016

It should. We still do the ID check, though, which limits speed to something like 300-400 MB/s instead of ~infinity. With other repeated chunks, compression may be a different story, but I'm not sure whether this is significant at all.

@ThomasWaldmann
Member

See also #1354 and #14.

@ThomasWaldmann
Member

note: borg create now uses such an LRUCache, mapping (hashalgo, size) -> hashvalue.

I decided to include the hashalgo in the key to play it a bit safer, although it is currently likely always the same for a normal borg cli invocation (but maybe not for unit tests or future use).
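The (hashalgo, size) -> hashvalue idea can be illustrated with a small memoization sketch. This is hypothetical code, not borg's implementation (borg has its own LRUCache class); here the standard-library `lru_cache` stands in for it, and `zero_chunk_hash` is an invented name.

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=16)
def zero_chunk_hash(hashalgo, size):
    """Digest of an all-zero chunk, computed once per (hashalgo, size)."""
    return hashlib.new(hashalgo, bytes(size)).hexdigest()

# First call computes the digest; the second is a cache hit,
# so repeated zero runs of the same size are hashed only once.
h1 = zero_chunk_hash("sha256", 4096)
h2 = zero_chunk_hash("sha256", 4096)
```

Keying on the algorithm as well as the size means a cached digest can never be returned for the wrong hash function, even if multiple algorithms are ever in play at once.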
