
backhand: Remove duplicate data #594

Merged 2 commits into master from 223-duplicate-data-should-be-removed on Aug 29, 2024

Conversation

@wcampbell0x2a (Owner) commented Aug 25, 2024

  • Add a cache to DataWriter that, when enabled, remembers the length and a hash of each previously written file and reuses that data instead of writing a matching file to disk again (see the sketch after this list).
  • Add set_no_duplicate_files to FilesystemWriter to make this behavior configurable.
  • Make the superblock Flags public.
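
The thread doesn't include the implementation, but the idea reads roughly like the following sketch: key a map by the file's length plus a content hash, and on a hit hand back the location of the already-written data instead of writing it again. All names here (`DupCache`, `WrittenLocation`, `write_or_reuse`) are hypothetical stand-ins, not backhand's actual internals.

```rust
use std::collections::HashMap;

/// Where a file's data already lives in the output image
/// (fields are illustrative, not backhand's real layout).
#[derive(Clone)]
struct WrittenLocation {
    blocks_start: u64,
    block_sizes: Vec<u32>,
}

/// Cache keyed by (file length, content hash). Two files only share an
/// entry if both values match, so a stray hash collision between files
/// of different sizes can never cause data to be shared incorrectly.
#[derive(Default)]
struct DupCache {
    seen: HashMap<(u64, u64), WrittenLocation>,
}

impl DupCache {
    /// Returns the existing location for identical data, or runs `write`
    /// and remembers where the newly written data ended up.
    fn write_or_reuse<F>(&mut self, len: u64, hash: u64, write: F) -> WrittenLocation
    where
        F: FnOnce() -> WrittenLocation,
    {
        self.seen.entry((len, hash)).or_insert_with(write).clone()
    }
}
```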

TODO

  • Maybe hash only the first 32 bytes for a quick comparison before hashing the entire file (see the prefix-hash sketch after this list)
  • Clean up code quality
  • Add a test that adds a file which already exists; the image should be the same size? (might want to check that just_copy_it works)
  • Use https://github.com/paritytech/nohash-hasher (some other hashes I use could use this!)
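
The first TODO item suggests a cheap prefilter before paying for a full-content hash. A minimal sketch of that idea, assuming plain `std` hashing (the merged PR may hash whole files directly; `prefix_hash`, `full_hash`, and `maybe_duplicate` are illustrative names, not backhand APIs):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash only the first 32 bytes; a cheap filter before a full hash.
fn prefix_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data[..data.len().min(32)].hash(&mut hasher);
    hasher.finish()
}

/// Full-content hash, only computed when the cheap checks pass.
fn full_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Candidate filter: require equal lengths and equal 32-byte prefix
/// hashes before hashing both files in full.
fn maybe_duplicate(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len()
        && prefix_hash(a) == prefix_hash(b)
        && full_hash(a) == full_hash(b)
}
```

The prefix hash rejects most non-duplicates after reading only 32 bytes, so the full hash is paid only for likely matches.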


Benchmark for d58f903

| Test | Base | PR | % |
|---|---|---|---|
| only_read/netgear_ax6100v2 | 2.4±0.04ms | 2.4±0.00ms | 0.00% |
| only_read/tplink_ax1800 | 6.2±0.01ms | 6.2±0.01ms | 0.00% |
| unsquashfs/full | 11.1±0.16ms | 11.0±0.19ms | -0.90% |
| unsquashfs/full-path-filter | 7.7±0.10ms | 7.7±0.10ms | 0.00% |
| unsquashfs/list | 8.1±0.13ms | 8.1±0.15ms | 0.00% |
| unsquashfs/list-path-filter | 7.2±0.13ms | 7.2±0.10ms | 0.00% |
| write_read/netgear_ax6100v2 | 1287.8±7.45ms | 1291.3±10.50ms | +0.27% |
| write_read/tplink_ax1800 | 7.1±0.04s | 7.2±0.05s | +1.41% |


Benchmark for 866e857

| Test | Base | PR | % |
|---|---|---|---|
| only_read/netgear_ax6100v2 | 2.4±0.00ms | 2.3±0.00ms | -4.17% |
| only_read/tplink_ax1800 | 6.3±0.01ms | 6.2±0.01ms | -1.59% |
| unsquashfs/full | 10.9±0.13ms | 10.9±0.50ms | 0.00% |
| unsquashfs/full-path-filter | 7.6±0.06ms | 7.6±0.05ms | 0.00% |
| unsquashfs/list | 7.9±0.07ms | 7.9±0.09ms | 0.00% |
| unsquashfs/list-path-filter | 7.1±0.04ms | 7.0±0.04ms | -1.41% |
| write_read/netgear_ax6100v2 | 1277.2±3.77ms | 1282.6±5.90ms | +0.42% |
| write_read/tplink_ax1800 | 7.0±0.01s | 7.0±0.01s | 0.00% |

@wcampbell0x2a force-pushed the 223-duplicate-data-should-be-removed branch from 8dc2129 to 6a10006 on August 29, 2024 02:17

Benchmark for f9cb352

| Test | Base | PR | % |
|---|---|---|---|
| only_read/netgear_ax6100v2 | 2.4±0.00ms | 2.3±0.00ms | -4.17% |
| only_read/tplink_ax1800 | 6.2±0.00ms | 6.2±0.01ms | 0.00% |
| unsquashfs/full | 10.9±0.13ms | 10.9±0.16ms | 0.00% |
| unsquashfs/full-path-filter | 7.6±0.06ms | 7.6±0.06ms | 0.00% |
| unsquashfs/list | 8.2±0.37ms | 7.9±0.07ms | -3.66% |
| unsquashfs/list-path-filter | 7.1±0.10ms | 7.0±0.08ms | -1.41% |
| write_read/netgear_ax6100v2 | 1276.1±5.81ms | 1275.0±2.06ms | -0.09% |
| write_read/tplink_ax1800 | 7.2±0.01s | 7.1±0.04s | -1.39% |

* Add a cache to DataWriter that, when enabled, remembers the file length
  and a hash of a previously written file and uses that instead of writing
  the matching file to disk again.
* Add set_no_duplicate_files to FilesystemWriter to make this configurable.
* Make the superblock Flags public.
* We either already have a hash, or just want an int mapping (see the
  nohash-hasher sketch below).
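
That last note is what paritytech/nohash-hasher addresses: when the key is already a hash (or a plain integer id), the map can use a pass-through hasher instead of hashing the hash a second time. A minimal sketch, assuming the `nohash-hasher` crate is added as a dependency (the map contents here are made up for illustration, not backhand's actual fields):

```rust
use nohash_hasher::IntMap;

fn main() {
    // Keys are already hashes (or plain integer ids), so the map's
    // pass-through hasher avoids hashing the hash again.
    let mut by_hash: IntMap<u64, &str> = IntMap::default();
    by_hash.insert(0xdead_beef_u64, "already-written file");

    if let Some(entry) = by_hash.get(&0xdead_beef_u64) {
        println!("cache hit: {entry}");
    }
}
```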
@wcampbell0x2a force-pushed the 223-duplicate-data-should-be-removed branch from 6a10006 to 6968ae3 on August 29, 2024 03:18

Benchmark for d1c6030

| Test | Base | PR | % |
|---|---|---|---|
| only_read/netgear_ax6100v2 | 2.3±0.00ms | 2.3±0.00ms | 0.00% |
| only_read/tplink_ax1800 | 6.2±0.05ms | 6.2±0.01ms | 0.00% |
| unsquashfs/full | 11.0±0.25ms | 11.0±0.17ms | 0.00% |
| unsquashfs/full-path-filter | 7.7±0.06ms | 7.6±0.06ms | -1.30% |
| unsquashfs/list | 8.0±0.07ms | 8.0±0.13ms | 0.00% |
| unsquashfs/list-path-filter | 7.1±0.07ms | 7.2±0.13ms | +1.41% |
| write_read/netgear_ax6100v2 | 1271.4±1.29ms | 1275.0±5.79ms | +0.28% |
| write_read/tplink_ax1800 | 7.0±0.01s | 7.1±0.03s | +1.43% |

@wcampbell0x2a merged commit 341a89e into master on Aug 29, 2024
38 checks passed
@wcampbell0x2a deleted the 223-duplicate-data-should-be-removed branch on August 29, 2024 22:45