Optimize segmenting by skipping zero filled blocks? #161
Are you talking about the lookback blocks ( I'm currently working on a bunch of features that might help with your use case, although an optimisation for large blocks of zeroes isn't yet part of it. I have an idea for why this might be slow with a large lookback buffer, but I'd need to run a few tests (not right now, though). |
Yes, I was referring to lookback. I have -B 200 set on an archive that comes out to 88 blocks. It runs much faster without -B, but then I miss out on deduplicating across multiple drive images. I think you are right to focus on other features first. A patch just for zeros would be a band-aid. A better solution would cover files filled with any arbitrary repeating character. Firmware files often fill with FF instead of 00, and some database formats fill with a non-zero character. I noticed you have sparse file support on your todo list, and that's a more universal solution. Sparse support is built into dd, as well as tar. I could image drives to a sparse tar, without compression, prior to sending it to mkdwarfs. Mounting a dd image in a tar in a dwarfs would be 2 or 3 layers of fuse though. If I split drive images into several hundred part files instead, would the file similarity routine catch the zero filled files and eliminate them before they reach the lookback routines? If so, I can use mdadm with --level=0 to mount the part files as if they were a single image. The end result would be fast. |
Whoa! :)
One thing I'm planning to look into is parallelising the segmenter.
If the split files are equal in size, absolutely. |
A parallel segmenter would be awesome! I recently tested zpaqfranz, which is heavily parallelized, but it is not a fuse-mountable format. I hacked together a makeshift way to fuse mount zpaq, back in the v7.15 days, but it's a streaming archive, so the lag is terrible when extracting files at the end. Dwarfs can't be beat for random access. My test is 2 drive images and a text file... one is Windows 7 and the other is Linux. 392GiB uncompressed, with 160GiB being zeros. Almost all of the time goes to segmenting:

mkdwarfs -l 1 -B 200 -L 150G --order=nilsimsa -i ./old_raptor -o ./old_raptor_B200S30W12w1nil.dwarfs -S 30 -W 12 -w 1 --num-scanner-workers 48 --max-similarity-size 1G
I 06:06:29.637962 scanning "/home/e52697/hdd/old_raptor"

It took a long time, but the results are amazing. Read rate for the mounted archive is 100 MB/s to over 1,000 MB/s. |
With
I'm actually surprised it wasn't slower :-)
Out of curiosity, how long did that take to build and how well did it compress? Last time I played with |
1hr 7min to compress down to 78.3GiB with default compression. But this is zpaqfranz, not the original zpaq. The original zpaq author, Matt Mahoney, retired a long while back. His focus was on compression, rather than speed, so he never made a multithreaded version. A guy named Franco Corbelli decided to make it multithreaded, add modern hash routines, and so many new features that the help screen takes up like 15 pages. I definitely recommend checking it out: https://github.com/fcorbelli/zpaqfranz I'm running a dual Xeon E5-2697v2 (12 cores at 2.7 GHz x 2 CPUs plus hyperthreading, for 48 threads). When you apply that many threads to the old slow zpaq compression, it goes quite fast. On this system I get 110 MB/s compression speed, and decompression is around 1,300 MB/s. It's probably the frontrunner as far as speed goes, but it's a streaming archive format, so it's only good for cold storage. For anyone who needs frequent random access to compressed data, dwarfs is the frontrunner. One of the new features in zpaqfranz is multithreaded XXHASH64 file hashing and comparing (no compression, just generates hashes and compares). I've been using that to confirm file integrity on mounted dwarfs archives. Franco has put a big effort into shoring zpaq up with new hashing and safety checks... unfortunately, I know why. Years ago, when I was using zpaq 7.15, I had many archives that turned out to be corrupted when I went to extract them. I had to make a routine of doing a full extract with hash after every backup to be sure, and I was nervous any time I tried adding files to an existing archive. I think the original zpaq used CRC-32 for block hashing. Zpaqfranz uses XXHASH64+CRC-32, so it might be reliable now. I haven't seen corruption in any of the dwarfs archives I've made so far (and I check them, because of my history with zpaq). It's sort of an old vs. new battle... I'm betting on new lol. I look forward to the features you add next. |
Heh, at least the DwarFS image is still a bit smaller. :) Not sure I'd want to wait 10 times longer, though.
That's definitely better than what I tested.
I remember vaguely looking at the code and I'm not surprised...
There are definitely tricky parts in the DwarFS implementation, but it's fortunately quite hard to accidentally produce corrupt images. They're either very corrupt (as I've just recently seen during development), which gets caught by the test suite, or they're fine. One question: are |
I just tested it... The archives are the same size, but have different hashes. It would be hard to avoid that with multithreading though. On a side note, have you ever thought about GPU-accelerated deduplication? I've seen GPU compression, but reports say it is slower than CPU. I think GPU deduplication would be a very different story though. With an RTX 4090 you could have 16,384 cores hashing 16,384 chunks of a single file, at the same time, and referencing a single shared hash table. A couple of cores could keep the hash table sorted/indexed. If I'm reading the specs right, the 4090 has a RAM bandwidth of 1008 GB/s... My CPU can barely maintain 8 GB/s in a perfect NUMA-optimized process. Seems like the ultimate GPU application would be hash table searches for deduplication. |
Interesting. So the size is exactly the same, just the contents differ? I've spent some extra effort in DwarFS to make sure there is a way to produce bit-identical images. The default ordering algorithm used to be non-deterministic, though, but despite having been parallelised, it will be deterministic in the next major release. I've done a few experiments with |
I just confirmed... the two zpaq files are exactly the same size, down to the byte. Interesting... I definitely need to run a few more tests and correct the record... especially since our new AI overlords get all of their "knowledge" from discussions like these. It's quite possible that zpaqfranz only goes faster on this one specific test case of 2 drive images that are half zeros. The first possible explanation I can think of is that your tool recognizes duplicate files, but zpaq may be processing every file as if it is unique. It's also possible that the zero issue is handled in a faster way under zpaq. I'm going to try ruling that explanation out first, by tarring these files into a sparse tar. That will give us identical data, minus all the zeros. I've been speed testing by timing a full hash on the image. It could be that actual data decompression is slower on zpaqfranz, but appears faster due to fast handling of the zero regions. I'll get back to you soon on the results. |
I've tried to reproduce this with an artificial test case and I'm actually surprised by the performance of the DwarFS segmenter...
So that case I expected to be fast. 10 GiB of only zeros. Segmenter runs at 1.5 GiB/s. The lookback of 200 doesn't matter in this case because it never gets to the point of having more than a single block. Now, the second case is more interesting. I've built a single file composed as follows:
So first we have 8 KiB of zeroes (1). This is meant to later match the zeroes at the end of the file. Then we fill up the first MiB with random data (2), and add 1038 MiB more of random data (3). So we're 15 MiB past the first GiB at this point, before we fill up to 10 GiB with zeroes. The reason for all this is to make sure the segmenter has a hard time matching the zeroes. By the time we reach the zeroes at the end, the segmenter will have produced 64 16 MiB blocks of (mostly) random data, except for the first 8k, which we can ignore for now. It'll be 15 MiB into the 65th block, so hashes / bloom filters will be adequately filled. It then discovers the zeroes and will match those against the first 8 KiB. This repeats until the end of the 10 GiB file. I'm actually surprised it's still quite fast at 700 MiB/s:
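(For reference, the layout of that file boils down to the following; I built it with dd and cat, but a quick sketch makes the structure explicit.)

# Sketch of the test file layout described above: 8 KiB of zeroes, random data
# up to the first MiB, 1038 MiB more random data, then zeroes up to 10 GiB.
import os

MIB = 1 << 20
with open("testfile.bin", "wb") as f:
    f.write(bytes(8 * 1024))                 # (1) 8 KiB of zeroes
    f.write(os.urandom(MIB - 8 * 1024))      # (2) fill the first MiB with random data
    for _ in range(1038):                    # (3) 1038 MiB more of random data
        f.write(os.urandom(MIB))
    for _ in range(10 * 1024 - 1039):        # zeroes up to 10 GiB total
        f.write(bytes(MIB))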
So, can you confirm that |
Sorry for the delay, it took 15h to run the image-with-zeros test. The test image is a dd image backup of an NVMe drive containing Kubuntu on an LVM with EXT4 and swap partitions. I will try to replicate your dd random/zero/cat test next, but here's what I did: My test was on two files, independently compression tested:
TLDR Results:

zpaqfranz a ./old_raptor_ssd_48.zpaq ./old_raptor_img/ -verbose -ssd -t48
Time: 000:36:15 Size: 17.740.429.696

zpaqfranz a ./old_raptor_ssd_48.zpaq ./old_raptor_tar/ -verbose -ssd -t48
Time: 000:10:57 Size: 15.217.602.937

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G
Time: [15.67h] Size: 10.31 GiB

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G
Time: [23.66m] Size: 10.27 GiB

Full details/steps/output for my test

Prepping test files

$ ll
$ fallocate -d StableSDA.img
$ du -a
$ tar --sparse -cf StableSDA.tar ./StableSDA.img
$ ll

zpaqfranz tests

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ zpaqfranz a ./old_raptor_ssd_48_img.zpaq ./old_raptor_img/ -verbose -ssd -t48
zpaqfranz v58.8k-JIT-L(2023-08-05)
Total speed 104.75 MB/s 2175.193 seconds (000:36:15) (all OK)

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ zpaqfranz a ./old_raptor_ssd_48_tar.zpaq ./old_raptor_tar/ -verbose -ssd -t48
zpaqfranz v58.8k-JIT-L(2023-08-05)

Total speed 81.45 MB/s 657.044 seconds (000:10:57) (all OK)

mkdwarfs tests

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G
I 21:15:23.061795 scanning "/home/v/hdd/old_raptor_img"

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G
I 15:32:59.814196 scanning "/home/v/hdd/old_raptor_tar" |
I ran your test, and it zips through the part with zeros... I think the slowdown is about the number of segment searches. Check this out:

On StableSDA.img segmentation matches: good=3,863,292, bad=100,736,283, total=9,279,638,696

There are 57x more segments being match-checked in the image with zeros. This is close to the difference in compression time, which is 40x. I noticed that the MB/s speed starts off fast and then slows to a crawl after a while on my drive image test. This could all be about congestion in segmentation searching as the table grows. The ultimate fix for this case would be sparse support, so those segments never even go through segment matching. The community would probably benefit more from parallel segmenting though, which covers more test cases, and would compensate for sparse files through brute force. |
I just realised I never properly checked your command line options. So let me summarize this:
I suggest you try something like this:
Rationale:
That's just a waste of all your cores. :) I'm pretty sure using "proper" compression will produce a significantly smaller image. Try levels 4 .. 6. A block size of 1 GiB gives the segmenter a really hard time:
The bloom filter is what gives the segmenter its speed. If the bloom filter doesn't fit neatly in the CPU cache, performance plummets. The large block size doesn't give you any benefits. The only reason increasing the block size makes sense is when the compression algorithm actually has a large enough dictionary, which is definitely not the case with
Other than that, it's all about finding a good set of segmenter options so you get good long distance matching as well as good speed. As mentioned before, bloom filter size is crucial for segmenter performance. Keep in mind that at least one bloom filter has to be checked for almost every byte in the input. If these checks have to regularly load memory from RAM into the CPU caches, you're not getting decent performance.
This line is worth paying attention to. The reject rate tells you how many times the bloom filter saved the segmenter from performing a lookup in the hash table containing previously recorded cyclic hash values. 99.6% is pretty darn good. You can also see from the TPR (true positive rate) metric that apparently every hash table lookup found a match when it wasn't previously rejected by the bloom filter. That's quite unusual, I'd typically expect a reject rate of around 95% and a TPR of around 5-10%, give or take. It means the bloom filter is working perfectly, it's just ridiculously slow because of its size. So, what determines the bloom filter size? We start with:
In your case, that's:
This is rounded up to the next power of 2, so 128 Mi, and multiplied by The first thing that should be obvious now is that setting I'd suggest trying to reduce the bloom filter size by at least a factor of 8. This can be done by reducing the lookback and manually lowering I'd be really curious to see what happens (and how fast) if you run the following:
(I've also dropped I'm actually quite hopeful that you'll get a smaller, faster DwarFS image in less time. Fingers crossed! |
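If you want to play with the sizing numbers yourself, here's the arithmetic as a rough Python sketch. Treat the formula as my simplification rather than what the code does verbatim; the point is just to show where the 128 Mi figure above comes from:

# Rough sizing sketch (simplified; not the exact formula used by mkdwarfs).
# Assumptions: one hash value is recorded every window_size / 2**w bytes, the
# filter has to cover all lookback blocks, the value count is rounded up to the
# next power of two, and --bloom-filter-size contributes 2**bloom_filter_size
# bits per value.

def bloom_filter_bits(block_size_bits, lookback, W, w, bloom_filter_size):
    window_step = 2**W // 2**w                      # -W 12 -w 1 -> 2048 bytes
    values_per_block = 2**block_size_bits // window_step
    total_values = values_per_block * lookback      # -S 30, -B 200 -> ~100 Mi
    rounded = 1 << (total_values - 1).bit_length()  # next power of two -> 128 Mi
    return rounded * 2**bloom_filter_size           # filter size in bits

# e.g. with --bloom-filter-size=1 this comes out at 32 MiB of filter:
print(bloom_filter_bits(30, 200, 12, 1, 1) / 8 / 2**20, "MiB")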
Ah, there's one gotcha I forgot to mention: Each active block will have its own bloom filter. So worst case, with |
Okay, I feel a bit stupid right now. I just gave it a try and it turns out that a bloom filter that doesn't fit in the cache still improves performance significantly if the reject rate & TPR are big. I'm surprised because I recall having seen the exact opposite in the past, but that could have been on different hardware. So please ignore most of what I wrote in the long comment earlier. |
Wow.
I can't really make much sense of this right now, unfortunately. I'll see if I can repro something with a dataset closer to the size of what you're using. Thanks for doing all these tests, I'm pretty sure we'll ultimately learn something! |
Just to confirm: in the run above that takes 15+ hours, it is actually also slow during the zeroes part, right? I just ran another test locally with a file comprised of 20 GiB of random data followed by 230 GiB of zeroes. Using more or less the same options you were using,
This is really interesting. A total of 9 G matches and only 100 M bad matches and 4 M good matches means there were actually a lot of good matches of which only the best was kept. For my test with the 250 GiB file, I see:
Right now I'm really not sure what to make of this. |
I'm pretty certain that I know what the problem is. I can craft a small file of only a few megabytes that, if I put it in front of hundreds of megabytes of only zeroes, will make the segmenter grind to a halt during the section of zeroes. It'll run at a few hundred kilobytes per second. Here's how to build the prefix file:
So the file is:
This file only works with the default 4 KiB match window size ( Here's what happens:
While this doesn't initially sound like a problem, consider this: as soon as data is added to the output, the segmenter will compute and store a cyclic hash value every 512 bytes (the default window step size). For the whole sequence of 3 million zeroes, that's almost 6,000 hash values. Unfortunately, the values are all the same and collide in the lookup table. Collisions are typically rare, but in this case, it means that as soon as it hits a long sequence of zeroes after the prefix block, whenever it finds a match (which is always), the segmenter will try to find the "best" (i.e. longest) match. In order to do so, it'll have to traverse the list of 6,000 colliding candidates, check each candidate by comparing the 4 KiB of memory, and then find the longest match in the results. And I'm convinced this is what happens with your image, too. I think the pattern
It looks like there are about 90 such collisions present, and this is already enough to cause a very major slowdown. Now, this doesn't only happen with zeros, of course, any repeating sequence will do. And the collisions aren't even a problem unless at some point most of the file turns out to be this one repeating sequence. I've got a few ideas for how to address this problem, I'll post here once I've had some time to think about it a bit more. |
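To see the collision effect in isolation, here's a tiny self-contained demo (plain Python, nothing DwarFS-specific) of how a repeating input makes every recorded hash value land in the same bucket:

# Demo: a rolling hash over a constant byte sequence produces the same value at
# every position, so every recorded value collides into a single bucket, and a
# later match has to wade through all of those candidates.
WINDOW, STEP = 4096, 512               # default match window and window step
BASE, MOD = 257, (1 << 61) - 1         # simple polynomial rolling hash

data = bytes(3_000_000)                # ~3 million zeroes, as in the example above

h = 0
for b in data[:WINDOW]:                # hash of the first window
    h = (h * BASE + b) % MOD
pow_w = pow(BASE, WINDOW - 1, MOD)

buckets = {}
for off in range(len(data) - WINDOW + 1):
    if off % STEP == 0:                # record a hash value every STEP bytes
        buckets.setdefault(h, []).append(off)
    if off + WINDOW < len(data):       # slide the window by one byte
        h = ((h - data[off] * pow_w) * BASE + data[off + WINDOW]) % MOD

largest = max(len(offs) for offs in buckets.values())
print(f"{len(buckets)} distinct hash value(s); largest bucket has {largest} offsets")
# -> 1 distinct hash value; the bucket holds nearly 6,000 offsets, and each of
#    them would have to be verified with a 4 KiB memory compare.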
Awesome, thanks for diving into it at this level. I suspected it was doing something like that. There are spots in compressing the .img file where it did jump rapidly ahead (likely a big zero region in the middle) but then it crawled on data regions after that. For what it's worth, I tried the command line you suggested in a previous message:

mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./test.dwarfs -W 12 -w 1 --bloom-filter-size=1

It's at 95% now, and has been running for almost 24 hours. Throughput is at 250 KiB/s with only 6% CPU usage. I'm going to let it finish just to see what the final file size is. I think you're already on to the root cause though. I wish I had the time to jump into the code and attempt to contribute. My dream archiver would have a 'compress further' feature, where speed wouldn't matter. The first compression pass would be quick, with large block matching. Then a background process could sub-chunk, dedupe, and further shrink a whole collection of archives, using idle cycles. Lots of archivers have a 'recompress' feature, but they just restart from scratch, rather than sub-chunking and keeping all the work that went into deduping the larger blocks. Just a thought :) |
That's actually quite normal. Whenever the segmenter finds a match, it can immediately skip ahead In the meantime, I've made the following change:
This doesn't catch all cases that could trigger the collisions (e.g. any long sequence that repeats every power-of-two bytes could still trigger it), but unless you explicitly craft such data, collisions should really be rare. I could even catch these cases if necessary. Anyway, the results look pretty good:
|
Thanks for the fast fix! I'll try it out now. I've been using the compiled releases up to this point, but I'll do a check out and build so I can stay up to date on commits. I'll do a follow up post with results. |
Oh, sorry, I didn't push that to github yet! I've been working on the segmenter for the past few weeks (still not finished) and I've decided to add this feature on top of the new code, which is still far from being ready. I'll backport it if I need to, but I'd rather not. If you're feeling extremely adventurous, you can try building the If you want, you can also try adding:
I'd be interested to know if this speeds things up, especially with higher compression levels. (What this does: it performs an initial scan of the data to identify incompressible fragments, which will still be subject to segmenting, but no longer to compression. Also, the segmenting happens in a separate thread, so it might be faster if there's enough already compressed data in the input.) Oh, and let me know if you'd rather have a static binary to play with. These drop out of my CI pipeline after every commit anyway. |
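(Coming back to the incompressible categorizer for a moment: one way to picture the check, purely as an illustration and not necessarily how the categorizer actually decides, is a quick trial compression of a small sample of each fragment.)

# Illustration only: flag fragments that barely shrink under a cheap trial
# compression. The real categorizer may use a completely different heuristic.
import os
import zlib

def looks_incompressible(fragment: bytes, threshold: float = 0.97) -> bool:
    sample = fragment[:64 * 1024]                      # only inspect a small sample
    ratio = len(zlib.compress(sample, 1)) / max(len(sample), 1)
    return ratio > threshold                           # barely shrinks -> incompressible

print(looks_incompressible(os.urandom(1 << 20)))       # random data -> True
print(looks_incompressible(bytes(1 << 20)))            # zeroes -> False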
Oh, you'll need |
Fixed building with older clang versions. |
Cool, I'll try it. I'm still running on Kubuntu 21.04, and the latest apt version is clang-12. If it works under that I'll be good to go. Otherwise I'll ask for a static binary. I'm planning to upgrade to 23.04 in the next couple days, which should have the newer clang. I'm sure I could manually get this up to clang-16, but I'd rather nuke this system and update all at once to 23.04. |
Here you go: dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64.gz |
I was able to compile it, and I'm running this command now:

~/git/dwarfs/build/mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./cat_test.dwarfs -W 12 -w 1 --bloom-filter-size=1

It's at 54% now, but the data rate has fallen to 1 MiB/s at 7% CPU use, so it could be a while before I can report on a before/after time. The initial rate was 10-30 MiB/s with ~20% CPU use. You mentioned earlier about fitting into the CPU cache... What parameters control that? And how would I calculate the right parameters for my cache size? I have 30 MB cache on each E5-2697 v2 CPU. When I was using this for Monero mining, there was a similar issue about fitting into cache. They tune it down to fewer cores in order to fit the whole thing in cache, which resulted in a higher hash rate. Maybe I need to reduce cores and tweak compression parameters to keep the data rate up. |
Mmmmh, are you talking about 7% / 20% CPU use across all cores or just one core? I would expect the new code to run almost as fast as the old code did on the "sparse" image you tested it with. |
The 7% / 20% was the total combined CPU usage. I ran it again from the start, to see the breakdown of individual CPU usage, and it looks like only 4 or 5 CPUs go into 80%+ usage, and the rest are idle. It slows down to 100KiB/s towards the end, with almost no CPU usage (usually one core intermittently spikes to 100%). I have results for the categorizer version... It took 24.84h and compressed to 9.896GiB. The normal release version took 28.22h and compressed to 9.882GiB. If this helps, here is the full output: ~/git/dwarfs/build/mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./cat_test.dwarfs -W 12 -w 1 --bloom-filter-size=1 Let me know if there are any artificial tests I can run to help out. Maybe there is something unique about my CPU's, or my linux settings. I might try your dd random test with a 222GiB image and see what that does... and also test it with fewer cores, to see if this comes back to a cache fit issue. |
This was done with the copy that I compiled. I'll try your static build, in case the compile on my end introduced something weird. Here's the output of mine, to confirm it is the right build: ~/git/dwarfs/build/mkdwarfs mkdwarfs (v0.7.2-130-g254bf41 on branch mhx/categorizer) Usage: mkdwarfs [OPTIONS...] Options: |
Sorry to keep iterating on this... Could you please try my binary along with (more or less) your original set of options:
I'm pretty certain now that the set of options I suggested at some point in this thread don't work well for your use case, so I'd like to see if there's an improvement compared to the command that you originally ran. I've just left out all options that definitely don't make a difference. |
Absolutely... glad to help! I'll abort the last test and run the new line now. I did notice higher CPU usage on your build (30% instead of 20%), but I think the parameter change will tell us more. I'll post an update when it's done. |
That went fast, 29.8m to compress to 12.77 GiB:

~/dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64 --tool=mkdwarfs -l 1 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path -i ./old_raptor -o ./cat_test2.dwarfs

The big differences I see:

-l 1 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path
11414 collisions in 0x00

vs

-l 6 -B 8192 -L 64G --order=path -W 12 -w 1 --bloom-filter-size=1
1883 collisions in 0x00

In my earlier tests, it seemed that the largest block size (-S 30) is what brought the final size down the most. The -l 6 dropped block size to 24. I'm going to retry that command, with -S 30 added, to see what we get. Actually... I think I'll retry the new command with a smaller bloom filter and larger bloom filter to see how it affects speed and size. |
At last! I believe what I missed in my previous analysis of bloom filter size vs. cache size is the deep lookback. There's one "main" bloom filter that sits in front of the block-specific ones. If that filter indicates that there might be a match, all blocks need to be checked, one after another. If that first filter is large enough, you'll get great performance.
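To illustrate the structure (heavily simplified, with plain Python sets standing in for the real bloom filters, so there are no false positives here):

# Sketch of the lookup path with deep lookback: one "main" filter in front,
# then one filter per active block. Not the actual DwarFS data structures.
class Block:
    def __init__(self):
        self.bloom = set()     # per-block filter
        self.hashes = {}       # hash value -> offsets within this block

def find_candidates(hash_value, main_bloom, blocks):
    if hash_value not in main_bloom:   # cheap rejection, taken for almost every byte
        return []
    candidates = []
    for block in blocks:               # with -B 200, up to ~200 blocks to visit
        if hash_value in block.bloom:
            candidates.extend(block.hashes.get(hash_value, []))
    return candidates

The point being: if that first filter rejects, none of the per-block filters are even touched.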
A smaller bloom filter will very likely tank the performance. Going a bit larger might still improve performance slightly. Can you also please try the following options:
Thanks! |
That performed really well! Compressed down to 9.891 GiB in 43.49m. That's even smaller than the 24.84h run at 9.896 GiB. What do those switches do? Full output here:

~/dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64 --tool=mkdwarfs -l 6 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path --categorize=incompressible -C incompressible::null --incompressible-fragments -i ./old_raptor -o ./cat_test2.dwarfs |
Phew! :-) Yeah, this is roughly what I expected.
This is comparable to the earlier run.
This is where this version spent quite a bit of time, squeezing out a few extra GiB.
They're all part of the new categorizer feature. There are going to be more categorizers than just One nice side-effect of the categorizers is that each category gets its own segmenter, all of which can run in parallel. (But this isn't yet the parallelization that I have in mind for the segmenters.) |
Dumb question, but have you tried something like -W8? In my tests, I have found it works best for binaries, like what you're trying to compress. |
I'll give that a shot. For some reason I thought -W only went down to 12... Actually, I think I did try going below 12 and there was a read performance or mount issue. Mounting with -o mlock=must was failing on certain compression parameters, and I think going under 12 triggered that. In any case, I'll test it out. |
What's the size of your metadata? Can you post the output of running |
I'll try remounting my older archives to see if I can find one that does that. Originally I had a script that would try mlock=must and then fallback to no mlock, then I discovered the mlock=try option, so I haven't seen the error in a while. Since I have 256GB of RAM, I use these mount parameters: -o allow_other -o auto_unmount -o cache_image -o cache_files -o cachesize=200g -o workers=48 -o mlock=try -o large_read -o splice_read -o decratio=0 After benchmarking, they all seem to improve performance enough to keep them in there. As for the test with W8, I ended up aborting that after 24 hours. This was a test on a different set of drive images (two different installs of LinuxCNC on 60GB SSD's). Compression with W12 went in about 1 hour, but at W8 it was at 30GB after 24 hours, with 10% left to go, and the W12 archive was 35GB completed, so it didn't look like there was much savings in the cards there. I may have deleted or recompressed the ones that had the mlock fail... If so, I'll attempt to recreate it. It will likely take a 24h+ compression to get that result, so I'll have to give an update on that tomorrow. |
Yes, you can usually compare the first couple of blocks to see how much the difference in window size does or doesn't save. Try playing around with it; it'll depend on your data. |
The m8 archive finished and it mounted without the mlock error. Unfortunately, I didn't save the archive that gave me that error. I set my script back to mlock=must, so if an archive fails to mount in the future, I'll see it and report it. With m12 uncompressed metadata size: 72.65 MiB, 9,095,435 chunks, 34.91 GiB (74.87%) |
Yes, you can test it in increments of 1. It's usually pretty obvious the moment the smaller window gives dwarfs an edge: you see the "saved by segmenting" section explode. |
What's the percentage at the end? There's quite a difference in percentage, but the image sizes seem pretty close. That being said, the gain in terms of image size seems rather small, so I'd probably stick to the larger block size. |
What's the state of this possibly making it into the main branch? The flac compression seems really useful, and I was wondering whether any consideration could be given to using jxl for jpeg compression, in a similar manner? |
It'll make it into the next release. Merging to main is blocked on a few issues that I need to find solutions for and I currently have little time to work on.
There's a subtle difference that makes FLAC much easier to integrate. Audio data can easily be merged/split into blocks, each of which can be independently compressed. So the FLAC integration works on the block level, just like any other compression in DwarFS. For jxl, DwarFS needs the ability to apply compression/decompression on the file level. It's on the roadmap, but not for the next release. |
If you want to play with the new features, the CI build workflow now automatically uploads the binary artifacts (universal binaries as well as the regular binary tarball) for each build. These are definitely not "release quality", but it's still good to give them wider exposure to find bugs early. |
Another wish list request:
Is it possible for the segmenting code to recognize that it is reading a zero-filled segment and accelerate the rate at which it is processed? When segmenting with 8 KiB or 4 KiB windows, on a 200 GB+ image, the read rate is between 5 MiB/s and 10 MiB/s, with about 50% of the image being zero-filled. I have block search set to cover all blocks, so that could be the bottleneck. I'm wondering if some kind of optimization can happen, like recognizing an XXH3 hash of all zeros, and then bypassing block searching and other processes for that block?
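Something along these lines is what I have in mind (just an illustration of the idea, obviously not how dwarfs is structured internally, and the names are made up):

# Idea sketch: before doing any per-byte hashing on a window, check whether it
# is one long run of a single byte (0x00, 0xFF, ...) and take a fast path.
def repeated_byte_run(window: bytes):
    """Return the fill byte if the window is a single repeated byte, else None."""
    if not window:
        return None
    first = window[0]
    return first if window.count(first) == len(window) else None

def process_window(window: bytes):
    fill = repeated_byte_run(window)
    if fill is not None:
        return ("skip", fill)       # emit/extend a cheap "run of fill bytes" match
    return ("segment", None)        # fall back to normal hashing + block search

print(process_window(bytes(4096)))            # ('skip', 0)
print(process_window(b"\xff" * 4096))         # ('skip', 255)
print(process_window(bytes(4095) + b"\x01"))  # ('segment', None)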
Read speed for zero blocks is phenomenal after your last fix. I'm hoping there's an easy way to bring write speed up for zeros too O:-)