Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added multithreading to zstd compression #689

Merged
merged 3 commits into from
Jul 27, 2024
Merged

Conversation

nalabrie
Copy link
Contributor

Hello, this is my first PR here and my Rust is a bit rusty, so forgive me if this is a foolish change or if PRs are unwelcome. I created issue #688 and decided to take a stab at implementing it.

…age version. Added num_cpus package to get core count.
@nalabrie
Copy link
Contributor Author

Whoops I think I put the wrong pull number in the CHANGELOG.md

fixed pull number
@nalabrie
Copy link
Contributor Author

This needs more testing, in my opinion. It works, but the speedup isn't as significant as it probably could be. Here is a result of a quick test using a testdir folder of 57 files totaling ~200MB:

  • Using current ouch stable version, ouch c --level 20 testdir output.tar.zst takes about 1 minute, 46 seconds (106910 ms).
  • Using the version in this PR, the same command takes about 1 minute, 7 seconds (67829 ms).
  • Using zstd version 1.5.5, zstd -T0 --ultra -20 testdir.tar takes about 35 seconds (35106 ms).
    • Note that in this case the testdir was already a tar file, which I timed separately to be about 400 ms.
    • The --ultra flag simply lets the level go above 19.

Copy link
Member

@marcospb19 marcospb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good PR! Thanks for taking the time to contribute, I added a small requested change here.

src/commands/compress.rs Show resolved Hide resolved
@marcospb19
Copy link
Member

the speedup isn't as significant as it probably could be

I see, to be honest I'm not 100% sure why our version is slower.

I'll let you make the call tho, if you want to investigate further or you think we can merge this as it is.

If you decide to investigate, I highly highly recommend looking into flamegraph which can be installed from the crate cargo-flamegraph (you don't need to use it through cargo, it's a binary), with it, you can run both zstd and Ouch to see what's taking most of the time.

But it's a valuable improvement regardless! So feel incentivized and not pressured to do so 😅 .

Copy link
Member

@marcospb19 marcospb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops can you also add the ² from ✓² in README to tell that compression is done in parallel? :v

https://github.com/ouch-org/ouch?tab=readme-ov-file#supported-formats

@nalabrie
Copy link
Contributor Author

Thank you, I am happy that this change is appreciated and I'm glad to contribute. I will make the changes that you requested in the coming hours.

As for investigating why the speedup isn't on par with the zstd tool, I do want to take a closer look at it but no promises on finding a solution. I enjoy Rust as a hobby and consider myself to be at an intermediate level, but I'm not super confident in my abilities if I'm being honest. I've heard of flamegraph and have wanted to try it out so this gives me a good excuse to learn it. I will work on this some today and tomorrow but if I cannot find anything then I'm fine with merging it as is (after the changes mentioned above, of course).

@nalabrie
Copy link
Contributor Author

nalabrie commented Jul 17, 2024

I'm not at my Linux machine right now, so I can't run flamegraph. Apparently it doesn't work on Windows, or at least I can't get it to work. I should have time to do that tomorrow, but for now I wanted to at least provide some proper benchmarks. I have also made the requested changes to the PR @marcospb19.

I did some research and found this issue on the repo for the zstd crate that ouch is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd binary?

Benchmark Notes

  • All results measured in seconds.
  • Files used for compressing are from the Silesia compression corpus. This is what Facebook uses for their zstd benchmarks.
  • All files compressed into a .tar.zst archive using level 20.
  • st = single-threaded, mt = multi-threaded (using all available physical cores).
  • System details:
    • AMD Ryzen 5 3600X (6 cores, 12 threads)
    • zstd 1.5.6
    • ouch 0.5.1 (for single-threaded benchmarks)
    • ouch this commit from my fork (for multi-threaded benchmarks) (release profile)

Results

ouch (st) ouch (mt) zstd (mt)
Run 1 111.9764401 69.7551597 47.8371113
Run 2 112.2669172 69.3189215 47.6084749
Run 3 112.0817931 69.1581166 47.5799035
Run 4 111.8815577 69.2234262 47.9138429
Run 5 111.7506892 69.5047793 47.7037464
Run 6 112.0945928 69.2046794 47.5450366
Run 7 112.1228552 69.196134 47.8802425
Run 8 111.7783502 69.2298791 47.5509878
Run 9 112.0262394 69.2156705 48.0269953
Run 10 111.7842633 69.3845321 47.8702028
Average 111.97636982 69.31912984 47.7516544

In short, multi-threaded ouch takes about 38% less time than single-threaded ouch and the zstd binary takes about 31% less time than multi-threaded ouch.

@nalabrie
Copy link
Contributor Author

Here is the output of flamegraph --root -- ./ouch c --level 20 silesia silesia.tar.zst. Ouch was compiled with the dev profile.
flamegraph
Judging by the comment preview, it looks like the svg isn't interactive like its supposed to be. If that is the case for anyone reading this, I have also uploaded the file here.

I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further. At least not at the moment, but when I have more free time I will possibly investigate further. I'm fine with merging it as is.

As a side note, here is more terminal output that flamegraph spit out. Not sure how important it is.

[ perf record: Woken up 5997 times to write data ]
Warning:
Processed 101962 events and lost 6 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 1504.683 MB perf.data (95002 samples) ]
writing flamegraph to "flamegraph.svg"

Another side note, it seems to run better on Linux compared to Windows (same hardware), but that isn't too surprising to me. The gap is still there between ouch and zstd, however. I only timed one test, so no averages.

  • ./ouch c --level 20 silesia silesia.tar.zst took 54.9 seconds (release profile).
  • tar cf silesia.tar silesia && zstdmt --ultra -20 silesia.tar took 43.888 seconds.

@marcospb19
Copy link
Member

Oh thank you so much for all the investigation and effort you put into this! I highly appreciate it.

I did some research and found gyscos/zstd-rs#266 on the repo for the zstd crate that ouch is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd binary?

That'd make sense! I didn't know that, and yeah we use the streaming API to chain encoders and decoders.

It also makes sense that it adds overhead since we make lots of small calls instead of just one.

I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further.

Yeah if you run flamegraph just against Ouch, you'll be a little lost.

To compare both tools you'd need to run flamegraph against ouch and zstd, and if we use the same implementation (and functions, I'm not entirely sure), then you'd see the same bars but with different proportions.

With that, by comparing the size of the bars you could get a rough comparison of where the extra time is spent.

However, on second thought, I don't advise you to investigate further, flamegraph sampling with multithreading can be unpredictable, and we don't know if we are calling the exact same functions, and you'd likely find out at the end that it's indeed a problem in the Streaming API.

Well, you can do it if you're still curious 😄 but you've done a lot for this PR, ty!

@marcospb19 marcospb19 merged commit 3867fa3 into ouch-org:main Jul 27, 2024
82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants