How to control how rayon par_iter splits work? #1134
-
I have a list of files I want to perform an expensive operation on (hashing them with sha256 and comparing to known good values). I have been using rayon for this. However, when profiling I noticed a long tail of just a single thread doing work at the end.

My files have wildly varying sizes (from tens of bytes to hundreds of megabytes; I'm checking files installed by Linux distro package managers), so this makes sense: I thought I got unlucky and got a big file at the end. My first idea was to sort by size and put the big files first in the iterator that I run rayon on, so that all the big slow jobs would get started early. This did not work: one thread still runs for an exceedingly long time. It appears that rayon splits the work into chunks and schedules those (and that if a chunk is long-running, work can't be stolen from it). This makes sense to reduce overhead, of course, but I would like to adjust this behaviour somehow.

Perhaps I could use smaller chunks, or associate a cost function with each entry in the iterator. Maybe let me provide a function where I manually chunk the work: I know ahead of time how costly each unit of work is (for large sizes it scales with the file size, which I already know from package metadata), so I could just take a fixed number of MB and declare that as a chunk. Or maybe you have another idea for how to deal with this.
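For concreteness, a minimal sketch of what I'm doing (`FileJob` and `hash_and_verify` are stand-ins for my real types, not actual code):

```rust
use rayon::prelude::*;
use std::cmp::Reverse;
use std::path::{Path, PathBuf};

/// Placeholder for the real sha256-and-compare step.
fn hash_and_verify(_path: &Path) {
    // ... read the file, hash it, compare against the known-good digest ...
}

/// A work item: path plus the size already known from package metadata.
struct FileJob {
    path: PathBuf,
    size: u64,
}

fn check_all(mut jobs: Vec<FileJob>) {
    // Largest files first, hoping the slow jobs get picked up early.
    jobs.sort_by_key(|job| Reverse(job.size));
    jobs.par_iter().for_each(|job| hash_and_verify(&job.path));
}
```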
Replies: 1 comment 3 replies
-
This would be a quick workaround, but it sounds like you do not really benefit from the work done by `ParallelIterator` to do as much sequentially as possible while keeping all cores occupied, since every single work item is sufficiently large in your case. Hence maybe just using `par_bridge` would immediately yield the maximum granularity best suited to this use case?
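For example, something along these lines (a rough sketch, reusing the `FileJob`/`hash_and_verify` stand-ins from the question):

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};

fn check_all(jobs: Vec<FileJob>) {
    // par_bridge() turns a sequential Iterator into a ParallelIterator that
    // hands items to idle worker threads one at a time, so one huge file can
    // no longer keep a whole pre-split chunk stuck on a single thread.
    jobs.into_iter()
        .par_bridge()
        .for_each(|job| hash_and_verify(&job.path));
}
```

Note that `par_bridge` does not preserve the input order, which should be fine for independent hash checks.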