Scheduling fairness between spawn and par_iter #1054
Hello RemiFontan and Rayon developers, I was about to open a very similar issue.
I logged the whole run and found the collector getting stuck at random times while trying to insert. Temporary workaround:
Rayon can barely be said to have a scheduler at all -- it just greedily looks for any available work to keep busy as much as possible. There are some heuristics to this: local stealing defaults to LIFO to work on stuff that's hopefully still in cache, while cross-thread stealing is FIFO to take stuff that the "victim" perhaps hasn't touched in a while. And the global queue (injected from outside the pool) is considered last of all, with the idea that we should finish what the pool is working on before starting anything else.
With your example of The
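As an illustration of the LIFO/FIFO point above, here is a minimal sketch (not code from this thread) contrasting rayon::spawn with rayon::spawn_fifo on a single-threaded pool; the observed ordering is a scheduler detail rather than a guarantee:

    use std::sync::mpsc::channel;

    fn main() {
        // Single-threaded pool so the execution order is easy to observe.
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(1)
            .build()
            .unwrap();

        let (tx, rx) = channel();
        pool.spawn(move || {
            // From inside a worker, `rayon::spawn` pushes onto that worker's
            // local deque, which it pops LIFO -- so expect roughly C, B, A.
            // Swapping in `rayon::spawn_fifo` should give A, B, C instead.
            for name in ["A", "B", "C"] {
                let tx = tx.clone();
                rayon::spawn(move || tx.send(name).unwrap());
            }
        });

        for name in rx.iter().take(3) {
            println!("ran {name}");
        }
    }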
Not a rayon developer here, but I was working on a similar scenario. I would guess the reason could be
I have some workarounds for some limited scenarios, but not all:
Edit: the above workarounds require external producer threads in terms of architecture.
Thanks for the hint about the LIFO aspect. It helped me a lot.
Hi @cuviper, on second thought, for a GUI system where tasks are generated continuously (or basically any system that generates tasks without waiting for them), is it possible that they can form a growing chain of blocking that eventually overflows the stack? I feel like some diagnostic APIs may be needed to help avert this problem.
Yes, I think that once it's possible for this to happen between two events, it's also possible for it to happen again indefinitely. My intuition is that the probability of it repeating grows vanishingly small, but I'm not sure.
Any idea what that might look like?
Since this problem is beyond the scope of rayon, I was thinking of a backpressure mechanism to tell users to stop congesting the thread pool. A crude idea: if the user is continuously generating tasks, then they should also continuously (though less frequently) check how congested the thread pool is. In this scenario, rayon may only need to provide one or more atomic variables to roughly indicate the stack depth of the most congested worker. This is a very crude idea and I haven't really run into this kind of problem myself, so we can treat it as a long shot.
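A rough sketch of that backpressure idea follows, for illustration only: the in_flight counter and MAX_IN_FLIGHT threshold are made-up names, not a rayon API, and this tracks tasks handed to the pool rather than the per-worker stack depth mentioned above (which would need support from rayon itself):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::{thread, time::Duration};

    // Illustrative threshold; not a rayon API.
    const MAX_IN_FLIGHT: usize = 64;

    fn main() {
        // Counts tasks handed to the pool that haven't finished yet, as a
        // producer-side stand-in for the congestion signal discussed above.
        let in_flight = Arc::new(AtomicUsize::new(0));

        for job_id in 0..1_000 {
            // Throttle the producer until the pool has drained a bit.
            while in_flight.load(Ordering::Acquire) >= MAX_IN_FLIGHT {
                thread::sleep(Duration::from_millis(1));
            }

            in_flight.fetch_add(1, Ordering::AcqRel);
            let in_flight = Arc::clone(&in_flight);
            rayon::spawn(move || {
                // ... heavy work for `job_id` would go here ...
                let _ = job_id;
                in_flight.fetch_sub(1, Ordering::AcqRel);
            });
        }

        // Let the remaining tasks drain before exiting.
        while in_flight.load(Ordering::Acquire) > 0 {
            thread::sleep(Duration::from_millis(1));
        }
    }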
Thanks for all your replies, this is very interesting. Out of curiosity, I reproduced that example in C++ with TBB. My assumption was that Rayon and TBB work similarly, or at least that's what I was thinking. It seems that TBB suffers less from that problem: nested parallel_for calls do tend to block parent tasks a little less. Their documentation explains the heuristics used, and I find that snippet interesting: Is rayon following the same logic? Here's my ugly C++ equivalent: that gave me, on my mac:
I believe so. For more detail you can refer to https://github.com/rayon-rs/rfcs/blob/master/accepted/rfc0001-scope-scheduling.md. From what I have read, TBB is using exactly the same strategy as rayon's default LIFO: the local thread always tries to execute its newest task, while a thief always steals the oldest one.
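A toy model of that strategy, just to show which end of the deque each side uses (this is not rayon's or TBB's actual deque implementation, which works with atomics across threads):

    use std::collections::VecDeque;

    // Toy model: the owning worker pushes and pops at the back of its deque
    // (newest first), while thieves take from the front (oldest first).
    struct Worker {
        deque: VecDeque<&'static str>,
    }

    impl Worker {
        fn push(&mut self, task: &'static str) {
            self.deque.push_back(task);
        }
        fn pop_local(&mut self) -> Option<&'static str> {
            self.deque.pop_back() // owner: LIFO, newest task first
        }
        fn steal(&mut self) -> Option<&'static str> {
            self.deque.pop_front() // thief: FIFO, oldest task first
        }
    }

    fn main() {
        let mut w = Worker { deque: VecDeque::new() };
        w.push("t1");
        w.push("t2");
        w.push("t3");
        assert_eq!(w.pop_local(), Some("t3"));
        assert_eq!(w.steal(), Some("t1"));
    }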
Thanks for sharing the rayon doc. I can see the parallels between both documents. So, if I understand correctly, with rayon I should expect a thread to steal the oldest task (front of the deque) as soon as its own local task list is empty. My little script (see the playground above) does not seem to show any sign of work stealing. I tried initialising the thread pool with more threads to force stealing. My machine has 10 cores; I tried with 20 threads and 40 threads, and the behaviour did not change much. Most of the spawned tasks only completed once all nested par_iter calls had completed. However, in the provided C++/TBB example, as soon as a parallel_for is done, its parent task resumes and has a chance to complete. Does that mean that in my example rayon does not do any work stealing, while TBB does? Is there a way to visualise what rayon does, like tracing of some sort? edit:
I believe we can see two successful work-stealing events happening in the rayon run, but all other tasks are stuck. In the TBB example, every task gets processed fairly soon after its inner parallel_for completes. Apologies for repeating the same thing again, but I wish I understood what is happening. :-)
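On the tracing question, a crude way to see which worker runs each item is rayon::current_thread_index(); a minimal sketch (illustrative only, not taken from the playground):

    use rayon::prelude::*;
    use std::{thread, time::Duration};

    fn main() {
        // Print which worker thread processes each item; cross-thread
        // stealing shows up as differing worker indices.
        (0..16).into_par_iter().for_each(|i| {
            println!(
                "item {i:2} ran on worker {:?}",
                rayon::current_thread_index()
            );
            thread::sleep(Duration::from_millis(10));
        });
    }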
First I want to point out that there is work stealing happening in your rayon results. It just seems that in that scenario, rayon steals the "sub"-items produced by the par_iter much more frequently. And "stealing" does not imply that a thief can complete a task on behalf of the original adopter thread. On the contrary, if the original adopter is busy stealing from others, then the completion of that task may be indefinitely postponed, as previous comments have pointed out. I have no experience with TBB, but I would point to some potential causes for the differences:
Note, I'm not an expert at all in TBB and obviously I don't know much about the internals of rayon; looking at TBB's source code, this is my very limited understanding. parallel_for seems to create and register some sort of "wait_node" object that establishes a parent-child relationship between the original stack and the task, in run(...):

    static void run(const Range& range, const Body& body, Partitioner& partitioner, task_group_context& context) {
        if ( !range.empty() ) {
            small_object_allocator alloc{};
            start_for& for_task = *alloc.new_object<start_for>(range, body, partitioner, alloc);
            // defer creation of the wait node until task allocation succeeds
            wait_node wn;
            for_task.my_parent = &wn;
            execute_and_wait(for_task, context, wn.m_wait, context);
        }
    }

and finalize(...):

    template<typename Range, typename Body, typename Partitioner>
    void start_for<Range, Body, Partitioner>::finalize(const execution_data& ed) {
        // Get the current parent and allocator an object destruction
        node* parent = my_parent;
        auto allocator = my_allocator;
        // Task execution finished - destroy it
        this->~start_for();
        // Unwind the tree decrementing the parent`s reference count
        fold_tree<tree_node>(parent, ed);
        allocator.deallocate(this, ed);
    }

Now, I clearly don't understand how it actually works under the hood, but I'm wondering whether this parent waiter object is the reason the parent tasks are resumed as soon as their parallel_for tasks are done... this is all speculation at this point.
Interesting -- I think that is a plausible trail. Rayon runs parallel iterators with recursive So in essence, one waiter has less chance of getting "distracted" than a bunch of recursive waiters, or that one waiter may come back around and notice completion more promptly than a bunch would. We couldn't do that in general, because anything that has a meaningful reduction of results will need to apply it at each step. But cases where we use the internal
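For illustration, the "single waiter" shape being described here is roughly a count-down latch: every leaf of the work decrements one shared counter, and a single caller blocks until it reaches zero. The sketch below is only conceptual -- it is not rayon's internals nor the proof of concept discussed below, and calling wait() from inside a worker would block that worker instead of letting it keep stealing:

    use std::sync::{Arc, Condvar, Mutex};

    // Conceptual count-down latch: one shared counter, one blocking waiter.
    struct CountLatch {
        remaining: Mutex<usize>,
        cond: Condvar,
    }

    impl CountLatch {
        fn new(count: usize) -> Arc<Self> {
            Arc::new(Self { remaining: Mutex::new(count), cond: Condvar::new() })
        }
        fn done(&self) {
            let mut n = self.remaining.lock().unwrap();
            *n -= 1;
            if *n == 0 {
                self.cond.notify_all();
            }
        }
        fn wait(&self) {
            let mut n = self.remaining.lock().unwrap();
            while *n > 0 {
                n = self.cond.wait(n).unwrap();
            }
        }
    }

    fn main() {
        let latch = CountLatch::new(100);
        for _ in 0..100 {
            let latch = Arc::clone(&latch);
            rayon::spawn(move || {
                // ... leaf work ...
                latch.done();
            });
        }
        // A single waiter for the whole batch, rather than one blocked
        // caller per recursive split.
        latch.wait();
    }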
Here's a proof of concept for the This passes all tests and reaches "Done" in your example much sooner. I haven't checked any benchmarks yet to see what other effects it has, but at the very least it does add an allocation for the spawns.
This is great, it behaves exactly like I was hoping. I very much appreciate that you took the time to make a proof of concept. Also, I pulled your branch and tried it with my GUI, and it works great from the user-experience point of view. My widgets now get updated as soon as their spawned tasks are done; no more tasks getting stuck until all tasks are completed.
Unfortunately, it does appear to have high overhead. From the
But with that
There's one improvement (!), but three much worse regressions. This is even after I tweaked #1057 and #1058 trying to address some of the shortcomings. And while micro-benchmarks are always suspect, I also found that one of my "real world" Project Euler solutions using parallel I'm trying to think of ways to make this opt-in -- perhaps a Alternatively, you could try to write your GUI code directly using
Ouch. Is the overhead in allocating the task objects for
Looking at perf for I think the main lesson is that
I've gotten rid of this in #1059, so using
The shared counter still dominates perf -- now 45% |
Out of curiosity, by "distributed counter", did you mean something along these lines?
Something like that, yeah, but I'm not sure if those designs apply, because they're de-prioritizing the reader, and we still need to know when it reaches zero.
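For reference, a sharded counter along those lines might look like the sketch below (illustrative only; the shard-per-thread choice is a made-up heuristic). It also hints at the caveat above: the total is only ever an approximate sum over shards, so there is no cheap "it just reached zero" signal.

    use std::sync::atomic::{AtomicIsize, Ordering};

    const SHARDS: usize = 16;

    // Sharded counter: updates hit a per-thread shard to reduce contention;
    // reads sum all shards, so the total is only approximate.
    struct ShardedCounter {
        shards: [AtomicIsize; SHARDS],
    }

    impl ShardedCounter {
        fn new() -> Self {
            Self { shards: std::array::from_fn(|_| AtomicIsize::new(0)) }
        }
        fn add(&self, delta: isize) {
            // Made-up shard choice; a real design would also pad shards to
            // separate cache lines.
            let shard = rayon::current_thread_index().unwrap_or(0) % SHARDS;
            self.shards[shard].fetch_add(delta, Ordering::Relaxed);
        }
        fn approximate_total(&self) -> isize {
            self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
        }
    }

    fn main() {
        let c = ShardedCounter::new();
        c.add(5);
        c.add(-2);
        println!("approximate total = {}", c.approximate_total());
    }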
Hi,
I have noticed a surprising behaviour. My program is spawning multiple tasks, and each of them calls into some par_iter functions.
I would expect a spawned task to complete as soon as its par_iter has completed. However, all tasks seem to get stuck until all par_iter loops are completed.
It is a little tricky to explain, but fortunately I could reproduce this behaviour on the Rust playground.
Permalink to the playground
10 tasks are spawned. Each of them increments a counter in parallel and finally switches its state to "done" once finished.
The main thread polls those tasks and prints their progress until all tasks are done.
I would expect each task to switch to "done" as soon as its inner iteration has completed. However, depending on the number of threads, the number of par_iter iterations and possibly the architecture, some tasks get stuck until all are done.
On my MacBook M1 Max (Ventura 13.4), all tasks are stuck until they are all done.
On the Rust playground, the behaviour is similar.
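A rough approximation of the playground example described above (the actual playground code is only linked, not reproduced here) would look something like this:

    use rayon::prelude::*;
    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::{thread, time::Duration};

    fn main() {
        // One (counter, done) pair per task.
        let tasks: Vec<_> = (0..10)
            .map(|_| Arc::new((AtomicUsize::new(0), AtomicBool::new(false))))
            .collect();

        for task in &tasks {
            let task = Arc::clone(task);
            rayon::spawn(move || {
                (0..1_000).into_par_iter().for_each(|_| {
                    task.0.fetch_add(1, Ordering::Relaxed);
                    thread::sleep(Duration::from_micros(100));
                });
                // Expected to run as soon as this task's par_iter finishes,
                // but in practice it can be delayed until most tasks are done.
                task.1.store(true, Ordering::Relaxed);
            });
        }

        // Main thread polls and prints progress until every task is done.
        while !tasks.iter().all(|t| t.1.load(Ordering::Relaxed)) {
            let progress: Vec<_> = tasks
                .iter()
                .map(|t| (t.0.load(Ordering::Relaxed), t.1.load(Ordering::Relaxed)))
                .collect();
            println!("{progress:?}");
            thread::sleep(Duration::from_millis(100));
        }
        println!("Done");
    }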
I noticed this behaviour in my GUI.
Each of those tasks does some heavy calculations for a widget. I was hoping each widget would be able to update its UI as soon as its own computations are done. But unfortunately they all get stuck until all computations are done.
I tried to use spawn_fifo, but that did not make a difference in my case. Are there other options to change this behaviour? Regards.