
[RFC] Adding API for parallel block to task_arena to warm-up/retain/release worker threads #1522

Open. Wants to merge 11 commits into `master`.

188 changes: 188 additions & 0 deletions in `rfcs/proposed/parallel_block_for_task_arena/README.md`
# Adding API for parallel block to task_arena to warm-up/retain/release worker threads

## Introduction

In oneTBB, there has never been an API that allows users to block worker threads within the arena.
This design choice was made to preserve the composability of the application.<br>
Before PR#1352, workers moved to the thread pool to sleep once there were no arenas with active
demand. However, PR#1352 introduced a delayed leave behavior to the library that
results in blocking threads for an _implementation-defined_ duration inside an arena
if there is no active demand across all arenas. This change significantly
improved performance for various applications on high thread count systems.<br>
The main idea is that, usually, after one parallel computation ends,
another one starts after some time. The delayed leave behavior is a heuristic that exploits this,
covering most cases within the _implementation-defined_ duration.
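For illustration, the kind of pattern the heuristic targets is a series of back-to-back
parallel computations, as in the following sketch (`process_frames` and the doubling loop
are hypothetical examples, not part of oneTBB or this proposal):

```cpp
#include <oneapi/tbb/parallel_for.h>
#include <cstddef>
#include <vector>

// Hypothetical workload: each iteration launches a new parallel computation
// shortly after the previous one finishes. With delayed leave, worker threads
// are still present in the arena and pick up the new work immediately,
// avoiding repeated wake-up latency between iterations.
void process_frames(std::vector<std::vector<float>>& frames) {
    for (auto& frame : frames) {
        oneapi::tbb::parallel_for(std::size_t(0), frame.size(),
            [&](std::size_t i) { frame[i] *= 2.0f; });
    }
}
```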

However, the new behavior is not a perfect match for all scenarios:
* The heuristic of delayed leave is unsuitable for tasks that are submitted
in an unpredictable pattern and/or with unpredictable duration.
* If oneTBB is used in composable scenarios, it does not behave as
a good citizen, since it keeps consuming CPU resources.
  * For example, if an application builds a pipeline where oneTBB is used for one stage
and OpenMP is used for a subsequent stage, there is a chance that oneTBB workers will
interfere with OpenMP threads. This interference might result in slight oversubscription,
which in turn might lead to underperformance.

So there are two related problems, each with a different resolution:
* Completely disable the new behavior for scenarios where the heuristic of delayed leave is unsuitable.
* Optimize library behavior so customers can benefit from the heuristic of delayed leave, while
making it possible to indicate that "it is time to release threads".

## Proposal

Let's tackle these problems one by one.

### Completely disable the new behavior

Let’s consider both “Delayed leave” and “Fast leave” as two different states in a state machine.<br>
* Workloads that know they cannot benefit from the heuristic of delayed leave,
and for which it instead brings performance problems, can create an arena in the “Fast leave” state.
* Conversely, an arena will be created in the “Delayed leave” state by default, because
the delayed leave behavior is a heuristic that benefits most workloads.

<img src="completely_disable_new_behavior.png" width=800>

There are questions that we need to answer:
* Do we see any value in an arena potentially transitioning from one state to another?
  * What if different types of workloads are mixed in one application?
  * Different types of arenas can be used for different types of workloads.

### When should threads leave?

oneTBB itself can only guess when the ideal time to release threads from the arena is.
Therefore, it makes a best effort to preserve and enhance performance without completely
breaking the composability guarantees (that is how delayed leave is implemented).

As we already discussed, there are cases where it does not work perfectly;
therefore, customers who want to further optimize this
aspect of oneTBB behavior should be able to do so.

This problem can be considered from another angle. Essentially, if the user can indicate
where parallel computation ends, they can also indicate where it starts.

<img src="parallel_block_introduction.png" width=800>

With this approach, the user not only releases threads when necessary but also specifies a
programmable block where worker threads should expect new work to arrive regularly
in the executing arena.

Let’s add a new “Parallel Block” state to the existing state machine.

> **_NOTE:_** The "Fast leave" state is colored grey just for simplicity of the chart.
Assume that the arena was created with "Delayed leave"; the same logic
is applicable to "Fast leave".

<img src="parallel_block_state_initial.png" width=800>

This state diagram leads to several questions. Here are some of them:
* What if there are multiple Parallel Blocks?
* If “End of Parallel Block” leads back to “Delayed leave”, how soon will threads
be released from the arena?
  * What if we indicated that threads should leave the arena after the "Parallel Block"?
  * What if we just indicated the end of the "Parallel Block"?

The extended state machine aims to answer these questions.
* The first call to “Start of PB” transitions the arena into the “Parallel Block” state.
* The last call to “End of PB” transitions it back to the “Delayed leave” state,
or into “One-time Fast leave” if it is indicated that threads should leave sooner.
* Concurrent or nested calls to “Start of PB” or “End of PB”
increment or decrement a reference counter.

<img src="parallel_block_state_final.png" width=800>

Let's consider the semantics that an API for explicit parallel blocks can provide:
* Start of a parallel block:
  * Indicates the point from which the scheduler can use a hint and keep threads in the arena
for longer.
  * Serves as a warm-up hint to the scheduler:
    * Makes some worker threads immediately available at the start of the real computation.
  * Should have guarantees similar to `task_arena::enqueue` from a signal standpoint.
* "Parallel block" itself:
  * The scheduler can implement different policies to retain threads in the arena.
  * The semantics of retaining threads is a hint to the scheduler;
thus, no real guarantee is provided. The scheduler can ignore the hint and
move threads to another arena or to sleep if conditions are met.
* End of a parallel block:
  * Indicates the point from which the scheduler may drop the hint and
no longer retain threads in the arena.
  * Indicates that worker threads should avoid busy-waiting once there is no more work
in the arena, so that they can leave sooner.
> **Contributor:** The wording "should enter..." might create/add to confusion. As already mentioned, from the usage standpoint "one-time fast leave" is not a state to enter, but rather a command to execute when the parallel phase has ended and no other one is active.
> So I would change it to something like "Indicates that worker threads should avoid busy-waiting once there is no more work in the arena".

> **Contributor:** Replaced with the suggested wording.

* If work is submitted immediately after the end of the parallel block,
the arena's default "workers leave" state is restored.
  * If the default "workers leave" state is "Fast leave", the result is a no-op.
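Using the proposed functions (declared in the next section), the oneTBB-then-OpenMP pipeline
from the introduction might look like the sketch below; `run_tbb_stage` and `run_openmp_stage`
are hypothetical placeholders:

```cpp
#include <oneapi/tbb/task_arena.h>

void run_tbb_stage();    // hypothetical oneTBB-parallel stage
void run_openmp_stage(); // hypothetical OpenMP-parallel stage

void pipeline_iteration(tbb::task_arena& arena) {
    arena.start_parallel_block();  // proposed API: warm-up/retention hint
    arena.execute([] { run_tbb_stage(); });
    // Request the one-time fast leave so oneTBB workers do not busy-wait
    // and interfere with the OpenMP threads of the next stage.
    arena.end_parallel_block(/*set_one_time_fast_leave=*/true);

    run_openmp_stage();
}
```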


### Proposed API

```cpp
class task_arena {
    enum class workers_leave : /* unspecified type */ {
        fast = /* unspecified */,
        delayed = /* unspecified */
    };
```
> **Contributor:** We need to find better names for the enum class and its values.
> I am specifically concerned about the use of "delayed", since the actual behavior might be platform-specific and not always delayed. But workers_leave is not a very good name either.

> **@isaevil (Nov 27, 2024):** I've been thinking... Combined with your previous comment about automatic, perhaps we could have 3 modes instead:
>
> * automatic: platform-specific default setting
> * fast (or any other more appropriate name)
> * delayed (or any other more appropriate name)
>
> If we assume that we have these 3 modes, the fast and delayed modes would enforce behavior regardless of the platform. That would give the user more control while preserving usability (for example, automatic would be translated to the fast option for hybrid systems).
>
> What do you think?

> **@akukanov (Nov 27, 2024):** Yes, that's an option we can consider. We'd need to think through what the "enforced" delayed mode will guarantee.
>
> For the "automatic" mode, we say that after work completion threads might be retained in the arena for an unspecified time chosen by internal heuristics. How would the definition of the "enforced" delayed leave mode differ, and what additional guarantees would it provide to make it worthwhile for users to choose it?

> **Contributor:** It seems that if we have a dedicated automatic policy, the delayed policy should guarantee at least some level of thread retention relative to the fast policy.

> **Contributor (Author):** I think the whole structure is just a hint to the scheduler, with no real guarantees provided; therefore from the description we get:
>
> 1. threads will leave without delay with "Fast"
> 2. threads might have a delay before leaving with "Delayed"
> 3. automatic will decide what state to choose
>
> From the implementation standpoint it makes a lot of sense, since we will have clear invariants for the arena, e.g., the default arena on a hybrid platform will have the "Fast" leave state.
> So it definitely improves the implementation logic while bringing some potential value to the users ("Delayed" will behave as thread retention if the user explicitly specified it).

> **Contributor:** @akukanov Regarding the enumeration class name: do you find leave_policy a better name than workers_leave? It seems more natural to me when used during arena construction:
>
> ```cpp
> tbb::task_arena ta{..., tbb::task_arena::leave_policy::fast};
> ```

> **@akukanov (Nov 28, 2024):** @pavelkumbrasev please describe the semantics of the delayed hint in a way that is meaningful for users.
>
> For example, the description I used above ("after work completion threads might be retained in the arena for unspecified time chosen by internal heuristics") as well as what you mentioned ("threads might have a delay before leaving with 'Delayed'") are good for automatic but rather bad for delayed, because all aspects of the decision are left to the implementation. Even if changed to "will be retained for unspecified time", it would still be rather weak, because the time can be anything, including arbitrarily close to 0; that is, it's not really different from automatic, and there is no clear reason to prefer it.

> **Contributor (Author):** Sure. Perhaps we are not on the same page. I would like it to be:
>
> 1. delayed: after work completion, threads might be retained in the arena for an unspecified time chosen by internal heuristics.
> 2. automatic: the implementation will choose between "fast" and "delayed".
>
> Automatic is basically another heuristic for choosing "fast" or "delayed" based on the underlying HW. Perhaps automatic is not the best name.

> **@akukanov (Nov 28, 2024):** With this definition, I see no difference for users between "automatic" and "delayed", because in both cases the decision of whether to delay or not, and for how long, is left to the implementation. If that is the intended behavior, let's not complicate the API with a redundant enum value.

> **Contributor:** Okay, let's have fast and automatic policies then, since we're not sure right now whether we can provide meaningful guarantees for a delayed policy to the user.

```cpp
    task_arena(int max_concurrency = automatic, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               workers_leave a_workers_leave = workers_leave::delayed);

    task_arena(const constraints& constraints_, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               workers_leave a_workers_leave = workers_leave::delayed);

    void initialize(int max_concurrency, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    workers_leave a_workers_leave = workers_leave::delayed);

    void initialize(constraints a_constraints, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    workers_leave a_workers_leave = workers_leave::delayed);

    void start_parallel_block();
    void end_parallel_block(bool set_one_time_fast_leave = false);
```
> **Contributor:**
> 1. To make the new API more composable, I would indicate that the setting affects primarily this parallel block, while in the absence of other parallel blocks with conflicting requests it affects the behavior of the arena as a whole.
> 2. It looks as if it is tailored to a single scenario. Along with the first bullet, I believe this is the reason why there is that "NOP" thing.
>
> Therefore, my suggestion is to address both of these by changing the API (here and in other places) to something like the following:
>
> ```cpp
> void start_parallel_block();
> void end_parallel_block(workers_leave this_block_leave = workers_leave::delayed);
> ```
>
> Then add somewhere the explanation of how this affects/changes the behavior of the current parallel block and how it composes with the arena's setting and other parallel blocks within it. For example, it may be like:
>
> "This start and end of parallel block API allows making a one-time change in the behavior of the arena setting with which it was initialized. If this behavior matches the arena's setting, then the workers' leave behavior does not change. In case of conflicting requests coming from multiple parallel blocks simultaneously, the scheduler chooses the behavior it considers optimal."

> **Contributor:** There is no composability problem really, as all but the last end-of-block calls are simply ignored, and only the last one has the one-time impact on the leave policy. Also, it does not affect the arena settings, according to the design.
>
> Of course, if the calls come from different threads, in general it is impossible to predict which one will be the last. However, even if the code is designed to create parallel blocks in the same arena by multiple threads, all these blocks might have the same leave policy, so that it does not matter which one is the last to end.
>
> Using the same enum for the end of a block as for the construction of the arena seems more confusing than helpful to me, as it may be perceived as changing the arena state permanently.

> **Contributor:**
> > in general it is impossible to predict which one will be the last
>
> My guess is it is the last one that decreases the ref counter to zero. I don't see any issue with this. Later blocks use the arena's policy if not specified explicitly.
>
> > Using the same enum for the end of block as for the construction of the arena seems more confusing than helpful to me, as it may be perceived as changing the arena state permanently.
>
> I indicated the difference in the parameter naming, this_block_leave, but if that is not enough, we can also indicate it more explicitly with additional types: arena_workers_leave and phase_workers_leave. Nevertheless, my opinion is that it would not be a problem if the documentation/specification includes an explanation of this.

> **@aleksei-fedotov (Nov 27, 2024):** One more question here: what if the default value were the opposite of the one specified for the arena setting? Meaning that if the arena is constructed with the "fast leave" policy, then each parallel block/phase would have "delayed leave". I understand that it might be perceived as even more confusing, but I just don't quite understand the idea of having an additional possibility for the user to specify a parallel phase that ends with the same arena's workers leave policy. What did the user want to say by this? Why use the "parallel block" API in this case at all? It might be even more confusing.
>
> Since we only have two policies, perhaps it would be better to introduce something like:
>
> ```cpp
> class task_arena {
>     // ... current declarations go here, including the constructor with the new parameter
>     task_arena(/*...*/, workers_leave wl = workers_leave::delayed);
>
>     // Denote a parallel phase that has the alternative (in this case "fast leave")
>     // workers behavior. If the arena was initialized with the "fast leave" setting,
>     // then such an alternative phase will have the "delayed leave" behavior.
>     void alternative_parallel_phase_begin();
>     void alternative_parallel_phase_end();
>
>     // or even (in addition?)
>     template <typename F>
>     void alternative_parallel_phase(F&& functor);
> };
> ```
>
> If demand appears later, other parameters could be added to these functions.

> **Contributor:** It seems to me that this discussion is not just about API semantics but really about a different architecture, where each parallel block/stage might have its own customizable retention policy. It differs significantly from what is proposed, so I think it needs deeper elaboration, perhaps with new state change diagrams etc.

> **@akukanov (Nov 27, 2024):** More on this:
>
> > I just don't quite understand the idea of having additional possibility for the user to specify a parallel phase that ends with the same arena's workers leave policy.
>
> The primary point of a parallel phase is not to set a certain leave policy when the phase ends (for which it would be sufficient to have a single "switch the state" method). The parallel phase allows using a distinct retention policy during the phase, for example, to prolong the default busy-wait duration or to utilize different heuristics. That is, it does not switch between "fast" and "delayed" but introduces a third possible state of thread retention.
>
> Once all initiated parallel phases end, the retention policy returns, according to the proposed design, to the state set at arena construction. However, the use case for threads to leave as soon as possible still remains. For that reason, the extra argument at the end of the block is useful to indicate this "one-time fast leave" request.
>
> Hope that helps.

> **Contributor:** With the arena's workers_leave behavior and scoped_parallel_block both specified in the constructors, this change in behavior set at the end of a parallel block looks inconsistent.
>
> Would it be better to have this setting specified at the start of a parallel block rather than at its end?

> **Contributor:** I think it would make the API harder from a usability standpoint. The user would need to somehow link this parameter from the start of the block to the end of the block.

```cpp
    class scoped_parallel_block {
        scoped_parallel_block(task_arena& ta, bool set_one_time_fast_leave = false);
    };
}; // class task_arena

namespace this_task_arena {
    void start_parallel_block();
    void end_parallel_block(bool set_one_time_fast_leave = false);
}
```

By the contract, users should indicate the end of the _parallel block_ for each
preceding start of the _parallel block_.<br>
Let's introduce an RAII scoped object that will help to manage the contract.

If the end of the parallel block is not indicated by the user, it will be done automatically when
the last public reference is removed from the arena (i.e., the task_arena is destroyed or a thread
is joined for an implicit arena). This ensures that correctness is
preserved (threads will not be retained forever).
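With the proposed `scoped_parallel_block`, pairing the start and the end of the block follows
from RAII automatically; a usage sketch (the computation inside is an arbitrary example):

```cpp
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>

void compute(tbb::task_arena& arena) {
    // Proposed API sketch: the constructor indicates the start of the
    // parallel block; the destructor indicates its end and, because of
    // the flag, requests the one-time fast leave.
    tbb::task_arena::scoped_parallel_block block(arena, /*set_one_time_fast_leave=*/true);
    arena.execute([] {
        oneapi::tbb::parallel_for(0, 1000, [](int) { /* work item */ });
    });
} // the end of the parallel block is indicated here, even on exception
```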

## Considerations

Alternative approaches were also considered.<br>
We can express this state machine as a complete graph and provide a low-level interface that
gives control over state transitions.

<img src="alternative_proposal.png" width=600>

We considered this approach too low-level. Plus, it leaves a question: "How should concurrent changes of the state be managed?"

Retaining worker threads should be implemented with care because
it might introduce performance problems if:
* Threads cannot migrate to another arena because they are
retained in the current arena.
* Compute resources are not homogeneous, e.g., the CPU is hybrid.
Heavier involvement of less performant core types might result in artificial work
imbalance in the arena.


## Open Questions in Design

Some open questions that remain:
* Are the suggested APIs sufficient?
* Are there additional use cases that should be considered that we missed in our analysis?
* Do we see any value in an arena potentially transitioning from one state to another?
* What if different types of workloads are mixed in one application?
* What if there are concurrent calls to this API?
Comment on lines +198 to +199

> **Contributor:** See my comment above about making the approach a bit more generic. Essentially, I think we can write something like "implementation-defined" in the case of concurrent calls to this API. However, it seems to me that the behavior should be kind of relaxed, so to say: if there is at least one "delayed leave" request happening concurrently with possibly a number of "fast leave" requests, then the "delayed leave" policy prevails.
>
> Also, having the request stated up front allows the scheduler to know the runtime situation earlier, hence making better decisions about the optimality of the workers' behavior.
