
Initial sender/receiver adaptations for tile algorithms #371

Merged · 20 commits · Nov 9, 2021

Conversation

@msimberg (Collaborator) commented May 11, 2021

This is not meant to be anywhere near complete. It mostly contains a bunch of TODOs and notes to myself about what needs to be, or can be, adapted where. However, I'm opening this to start exposing others to the sender/receiver interfaces, and to collect feedback on what sort of utilities/interfaces/functionality is useful/acceptable/harmful in your opinion.

I've changed the gemm task in one of the triangular solver implementations to use senders/receivers and it passes tests.

I see two primary benefits to using senders/receivers in DLAF:

  • More flexible sender/receiver customization points should allow us to remove some of the hacks currently in place, e.g. for tile lifetime management, which with senders/receivers can be handled in real customization points rather than our ad-hoc dataflow_finalize.
  • Potential optimizations when chaining work (I think there are actually few, if any, cases where this applies in DLAF), and potentially fewer heap allocations (although dataflow already does a decent job of this).

Short summary of the "sender adapters" that appear here:

  • template <Backend B, typename Sender> auto transform(Sender&& sender) {...}: this is a backend-specific version of future::then(executor, ...)/dataflow(executor, ...)/execution::transform (the last one is from P1897). The reason it does not take an executor/scheduler/policy or similar is that the current proposals don't really talk about the kind of customization we want to do (inserting cuda streams etc.) and what is there is semantically a bit shaky/bound to change at least for this use case. The custom CUDA/MPI executors are already a bit questionable.
  • keep_future: currently the default behaviour of a future as a sender is to get the value from the future and pass that to the continuation. keep_future is an alternative which passes the future itself, similar to how future::then takes a callable which takes a future as its argument. We might swap the default to be the keep_future behaviour. This functionality is mostly intended for the transition phase from futures to senders, and should not be needed in the end.
  • when_all_lift: P1897 proposes a when_all that takes a pack of senders and triggers a continuation when all the input senders are done. However, it takes only senders, which means that if one has ready values one would have to wrap them in execution::just (this is equivalent to how one would have to do dataflow(f, make_ready_future(3)) if dataflow only supported futures as arguments). It would be nice if the proposed when_all was a bit more flexible in this regard (including accepting void senders), and we'll follow up on that. If the standard when_all ends up accepting only senders, this is still a useful extension for HPX/DLAF.

@msimberg force-pushed the sender-receiver-adaptation branch from 87cf65f to 34ffb5d on June 3, 2021 15:26
@msimberg commented Jun 3, 2021

bors try

bors bot added a commit that referenced this pull request Jun 3, 2021
bors bot commented Jun 3, 2021

try

Build failed:

@msimberg commented Jun 15, 2021

Still a draft, but I've updated a few things based on some of the things I presented yesterday.

I've introduced a policy class, templated on the backend. The policy is meant to be similar to an HPX execution policy, but currently isn't one (that's not necessary at the moment, and it's not for use with HPX parallel algorithms). In the future it could become a proper execution policy, or be replaced by an HPX execution policy. The reason for making it a proper class, rather than just the Backend enum, is that it can hold properties such as priorities. Since the executor concept in the latest P0443 is very simplified compared to the executors we're currently using, I'm not using the new executors for anything here.

I've also added a few more overloads to the tile algorithms (trsm, gemm, etc.). That means that they currently have the following overloads:

  • plain blocking CPU overload dispatching to blaspp/lapackpp
  • plain asynchronous GPU overload dispatching to cuBLAS/cuSOLVER, which takes a stream/handle
  • an overload that takes a policy and is blocking (this is equivalent to e.g. hpx::for_loop(hpx::execution::par, ...))
  • an overload that takes a policy and one sender and returns a sender (somewhat close to hpx::dataflow(hpx::for_loop, hpx::execution::par, ...) except it has lazy semantics; with senders it would be used e.g. as hpx::sync_wait(hpx::for_loop(hpx::execution::par, hpx::execution::when_all(...))))
  • an overload that takes only a policy (this is only for use with operator|, e.g. hpx::execution::when_all(...) | hpx::for_loop(hpx::execution::par) | hpx::execution::sync_wait())

With these changes I think the first two overloads could be moved to internal, and the trsm_o etc. callable objects would never show up in the algorithms (they can be kept around for convenience, but still be internal).

Overall, I quite like this direction. It aligns well with where I think the standard library is possibly headed. The policy overloads are also higher level than the dataflow + executor versions, and give us room to change things below those overloads (either because we want to or because the proposals change). The higher-level ideas are unlikely to change as much. We've also started looking at adding exactly these kinds of overloads to the HPX parallel algorithms.

I think from an algorithm-writer's perspective the most important thing is how e.g. this looks: https://github.com/msimberg/DLA-Future/blob/e578e959819142021f12439414a30d7aad193e78/include/dlaf/solver/triangular/impl.h#L39-L42. If this looks like a decent direction I'll start tidying up loose ends (for this iteration...).

Let me know what you think (including @aurianer, @biddisco).

@albestro (Collaborator) commented:

I think from an algorithm-writer's perspective the most important thing is how e.g. this looks: https://github.com/msimberg/DLA-Future/blob/e578e959819142021f12439414a30d7aad193e78/include/dlaf/solver/triangular/impl.h#L39-L42. If this looks like a decent direction I'll start tidying up loose ends (for this iteration...).

I like how the code looks; I think it is quite clear and easy to read.
For me it's still a long way to fully understanding all these things in detail, so maybe I'm not very helpful in commenting on the internal mechanisms, but from the user perspective they're really nice.

Thanks for the effort!

@msimberg

bors try

bors bot added a commit that referenced this pull request Jul 12, 2021
bors bot commented Jul 12, 2021

try

Build failed:

@msimberg

bors try

@msimberg

bors try

bors bot commented Jul 13, 2021

try

Already running a review

bors bot added a commit that referenced this pull request Jul 13, 2021
bors bot commented Jul 13, 2021

try

Build failed:

@msimberg

bors try

bors bot commented Jul 13, 2021

try

Merge conflict.

@msimberg

bors try

bors bot added a commit that referenced this pull request Jul 13, 2021
bors bot commented Jul 13, 2021

try

Build failed:

@msimberg force-pushed the sender-receiver-adaptation branch 2 times, most recently from d0c2340 to b6f3370 on July 14, 2021 09:25
@msimberg

bors try

@msimberg

bors try

bors bot commented Jul 14, 2021

try

Already running a review

@msimberg commented Nov 8, 2021

bors try

bors bot added a commit that referenced this pull request Nov 8, 2021
@msimberg commented Nov 8, 2021

I've added (back in many cases) the priorities as explicit arguments to helper functions.

To summarize, the algorithms that have been senderified in this PR are:

  • solver/triangular
  • factorization/cholesky
  • eigensolver/gen_to_std
  • copy for Matrix

As a prerequisite all tile algorithms used in the above matrix algorithms have overloads that work with senders and policies.

@msimberg commented Nov 8, 2021

bors try

bors bot commented Nov 8, 2021

try

Already running a review

@msimberg commented Nov 8, 2021

bors try-

@teonnik (Contributor) left a comment

LGTM!

@msimberg commented Nov 8, 2021

bors try

bors bot added a commit that referenced this pull request Nov 8, 2021
bors bot commented Nov 8, 2021

try

Build succeeded:

@rasolca (Collaborator) left a comment

@msimberg Are you going to try @albestro's suggestion (https://godbolt.org/z/6TcGvr7e8, i.e. delete instead of an unimplemented macro), or is the PR ready to merge?

@msimberg commented Nov 9, 2021

@msimberg Are you going to try @albestro's suggestion (https://godbolt.org/z/6TcGvr7e8, i.e. delete instead of an unimplemented macro), or is the PR ready to merge?

See my response here: #371 (comment). I tried it in a standalone setting and in principle it does the right thing. However, my preference is still the static_assert, because it lets us put a pretty obvious error message there.

@rasolca merged commit 89ca6a0 into eth-cscs:master on Nov 9, 2021
@msimberg deleted the sender-receiver-adaptation branch on November 9, 2021 17:04
@teonnik mentioned this pull request Nov 10, 2021
@msimberg mentioned this pull request Nov 22, 2021
rasolca pushed a commit that referenced this pull request Nov 22, 2021
…7.X and HPX master (#428)

This allows building DLA-Future with HPX 1.7.X and master (that was not the case with #371). Two reasons:

- In 1.7.X sender and receiver customizations could be done as either member functions (these have priority) or tag_dispatch. On master we've made tag_dispatch the only customization mechanism (simplifies internals).
- In 1.7.X we made the bad decision of renaming tag_invoke to tag_dispatch (because of concerns of us using it in non-intended ways). We realized this was a mistake and renamed it back to tag_invoke on master.

So this PR changes to use tag_invoke/tag_dispatch as the customization mechanism, since that works on both 1.7.X and master. Additionally, it adds a small local compatibility definition for tag_invoke/tag_dispatch so that the correct function name is used depending on the HPX version.

Finally, because of the move to tag_invoke/dispatch the customization for set_value requires SFINAE to not even be considered as an overload for set_error and set_done.
@msimberg mentioned this pull request Dec 6, 2021
@rasolca mentioned this pull request Sep 29, 2023