[FEA] Refactor of open address data structures #110
Comments
Something that came up with @PointKernel in the review of the I also want to be able to disable using the CG algorithms altogether via the
Another improvement we should try: using one CG (cooperative group) to insert/retrieve multiple keys instead of a single key, to amortize the CG overhead.
Status of the latest ideas we have discussed:
Additional host interface functions:
Experimental work in defining these entities:
Some open questions that still need answers:
We could use different signatures for
The latter could implement a WCWS (warp-cooperative work sharing) overload for cases where the key/value types can be shuffled among group ranks.
I'm not sure this solves the problem. Would the semantics of this function be such that each thread in the group is providing a distinct I think a
I don't see the benefit of the former variant, as you can simply call the function in a loop. I was referring to the latter variant: a cooperative group is assigned to a range of keys and implements a "mini" parallel bulk version of the operation. If the data types allow for warp shuffles, we can make use of WCWS, i.e., do a coalesced load from the input range, then iteratively broadcast each loaded datum to all ranks and call the cooperative insert/contains for it.
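The WCWS pattern described in this comment can be sketched on the host as a sequential simulation. This is only an illustration under stated assumptions: every name here (`group_size`, `group_insert`, `group_insert_range`, the `-1` sentinel) is hypothetical and not part of cuco, the loops over `rank` stand in for threads of one cooperative group, and the "broadcast" is a plain variable copy where device code would use a warp shuffle.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t group_size = 4;  // assumed cooperative group size
constexpr int empty_key = -1;          // assumed sentinel for an empty slot

// Cooperative insert of one key: each simulated rank inspects one slot of the
// current window; the group then agrees on the outcome (a ballot on device).
bool group_insert(std::vector<int>& slots, int key)
{
  assert(!slots.empty() && slots.size() % group_size == 0);
  std::size_t const num_windows = slots.size() / group_size;
  std::size_t w = static_cast<std::size_t>(key) % num_windows;
  for (std::size_t probe = 0; probe < num_windows; ++probe) {
    std::size_t first_empty = group_size;  // group_size means "none found"
    for (std::size_t rank = 0; rank < group_size; ++rank) {
      int const seen = slots[w * group_size + rank];
      if (seen == key) { return false; }  // a rank found a duplicate
      if (seen == empty_key && first_empty == group_size) { first_empty = rank; }
    }
    if (first_empty != group_size) {
      slots[w * group_size + first_empty] = key;  // the elected rank writes
      return true;
    }
    w = (w + 1) % num_windows;  // window is full, probe the next one
  }
  return false;  // table is full
}

// One group processes a whole range of keys and returns the insert count.
std::size_t group_insert_range(std::vector<int>& slots, std::vector<int> const& keys)
{
  std::size_t inserted = 0;
  for (std::size_t base = 0; base < keys.size(); base += group_size) {
    // Step 1: coalesced load, rank r reads keys[base + r] if it is in range.
    std::array<int, group_size> loaded{};
    std::array<bool, group_size> valid{};
    for (std::size_t rank = 0; rank < group_size; ++rank) {
      valid[rank]  = base + rank < keys.size();
      loaded[rank] = valid[rank] ? keys[base + rank] : empty_key;
    }
    // Step 2: broadcast each rank's key in turn; the whole group inserts it.
    for (std::size_t src = 0; src < group_size; ++src) {
      if (!valid[src]) { continue; }
      int const broadcast_key = loaded[src];  // a warp shuffle on device
      if (group_insert(slots, broadcast_key)) { ++inserted; }
    }
  }
  return inserted;
}
```

The point of the split is that the input load is coalesced across ranks, while each individual insert still gets the whole group's help when probing a window.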
I want APIs that don't require me to think about launching I just want to launch In other words, I want to keep the simple "one work item per thread" model of work assignment, but benefit from cooperative execution by having threads donate themselves to carrying out another thread's work item.
This is the first PR related to #110. It introduces the concept of:
- New probing scheme via probing iterator
- Array of Windows storage instead of flat storage to better deal with memory bandwidth-bound workload when hash collisions are present
- Dynamic and static extent type for efficient probing
- Mixin to encode concurrent device operators
- Synchronous and asynchronous host bulk APIs

This PR also adds `cuco::static_set` to evaluate the new design. For now, only 2 basic operations, `insert` and `contains`, are supported.

---------

Co-authored-by: Daniel Juenger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
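To illustrate the "array of windows" idea from this PR description, here is a minimal single-threaded sketch. It assumes a fixed window of four `int` slots, a `-1` empty sentinel, and linear probing over windows; the class name `windowed_set_sketch` and all members are hypothetical and do not reflect the actual `cuco::static_set` implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int empty_key = -1;           // assumed sentinel for an empty slot
constexpr std::size_t window_size = 4;  // assumed number of slots per window
using window = std::array<int, window_size>;

class windowed_set_sketch {
 public:
  explicit windowed_set_sketch(std::size_t num_windows) : windows_(num_windows)
  {
    assert(num_windows > 0);
    for (auto& w : windows_) { w.fill(empty_key); }
  }

  // Returns true if the key was newly inserted, false if present or table full.
  bool insert(int key)
  {
    assert(key != empty_key);
    std::size_t w = start_window(key);
    for (std::size_t probe = 0; probe < windows_.size(); ++probe) {
      // A whole window is examined per probing step, so one contiguous
      // load can serve several candidate slots.
      for (int& slot : windows_[w]) {
        if (slot == key) { return false; }
        if (slot == empty_key) {
          slot = key;
          return true;
        }
      }
      w = (w + 1) % windows_.size();  // advance the probing sequence
    }
    return false;
  }

  bool contains(int key) const
  {
    std::size_t w = start_window(key);
    for (std::size_t probe = 0; probe < windows_.size(); ++probe) {
      for (int slot : windows_[w]) {
        if (slot == key) { return true; }
        if (slot == empty_key) { return false; }  // no deletions: stop early
      }
      w = (w + 1) % windows_.size();
    }
    return false;
  }

 private:
  std::size_t start_window(int key) const
  {
    return static_cast<std::size_t>(key) % windows_.size();
  }

  std::vector<window> windows_;
};
```

With flat storage, each colliding key would cost a separate probe; here a collision is first absorbed inside the window that was already loaded.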
Contributes to #110

This PR adds `experimental::static_map` and involves several changes to the existing code:
- Extracts common `open_addressing_impl` and `open_addressing_ref_impl` classes to minimize duplication between the map and set implementations
- Updates the existing code and fixes bugs: invalid type conversion in `attemp_insert`, narrowing conversions inside the probing scheme, documentation improvements, etc.

---------

Co-authored-by: Daniel Jünger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Is your feature request related to a problem? Please describe.
There is a significant amount of redundancy among the `static_map`/`static_multimap`/`static_reduction_map` classes. This is a large maintenance overhead and means optimizations made to one data structure do not translate to the others.
Furthermore, there are several configuration options we'd like to enable, like using AoS vs. SoA, scalar vs. CG operations, etc.
I'd also like to enable adding `static_set` and `static_multiset` classes that could share the same backend.
Describe the solution you'd like
List of things I'd like to address:
- `cuda::atomic` to `cuda::atomic_ref` #183
- `atomic_ref` instead of `atomic`.
- `std::span` with `std::dynamic_extent` to support both dynamic and statically sized capacities.
- `static_set` and `static_multiset`
- `bitwise_equal`
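The list item about supporting both dynamic and statically sized capacities can be illustrated with a small extent type in the spirit of `std::span`'s extent parameter. This is a sketch only: `extent_sketch`, `dynamic_extent_sketch`, and `slot_index` are hypothetical names introduced here, and the code deliberately avoids C++20 so it does not rely on `<span>`.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical names throughout; this mirrors the idea behind std::span's
// extent parameter and is not cuco's actual extent type.
constexpr std::size_t dynamic_extent_sketch = std::numeric_limits<std::size_t>::max();

// Static case: the capacity is part of the type and needs no stored value.
template <std::size_t N = dynamic_extent_sketch>
struct extent_sketch {
  static_assert(N > 0, "capacity must be positive");
  constexpr std::size_t value() const noexcept { return N; }
};

// Dynamic case: the capacity is a runtime value stored in the object.
template <>
struct extent_sketch<dynamic_extent_sketch> {
  explicit extent_sketch(std::size_t n) : value_{n} { assert(n > 0); }
  std::size_t value() const noexcept { return value_; }

 private:
  std::size_t value_;
};

// Maps a hash value to a slot index. With a static extent the divisor is a
// compile-time constant, which gives the compiler room to optimize the modulo.
template <std::size_t N>
std::size_t slot_index(std::size_t hash, extent_sketch<N> capacity) noexcept
{
  return hash % capacity.value();
}
```

Code that takes an `extent_sketch<N>` works unchanged for both cases, so a container could expose one interface and let the caller choose whether the capacity is known at compile time.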
My current thinking is to create an `open_address_impl` class that provides an abstraction for a logical array of "slots" and exposes operations on those slots. All the core logic and switching for things like AoS/SoA, atomic_ref/atomic can/should be implemented in this common impl class.
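As a rough host-side picture of such a slot abstraction, the sketch below stores keys in a logical array of atomic slots and exposes `insert` and `contains` over linear probing. It is an assumption-laden illustration: `open_address_sketch` and its members are hypothetical, `std::atomic` stands in for the `cuda::atomic`/`cuda::atomic_ref` choice discussed in the issue, and none of the AoS/SoA switching is modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>

// Hypothetical sketch; none of these names come from cuco.
template <typename Key, typename Hash = std::hash<Key>>
class open_address_sketch {
 public:
  open_address_sketch(std::size_t capacity, Key empty_sentinel)
    : capacity_{capacity},
      empty_{empty_sentinel},
      slots_{std::make_unique<std::atomic<Key>[]>(capacity)}
  {
    assert(capacity > 0);
    for (std::size_t i = 0; i < capacity_; ++i) {
      slots_[i].store(empty_, std::memory_order_relaxed);
    }
  }

  // Returns true if the key was newly inserted, false if it was already
  // present or the table is full.
  bool insert(Key key)
  {
    assert(!(key == empty_));
    std::size_t idx = Hash{}(key) % capacity_;
    for (std::size_t probe = 0; probe < capacity_; ++probe) {
      Key expected = empty_;
      // Try to claim an empty slot; on failure `expected` holds the occupant.
      if (slots_[idx].compare_exchange_strong(expected, key)) { return true; }
      if (expected == key) { return false; }
      idx = (idx + 1) % capacity_;  // linear probing
    }
    return false;
  }

  bool contains(Key key) const
  {
    std::size_t idx = Hash{}(key) % capacity_;
    for (std::size_t probe = 0; probe < capacity_; ++probe) {
      Key const current = slots_[idx].load();
      if (current == key) { return true; }
      if (current == empty_) { return false; }
      idx = (idx + 1) % capacity_;
    }
    return false;
  }

 private:
  std::size_t capacity_;
  Key empty_;
  std::unique_ptr<std::atomic<Key>[]> slots_;
};
```

A set, map, or multiset front end could then be a thin wrapper that decides what a slot holds and how duplicates are treated, while probing and slot updates live in one place.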