[FEA] Refactor of open address data structures #110

Open
jrhemstad opened this issue Oct 4, 2021 · 8 comments
Labels
helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue type: feature request New feature request

Comments

@jrhemstad (Collaborator) commented Oct 4, 2021

Is your feature request related to a problem? Please describe.

There is a significant amount of redundancy among the static_map/static_multimap/static_reduction_map classes. This creates a large maintenance overhead and means that optimizations made to one data structure do not translate to the others.

Furthermore, there are several configuration options we'd like to enable, like AoS vs. SoA layouts, scalar vs. CG operations, etc.

I'd also like to add static_set and static_multiset classes that could share the same backend.

Describe the solution you'd like

My current thinking is to create an open_address_impl class that provides an abstraction over a logical array of "slots" and exposes operations on those slots. All of the core logic, and the switching for concerns like AoS/SoA and atomic_ref vs. atomic, can and should be implemented in this common impl class.
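To make the factoring concrete, here is a minimal host-side C++ sketch (not the actual cuco implementation; the names open_addressing_impl, aos_storage, and soa_storage are hypothetical) in which the probing logic is written once against a Storage policy that hides the AoS-vs-SoA layout decision:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two interchangeable storage policies exposing the same slot interface.
template <typename Key, typename Value>
struct aos_storage {  // array-of-structs: one vector of pairs
  std::vector<std::pair<Key, Value>> slots;
  explicit aos_storage(std::size_t n) : slots(n) {}
  std::pair<Key, Value> load(std::size_t i) const { return slots[i]; }
  void store(std::size_t i, Key k, Value v) { slots[i] = {k, v}; }
  std::size_t capacity() const { return slots.size(); }
};

template <typename Key, typename Value>
struct soa_storage {  // struct-of-arrays: separate key and value vectors
  std::vector<Key> keys;
  std::vector<Value> values;
  explicit soa_storage(std::size_t n) : keys(n), values(n) {}
  std::pair<Key, Value> load(std::size_t i) const { return {keys[i], values[i]}; }
  void store(std::size_t i, Key k, Value v) { keys[i] = k; values[i] = v; }
  std::size_t capacity() const { return keys.size(); }
};

// The common impl class: insert/find logic lives here exactly once,
// written against the Storage interface (simple linear probing shown).
template <typename Key, typename Value, typename Storage>
class open_addressing_impl {
  Storage storage_;
  Key empty_key_;

 public:
  open_addressing_impl(std::size_t capacity, Key empty_key)
      : storage_(capacity), empty_key_(empty_key) {
    for (std::size_t i = 0; i < capacity; ++i) storage_.store(i, empty_key, Value{});
  }
  bool insert(Key k, Value v) {
    for (std::size_t p = 0; p < storage_.capacity(); ++p) {
      std::size_t i = (static_cast<std::size_t>(k) + p) % storage_.capacity();
      if (storage_.load(i).first == empty_key_) {
        storage_.store(i, k, v);
        return true;
      }
    }
    return false;  // table full
  }
  bool contains(Key k) const {
    for (std::size_t p = 0; p < storage_.capacity(); ++p) {
      std::size_t i = (static_cast<std::size_t>(k) + p) % storage_.capacity();
      auto slot = storage_.load(i);
      if (slot.first == k) return true;
      if (slot.first == empty_key_) return false;  // hit empty slot: absent
    }
    return false;
  }
};
```

A device-side version would additionally parameterize the atomic update mechanism (atomic_ref vs. atomic) behind the same Storage interface, so the impl class never sees the layout or the synchronization choice.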

@jrhemstad (Collaborator, Author)

Something that came up with @PointKernel in the review of static_multimap, which we ultimately decided to defer, was enabling user configuration of "scalar" vs. "vector" loading. I think this should be exposed via the ProbingScheme type by adding an items_per_thread non-type template parameter that controls how many slots are loaded per thread.

I also want to be able to disable the CG algorithms altogether via the ProbingScheme. You can currently set CGSize == 1, but that will be inefficient as it still uses the CG code paths instead of the scalar code paths. We can either specialize for CGSize == 1 or add a special sentinel value for CGSize (like std::numeric_limits<size_t>::max()) that selects the non-CG code paths.
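One way the sentinel idea could look, sketched as plain C++ (hypothetical names; not cuco's actual ProbingScheme): a reserved CGSize value selects the scalar path at compile time via if constexpr, so the CG machinery is never instantiated for scalar probing.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical sentinel: this CGSize value means "no cooperative groups".
inline constexpr std::size_t scalar_probing = std::numeric_limits<std::size_t>::max();

template <std::size_t CGSize>
struct probing_scheme {
  static constexpr bool uses_cg = (CGSize != scalar_probing);
  // Slots covered per probing step: a real cooperative group inspects
  // CGSize slots at once; the scalar path inspects one.
  static constexpr std::size_t slots_per_step = uses_cg ? CGSize : 1;
};

// Compile-time dispatch the impl class could perform: only one branch
// is instantiated, so the scalar path carries no CG overhead.
template <std::size_t CGSize>
constexpr const char* selected_path() {
  if constexpr (probing_scheme<CGSize>::uses_cg) {
    return "cooperative";  // would call the CG code path
  } else {
    return "scalar";  // would call the plain per-thread code path
  }
}
```

Specializing for CGSize == 1 instead would achieve the same effect without a magic value, at the cost of making a group of size 1 unexpressible.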

@PointKernel (Member)

Another improvement we should try: using one CG to insert/retrieve multiple keys instead of a single key to amortize CG overhead.

@jrhemstad (Collaborator, Author) commented Aug 5, 2022

Status of the latest ideas we have discussed:

  • To make it convenient to construct a static_set or other data structures with a variety of configuration parameters, we want to use a variadic constructor pattern that is similar to this example: https://godbolt.org/z/T4oP1nsoW
    • To make this more generic, instead of the get_or_default<T> function looking for a concrete type T, it could look for the first type that satisfies a given concept C, e.g., get_or_default<C>.
  • We want distinct "ProbingScheme" and "Storage" concepts
  • Storage is the thing that manages the slots; a probing scheme tells you which slots to search for a given key.
    • Storage
      • A conceptual list of <key,payload> elements that need not necessarily be contiguous in memory. For example, vector< pair<key,value> > or vector<key> and vector<value> (i.e., AoS vs SoA) could be valid implementations of Storage.
      • Open questions
        • Own the mechanism for atomically updating a given slot
          • Versions of this that return a bool and/or iterator
        • Enable querying if concurrent insert/find operations are possible
        • It should provide some mechanism for loading an immutable "window" of slots. The intention is to provide a standard way of using vector load/store operations.
    • ProbingScheme
      • Given a key k, a ProbingScheme provides a sequence of N potentially non-contiguous locations (or values) [i0, i1, i2, ..., iN, EMPTY_SENTINEL] where, if k exists, it is present in [i0, iN]. Optionally, the sequence may be provided as a set of "windows" that partition the space into 1D windows of a fixed size W:
[ {i0,    i1,    ..., iW},
  {iW+1,  iW+2,  ..., i2W},
  {i2W+1, i2W+2, ..., i3W},
   ...
   {inW+1, inW+2, ..., i(n+1)W} // EMPTY_SENTINEL may appear anywhere in the last window
]
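The variadic-constructor pattern referenced via the godbolt link can be sketched roughly as follows (the names capacity, load_factor, and configured_capacity are hypothetical, not cuco's actual API): get_or_default scans the argument pack for the first value of a requested type and otherwise falls back to a default.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Base case: pack exhausted, return the default.
template <typename T>
T get_or_default(T def) {
  return def;
}

// Recursive case: if the first argument has type T, take it;
// otherwise keep scanning the rest of the pack.
template <typename T, typename First, typename... Rest>
T get_or_default(T def, First&& first, Rest&&... rest) {
  if constexpr (std::is_same_v<std::decay_t<First>, T>) {
    return std::forward<First>(first);
  } else {
    return get_or_default<T>(def, std::forward<Rest>(rest)...);
  }
}

// Hypothetical strongly typed configuration options.
struct capacity { std::size_t value; };
struct load_factor { double value; };

// A container constructor could then accept options in any order,
// each one optional:
template <typename... Options>
std::size_t configured_capacity(Options&&... opts) {
  return get_or_default<capacity>(capacity{1024},
                                  std::forward<Options>(opts)...).value;
}
```

The concept-based generalization mentioned above would replace the std::is_same_v test with a concept check on each argument type, so callers could pass any type satisfying, say, a ProbingScheme concept rather than one concrete type.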

Additional host interface functions:

  • is_present (rename contains)
  • is_absent
  • for_each
  • clear
  • reserve

Experimental work in defining these entities:

@jrhemstad (Collaborator, Author)

Some open questions that still need answers:

  • Where is the CG size/window size defined?

    • Is it specified by the Storage? ProbingScheme?
  • Different flavors of cooperative device-side overloads. E.g., contains:

    1. N threads & 1 key: N threads in group call contains with the exact same k. All N threads cooperate to find the single k. The same result value is returned to all threads.
    2. N threads & N keys: N threads in group call contains each with potentially distinct values for k (k0, k1, ..., kN). All N threads cooperate to find each ki. The returned result is potentially unique to each thread ti depending on the existence of ki.
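The two flavors can be modeled on the host to pin down the semantics (plain C++, no real cooperative groups; the "lanes" are simulated by loops and all names are hypothetical):

```cpp
#include <array>
#include <cstddef>
#include <unordered_set>

// Flavor 1: N lanes, 1 key. Every lane works on the same key; the
// group cooperates on the single probe and the result is broadcast,
// so all lanes receive the same value.
template <std::size_t N>
std::array<bool, N> contains_one_key(const std::unordered_set<int>& set, int key) {
  bool found = set.count(key) > 0;  // cooperative probe, modeled serially
  std::array<bool, N> per_lane{};
  per_lane.fill(found);             // same result returned to all lanes
  return per_lane;
}

// Flavor 2: N lanes, N keys. Lane i supplies key k_i; the group
// cooperates on each k_i in turn, and lane i keeps the result for its
// own key, so results are potentially distinct per lane.
template <std::size_t N>
std::array<bool, N> contains_n_keys(const std::unordered_set<int>& set,
                                    const std::array<int, N>& keys) {
  std::array<bool, N> per_lane{};
  for (std::size_t i = 0; i < N; ++i) {
    per_lane[i] = set.count(keys[i]) > 0;  // group cooperates on keys[i]
  }
  return per_lane;
}
```

In device code the distinction matters for the return convention: flavor 1 can return one value uniformly across the group, while flavor 2 must route each result back to the lane that supplied the corresponding key.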

@sleeepyjack (Collaborator) commented Aug 9, 2022

Different flavors of cooperative device-side overloads. E.g., contains

We could use different signatures for

  1. contains(CG g, key_type key, ..)
  2. contains(CG g, KeyIter first, KeyIter last, ..)

The latter could implement a WCWS (warp-cooperative work-sharing) overload for cases where the key/value types can be shuffled among group ranks.
If the input range is wider than the group (notice this also implements the device bulk API idea we had), we would need to write the results to an output range instead of placing them as return values of the function.

@jrhemstad (Collaborator, Author)

contains(CG g, KeyIter first, KeyIter last, ..)

I'm not sure this solves the problem. Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?

I think a contains function that takes an iterator range is orthogonal to the cooperative semantics of the function.

@sleeepyjack (Collaborator)

Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?

I don't see the benefit of the former variant, as you could simply call the function in a loop. I was referring to the latter: a cooperative group is assigned to a range of keys, implementing a "mini" parallel bulk version of the operation. If the data types allow for warp shuffles, we can use WCWS, i.e., do a coalesced load from the input range, then iteratively broadcast each loaded datum to all ranks and call the cooperative insert/contains.
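A host-side model of this mini-bulk loop makes the access pattern concrete (plain C++, no actual warp shuffles; all names are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Model of the warp-cooperative work-sharing loop: the group performs a
// coalesced load of one key per rank, then broadcasts each loaded key in
// turn so all ranks cooperate on probing it. Because the input range can
// be wider than the group, results go to an output range rather than
// being returned per call.
std::vector<bool> group_bulk_contains(const std::unordered_set<int>& set,
                                      const std::vector<int>& keys,
                                      std::size_t group_size) {
  std::vector<bool> out(keys.size());
  for (std::size_t base = 0; base < keys.size(); base += group_size) {
    // "Coalesced load": rank r holds keys[base + r].
    std::size_t end = std::min(base + group_size, keys.size());
    for (std::size_t j = base; j < end; ++j) {
      // "Broadcast" keys[j] to all ranks; the whole group then
      // cooperates on the probe (modeled serially here).
      out[j] = set.count(keys[j]) > 0;
    }
  }
  return out;
}
```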

@jrhemstad (Collaborator, Author)

I want APIs that don't require me to think about launching input_size * CG_size number of threads and wrangle the CG throughout the whole kernel.

I just want to launch input_size threads and then create an ad hoc CG to do a cooperative insert/find.

In other words, I want to keep the simple "one work item per thread" model of work assignment, but benefit from cooperative execution by threads donating themselves to carry out another thread's work item.

PointKernel added a commit that referenced this issue Apr 6, 2023
This is the first PR related to #110.

It introduces the concept of:

- New probing scheme via probing iterator
- Array-of-Windows storage instead of flat storage to better handle memory bandwidth-bound workloads when hash collisions are present
- Dynamic and static extent type for efficient probing
- Mixin to encode concurrent device operators
- Synchronous and asynchronous host bulk APIs

This PR also adds `cuco::static_set` to evaluate the new design. For
now, only 2 basic operations, `insert` and `contains`, are supported.

---------

Co-authored-by: Daniel Juenger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PointKernel added a commit that referenced this issue Jun 23, 2023
Contributes to #110 

This PR adds `experimental::static_map` and involves several changes to
the existing code:

- Extracts common `open_addressing_impl` and `open_addressing_ref_impl`
classes to minimize duplicates between map and set implementations
- Updates the existing code and fixes bugs: an invalid type conversion in `attemp_insert`, narrowing conversions inside the probing scheme, doc improvements, etc.

---------

Co-authored-by: Daniel Jünger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PointKernel added a commit that referenced this issue Jun 27, 2023
Contributes to #110 
Depends on #314 

This PR:
- deprecates `cuco::pair_type` alias
- fixes issues with `cuco::make_pair`
- separates `pair` declarations and implementation details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment