[FEA] Refactor of open address data structures #110

Open
jrhemstad opened this issue Oct 4, 2021 · 8 comments
Labels
helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue type: feature request New feature request

Comments

@jrhemstad (Collaborator) commented Oct 4, 2021

Is your feature request related to a problem? Please describe.

There is a significant amount of redundancy among the static_map/static_multimap/static_reduction_map classes. This creates a large maintenance overhead and means that optimizations made to one data structure do not translate to the others.

Furthermore, there are several configuration options we'd like to enable, like AoS vs. SoA layouts, scalar vs. CG operations, etc.

I'd also like to add static_set and static_multiset classes that could share the same backend.

Describe the solution you'd like

My current thinking is to create an open_address_impl class that provides an abstraction over a logical array of "slots" and exposes operations on those slots. All of the core logic, and the switching for concerns like AoS/SoA and atomic_ref vs. atomic, can and should be implemented in this common impl class.
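To make the factoring concrete, here is a minimal host-side C++ sketch (not the actual cuco implementation; the names open_addressing_impl, aos_storage, and soa_storage are hypothetical) in which the probing logic is written once against a Storage policy that hides the AoS-vs-SoA layout decision:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two interchangeable storage policies exposing the same slot interface.
template <typename Key, typename Value>
struct aos_storage {  // array-of-structs: one vector of pairs
  std::vector<std::pair<Key, Value>> slots;
  explicit aos_storage(std::size_t n) : slots(n) {}
  std::pair<Key, Value> load(std::size_t i) const { return slots[i]; }
  void store(std::size_t i, Key k, Value v) { slots[i] = {k, v}; }
  std::size_t capacity() const { return slots.size(); }
};

template <typename Key, typename Value>
struct soa_storage {  // struct-of-arrays: separate key and value vectors
  std::vector<Key> keys;
  std::vector<Value> values;
  explicit soa_storage(std::size_t n) : keys(n), values(n) {}
  std::pair<Key, Value> load(std::size_t i) const { return {keys[i], values[i]}; }
  void store(std::size_t i, Key k, Value v) { keys[i] = k; values[i] = v; }
  std::size_t capacity() const { return keys.size(); }
};

// The common impl class: insert/find logic lives here exactly once,
// written against the Storage interface (simple linear probing shown).
template <typename Key, typename Value, typename Storage>
class open_addressing_impl {
  Storage storage_;
  Key empty_key_;

 public:
  open_addressing_impl(std::size_t capacity, Key empty_key)
      : storage_(capacity), empty_key_(empty_key) {
    for (std::size_t i = 0; i < capacity; ++i) storage_.store(i, empty_key, Value{});
  }
  bool insert(Key k, Value v) {
    for (std::size_t p = 0; p < storage_.capacity(); ++p) {
      std::size_t i = (static_cast<std::size_t>(k) + p) % storage_.capacity();
      if (storage_.load(i).first == empty_key_) {
        storage_.store(i, k, v);
        return true;
      }
    }
    return false;  // table full
  }
  bool contains(Key k) const {
    for (std::size_t p = 0; p < storage_.capacity(); ++p) {
      std::size_t i = (static_cast<std::size_t>(k) + p) % storage_.capacity();
      auto slot = storage_.load(i);
      if (slot.first == k) return true;
      if (slot.first == empty_key_) return false;  // hit empty slot: absent
    }
    return false;
  }
};
```

A device-side version would additionally parameterize the atomic update mechanism (atomic_ref vs. atomic) behind the same Storage interface, so the impl class never sees the layout or the synchronization choice.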

@jrhemstad (Collaborator, Author)

Something that came up with @PointKernel in the review of static_multimap, which we ultimately decided to defer, was enabling user configuration of "scalar" vs. "vector" loading. I think this should be exposed via the ProbingScheme type by adding an items_per_thread non-type template parameter that controls how many slots are loaded per thread.

I also want to be able to disable the CG algorithms altogether via the ProbingScheme. You can currently set CGSize == 1, but that will be inefficient as it still uses the CG code paths instead of the scalar code paths. We can either specialize for CGSize == 1 or add a special sentinel value for CGSize (like std::numeric_limits<size_t>::max()) that selects the non-CG code paths.
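One way the sentinel idea could look, sketched as plain C++ (hypothetical names; not cuco's actual ProbingScheme): a reserved CGSize value selects the scalar path at compile time via if constexpr, so the CG machinery is never instantiated for scalar probing.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical sentinel: this CGSize value means "no cooperative groups".
inline constexpr std::size_t scalar_probing = std::numeric_limits<std::size_t>::max();

template <std::size_t CGSize>
struct probing_scheme {
  static constexpr bool uses_cg = (CGSize != scalar_probing);
  // Slots covered per probing step: a real cooperative group inspects
  // CGSize slots at once; the scalar path inspects one.
  static constexpr std::size_t slots_per_step = uses_cg ? CGSize : 1;
};

// Compile-time dispatch the impl class could perform: only one branch
// is instantiated, so the scalar path carries no CG overhead.
template <std::size_t CGSize>
constexpr const char* selected_path() {
  if constexpr (probing_scheme<CGSize>::uses_cg) {
    return "cooperative";  // would call the CG code path
  } else {
    return "scalar";  // would call the plain per-thread code path
  }
}
```

Specializing for CGSize == 1 instead would achieve the same effect without a magic value, at the cost of making a group of size 1 unexpressible.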

@PointKernel (Member)

Another improvement we should try: using one CG to insert/retrieve multiple keys instead of a single key to amortize CG overhead.

@jrhemstad (Collaborator, Author) commented Aug 5, 2022

Status of the latest ideas we have discussed:

  • To make it convenient to construct a static_set or other data structures with a variety of configuration parameters, we want to use a variadic constructor pattern that is similar to this example: https://godbolt.org/z/T4oP1nsoW
    • To make this more generic, instead of the get_or_default<T> function looking for a concrete type T, it could look for the first type that satisfies a given concept C, e.g., get_or_default<C>.
  • We want distinct "ProbingScheme" and "Storage" concepts
  • Storage is the thing that manages the slots; a probing scheme tells you which slots to search for a given key.
    • Storage
      • A conceptual list of <key,payload> elements that need not necessarily be contiguous in memory. For example, vector< pair<key,value> > or vector<key> and vector<value> (i.e., AoS vs SoA) could be valid implementations of Storage.
      • Open questions
        • Own the mechanism for atomically updating a given slot
          • Versions of this that return a bool and/or iterator
        • Enable querying if concurrent insert/find operations are possible
        • It should provide some mechanism for loading an immutable "window" of slots. The intention is to provide a standard way of using vector load/store operations.
    • ProbingScheme
      • Given a key k, a ProbingScheme provides a sequence of N potentially non-contiguous locations (or values) [i0, i1, i2, ..., iN, EMPTY_SENTINEL] where, if k exists, it is present in [i0, iN]. Optionally, the sequence may be provided as a set of "windows" that partition the space into 1D windows of a fixed size W:
[ {i0,    i1,    ..., iW},
  {iW+1,  iW+2,  ..., i2W},
  {i2W+1, i2W+2, ..., i3W},
   ...
   {inW+1, inW+2, ..., i(n+1)W} // EMPTY_SENTINEL may appear anywhere in the last window
]
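The variadic-constructor pattern referenced via the godbolt link can be sketched roughly as follows (the names capacity, load_factor, and configured_capacity are hypothetical, not cuco's actual API): get_or_default scans the argument pack for the first value of a requested type and otherwise falls back to a default.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Base case: pack exhausted, return the default.
template <typename T>
T get_or_default(T def) {
  return def;
}

// Recursive case: if the first argument has type T, take it;
// otherwise keep scanning the rest of the pack.
template <typename T, typename First, typename... Rest>
T get_or_default(T def, First&& first, Rest&&... rest) {
  if constexpr (std::is_same_v<std::decay_t<First>, T>) {
    return std::forward<First>(first);
  } else {
    return get_or_default<T>(def, std::forward<Rest>(rest)...);
  }
}

// Hypothetical strongly typed configuration options.
struct capacity { std::size_t value; };
struct load_factor { double value; };

// A container constructor could then accept options in any order,
// each one optional:
template <typename... Options>
std::size_t configured_capacity(Options&&... opts) {
  return get_or_default<capacity>(capacity{1024},
                                  std::forward<Options>(opts)...).value;
}
```

The concept-based generalization mentioned above would replace the std::is_same_v test with a concept check on each argument type, so callers could pass any type satisfying, say, a ProbingScheme concept rather than one concrete type.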

Additional host interface functions:

  • is_present (rename contains)
  • is_absent
  • for_each
  • clear
  • reserve

Experimental work in defining these entities:

@jrhemstad (Collaborator, Author)

Some open questions that still need answers:

  • Where is the CG size/window size defined?

    • Is it specified by the Storage? ProbingScheme?
  • Different flavors of cooperative device-side overloads. E.g., contains:

    1. N threads & 1 key: N threads in group call contains with the exact same k. All N threads cooperate to find the single k. The same result value is returned to all threads.
    2. N threads & N keys: N threads in group call contains each with potentially distinct values for k (k0, k1, ..., kN). All N threads cooperate to find each ki. The returned result is potentially unique to each thread ti depending on the existence of ki.
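The two flavors can be modeled on the host to pin down the semantics (plain C++, no real cooperative groups; the "lanes" are simulated by loops and all names are hypothetical):

```cpp
#include <array>
#include <cstddef>
#include <unordered_set>

// Flavor 1: N lanes, 1 key. Every lane works on the same key; the
// group cooperates on the single probe and the result is broadcast,
// so all lanes receive the same value.
template <std::size_t N>
std::array<bool, N> contains_one_key(const std::unordered_set<int>& set, int key) {
  bool found = set.count(key) > 0;  // cooperative probe, modeled serially
  std::array<bool, N> per_lane{};
  per_lane.fill(found);             // same result returned to all lanes
  return per_lane;
}

// Flavor 2: N lanes, N keys. Lane i supplies key k_i; the group
// cooperates on each k_i in turn, and lane i keeps the result for its
// own key, so results are potentially distinct per lane.
template <std::size_t N>
std::array<bool, N> contains_n_keys(const std::unordered_set<int>& set,
                                    const std::array<int, N>& keys) {
  std::array<bool, N> per_lane{};
  for (std::size_t i = 0; i < N; ++i) {
    per_lane[i] = set.count(keys[i]) > 0;  // group cooperates on keys[i]
  }
  return per_lane;
}
```

In device code the distinction matters for the return convention: flavor 1 can return one value uniformly across the group, while flavor 2 must route each result back to the lane that supplied the corresponding key.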

@sleeepyjack (Collaborator) commented Aug 9, 2022

Different flavors of cooperative device-side overloads. E.g., contains

We could use different signatures for

  1. contains(CG g, key_type key, ..)
  2. contains(CG g, KeyIter first, KeyIter last, ..)

The latter could implement a WCWS (warp-cooperative work-sharing) overload for cases where the key/value types can be shuffled among group ranks.
If the input range is wider than the group (notice this also implements the device bulk API idea we had), we would need to write the results to an output range instead of placing them as return values of the function.

@jrhemstad (Collaborator, Author)

contains(CG g, KeyIter first, KeyIter last, ..)

I'm not sure this solves the problem. Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?

I think a contains function that takes an iterator range is orthogonal to the cooperative semantics of the function.

@sleeepyjack (Collaborator)

Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?

I don't see the benefit of the former variant, as you could simply call the function in a loop. I was referring to the latter: a cooperative group is assigned to a range of keys, implementing a "mini" parallel bulk version of the operation. If the data types allow for warp shuffles, we can use WCWS, i.e., do a coalesced load from the input range, then iteratively broadcast each loaded datum to all ranks and call the cooperative insert/contains.
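A host-side model of this mini-bulk loop makes the access pattern concrete (plain C++, no actual warp shuffles; all names are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Model of the warp-cooperative work-sharing loop: the group performs a
// coalesced load of one key per rank, then broadcasts each loaded key in
// turn so all ranks cooperate on probing it. Because the input range can
// be wider than the group, results go to an output range rather than
// being returned per call.
std::vector<bool> group_bulk_contains(const std::unordered_set<int>& set,
                                      const std::vector<int>& keys,
                                      std::size_t group_size) {
  std::vector<bool> out(keys.size());
  for (std::size_t base = 0; base < keys.size(); base += group_size) {
    // "Coalesced load": rank r holds keys[base + r].
    std::size_t end = std::min(base + group_size, keys.size());
    for (std::size_t j = base; j < end; ++j) {
      // "Broadcast" keys[j] to all ranks; the whole group then
      // cooperates on the probe (modeled serially here).
      out[j] = set.count(keys[j]) > 0;
    }
  }
  return out;
}
```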

@jrhemstad (Collaborator, Author)

I want APIs that don't require me to think about launching input_size * CG_size number of threads and wrangle the CG throughout the whole kernel.

I just want to launch input_size threads and then create an ad hoc CG to do a cooperative insert/find.

In other words, I want to keep the simple "one work item per thread" model of work assignment, but benefit from cooperative execution by threads donating themselves to carry out another thread's work item.

PointKernel added a commit that referenced this issue Apr 6, 2023
This is the first PR related to #110.

It introduces the concept of:

- New probing scheme via probing iterator
- Array-of-Windows storage instead of flat storage to better handle memory bandwidth-bound workloads when hash collisions are present
- Dynamic and static extent type for efficient probing
- Mixin to encode concurrent device operators
- Synchronous and asynchronous host bulk APIs

This PR also adds `cuco::static_set` to evaluate the new design. For
now, only 2 basic operations, `insert` and `contains`, are supported.

---------

Co-authored-by: Daniel Juenger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PointKernel added a commit that referenced this issue Jun 23, 2023
Contributes to #110 

This PR adds `experimental::static_map` and involves several changes to
the existing code:

- Extracts common `open_addressing_impl` and `open_addressing_ref_impl`
classes to minimize duplicates between map and set implementations
- Updates the existing code and fixes bugs: an invalid type conversion in `attemp_insert`, narrowing conversions inside the probing scheme, doc improvements, etc.

---------

Co-authored-by: Daniel Jünger <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PointKernel added a commit that referenced this issue Jun 27, 2023
Contributes to #110 
Depends on #314 

This PR:
- deprecates `cuco::pair_type` alias
- fixes issues with `cuco::make_pair`
- separates `pair` declarations and implementation details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment