static_reduction_map #98
Copied the existing static_map files and renamed all references from static_map to static_reduction_map, including the derived types.
We need to return a bool so we can keep track of how many unique keys were inserted during a bulk insert. The mapped value is updated both when inserting a new key and when updating an existing one, but we need to track whether the insert was the first time that key was inserted.
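To make the intent concrete, here is a minimal sketch of counting unique keys during a bulk insert via the returned `bool`. The names are hypothetical: `View` and `Pair` stand in for the map's mutable device view and pair type, which this PR is still shaping.

```cuda
#include <cstddef>

// `View` stands in for a device_mutable_view-like handle whose insert()
// returns true only on the first insertion of a key.
template <typename View, typename Pair>
__global__ void bulk_insert_count(View view,
                                  Pair const* pairs,
                                  std::size_t n,
                                  unsigned long long* num_unique)
{
  std::size_t const tid = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x;
  if (tid < n && view.insert(pairs[tid])) {
    atomicAdd(num_unique, 1ULL);  // count first-time insertions only
  }
}
```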
Thanks for the great work! It's a large PR and I just had a quick look over examples, tests and benchmarks. Will look into implementations shortly.
Thanks so much for the review so far! And I have to apologize for the unnecessarily large merge commit. I just wanted it done as quickly as possible so you guys don't have to wait for it to get merged. I will incorporate the requested changes in the next couple of days.
```cpp
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11000) && defined(__CUDA_ARCH__) && \
  (__CUDA_ARCH__ >= 700)
#define CUCO_HAS_CUDA_BARRIER
#endif
```
Note to self: We should make a `detail/__config` file for this kind of thing.
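For illustration, a minimal sketch of what such a hypothetical header could collect, reusing the macro above (the path and name are placeholders, not an actual cuco file):

```cpp
// cuco/detail/__config (hypothetical): one central place for feature-test macros.
#pragma once

// CUDA barriers require CUDA 11+ and compute capability 7.0+.
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11000) && \
  defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
#define CUCO_HAS_CUDA_BARRIER
#endif
```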
```diff
@@ -59,10 +59,10 @@ function(ConfigureNVBench BENCH_NAME)
   add_executable(${BENCH_NAME} ${ARGN})
   set_target_properties(${BENCH_NAME} PROPERTIES
                         POSITION_INDEPENDENT_CODE ON
-                        RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/nvbenchmarks")
+                        RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/benchmarks"
+                        COMPILE_FLAGS -DNVBENCH_MODULE)
```
What is the `NVBENCH_MODULE` definition?
The idea was to reuse the `key_generator.hpp` for both gbench and nvbench setups. See:
```cpp
#if defined(NVBENCH_MODULE)
#include <nvbench/nvbench.cuh>

NVBENCH_DECLARE_ENUM_TYPE_STRINGS(
  // Enum type:
  dist_type,
  // Callable to generate input strings:
  // Short identifier used for tables, command-line args, etc.
  // Used when context is available to figure out the enum type.
  [](dist_type d) {
    switch (d) {
      case dist_type::GAUSSIAN: return "GAUSSIAN";
      case dist_type::GEOMETRIC: return "GEOMETRIC";
      case dist_type::UNIFORM: return "UNIFORM";
      case dist_type::UNIQUE: return "UNIQUE";
      case dist_type::SAME: return "SAME";
      default: return "ERROR";
    }
  },
  // Callable to generate descriptions:
  // If non-empty, these are used in `--list` to describe values.
  // Used when context may not be available to figure out the type from the
  // input string.
  // Just use `[](auto) { return std::string{}; }` if you don't want these.
  [](auto) { return std::string{}; })
#endif
```
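For context, a hedged sketch of the shared-header layout this enables (the `key_generator.hpp` contents here are illustrative, not the actual file from this PR):

```cpp
// key_generator.hpp: key distribution shared by both benchmark setups.
#pragma once

enum class dist_type { GAUSSIAN, GEOMETRIC, UNIFORM, UNIQUE, SAME };

// nvbench-only registration compiles when the nvbench targets define
// NVBENCH_MODULE (see the CMake change above); the gbench targets include
// this header without that definition and simply skip this block.
#if defined(NVBENCH_MODULE)
#include <nvbench/nvbench.cuh>
// NVBENCH_DECLARE_ENUM_TYPE_STRINGS(dist_type, ...) as shown above
#endif
```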
Is there another way of detecting if nvbench is included? I initially thought I could use the include guard definition, but nvbench uses `#pragma once` iirc.
Ah, I see. This is fine then. I'd suggest renaming to `CUCO_USING_NVBENCH`.
```cpp
 * pairs that reduces the values associated to the same key according to a
 * functor.
```
```diff
- * pairs that reduces the values associated to the same key according to a
- * functor.
+ * pairs where insertion aggregates the values associated to the same key according to a
+ * binary reduction operator.
```
```cpp
 * individual threads.
 * @tparam Allocator Type of allocator used for device storage
 */
template <typename ReductionOp,
```
We should do something to enforce requirements on `ReductionOp`. Basically, it needs to be one of the operators provided in `reduction_ops.cuh` or, if a custom operation, needs to use `custom_op`.
How about a member tag aka an empty struct `cuco::tags::reduction_op`? See e1361a3
Here's what I was thinking. A person has 3 options for the `ReductionOp`:

1. Use one of the provided `cuco::reduce_*` types.
   - No additional work should be required. Partial specialization could/should remove the `ReductionOp` argument from the constructor.
2. Provide an unsynchronized binary callable `T F(T, T)` and `Identity` value.
   - This needs to be wrapped by `custom_op` to apply `F` in a CAS loop.
   - Ideally we could detect this kind of callable and implicitly wrap it in `custom_op`.
3. Provide a synchronized binary callable `T F(atomic_ref<T, Scope>, T)` and `Identity` value.
   - User responsible for correct synchronization through `atomic_ref`.

Examples:
```cpp
// 1.
// No need to provide `reduce_add{}`
// No need to provide identity value
cuco::static_reduction_map<cuco::reduce_add<int>, int, int> add_map{capacity, empty_key, alloc};

// 2. Unsynchronized binary callable must be wrapped in `custom_op`
struct unsync_add {
  int identity = 0;  // Must provide identity value
  int operator()(int a, int b) { return a + b; }
};
// Internally should wrap `unsync_add` in `custom_op`
cuco::static_reduction_map<unsync_add, int, int> custom_unsync_add_map(capacity, empty_key, unsync_add{}, alloc);

// 3. Synchronized binary callable: user handles synchronization via `atomic_ref`
struct sync_add {
  int identity = 0;  // Must provide identity value
  template <thread_scope Scope>
  int operator()(atomic_ref<int, Scope> a, int b) { return a.fetch_add(b, memory_order_relaxed); }
};
cuco::static_reduction_map<sync_add, int, int> custom_sync_add_map(capacity, empty_key, sync_add{}, alloc);
```
1 & 3 could effectively be merged.
One thing that occurred to me is that the identity value need not be known statically. Not sure what kind of binop would have a runtime determined identity value, but who knows?
> How about a member tag aka an empty struct cuco::tags::reduction_op?

@sleeepyjack I'd prefer to create a base class: https://godbolt.org/z/6KqenYenT
@jrhemstad @PointKernel Re-examining this question again:

I am a bit puzzled about how to distinguish between cases 1) and 3), as it involves extracting the type of the first argument of the `operator()` and checking whether it is `atomic_ref<T>` or just `T` 🙃. Maybe something like this?
template <typename> struct first_arg;
template <typename F, typename A, typename... Args>
struct first_arg<F(A, Args...)>
{
using type = A;
};
template <typename T>
using first_arg_t = typename first_arg<T>::type;
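For reference, applied to a plain function type the helper picks out the first parameter type; the catch is that a functor with a templated `operator()` (case 3) exposes no single signature to match, which is exactly the difficulty described here:

```cpp
#include <type_traits>

// Works on function types...
static_assert(std::is_same_v<first_arg_t<int(float, double)>, float>);
// ...but for a functor one would have to go through decltype(&F::operator()),
// and a templated operator() cannot be introspected that way.
```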
Also, implicitly switching between sync and non-sync implementations may lead to confusion on the user side.

How about defining a common base class for all built-in (synchronizing) functors? If a user passes a functor that doesn't inherit from this base, it is automatically wrapped in e.g. a CAS loop. This way we put the user in charge of deciding whether the functor needs synchronization or not. Additionally, we could use CRTP to add some convenient type checks to the base class. Let me know what you think.
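A minimal sketch of that base-class idea, with hypothetical names (`reduction_op_base`, `cas_wrapper`) rather than the final API:

```cpp
#include <type_traits>

// Hypothetical CRTP base for all built-in (synchronizing) functors.
template <typename Derived>
struct reduction_op_base {
  // CRTP gives a natural place for static type checks on Derived.
};

// A built-in op handles its own synchronization, so it derives from the base.
template <typename T>
struct reduce_add : reduction_op_base<reduce_add<T>> {
  // device-side fetch_add-based implementation elided
};

// Hypothetical wrapper that applies an unsynchronized callable in a CAS loop.
template <typename Op>
struct cas_wrapper {
  Op op;  // compare-and-swap loop around op() elided
};

// Wrap only those ops that do NOT derive from the base.
template <typename Op>
using effective_op_t =
  std::conditional_t<std::is_base_of_v<reduction_op_base<Op>, Op>,
                     Op,                // synchronizing op: used as-is
                     cas_wrapper<Op>>;  // plain callable: wrapped in a CAS loop
```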
```cpp
namespace detail {

template <typename Key, typename Value>
struct slot_to_tuple {
  template <typename S>
  __device__ thrust::tuple<Key, Value> operator()(S const& s)
  {
    return thrust::tuple<Key, Value>(s.first, s.second);
  }
};

template <typename Key>
struct slot_is_filled {
  Key empty_key_sentinel;
  template <typename S>
  __device__ bool operator()(S const& s)
  {
    return thrust::get<0>(s) != empty_key_sentinel;
  }
};

}  // namespace detail
```
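Presumably these helpers back a retrieve-all style extraction; a hedged sketch of how they could compose with Thrust (`copy_filled_slots` is a hypothetical name, not part of this PR):

```cpp
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>

// View the slot storage as (key, value) tuples and copy out only the
// filled ones, skipping slots that still hold the empty-key sentinel.
template <typename Key, typename Value, typename SlotIt, typename OutIt>
OutIt copy_filled_slots(SlotIt slots_begin, SlotIt slots_end, OutIt out, Key empty_key_sentinel)
{
  auto begin = thrust::make_transform_iterator(slots_begin, detail::slot_to_tuple<Key, Value>{});
  auto end   = thrust::make_transform_iterator(slots_end, detail::slot_to_tuple<Key, Value>{});
  return thrust::copy_if(begin, end, out, detail::slot_is_filled<Key>{empty_key_sentinel});
}
```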
@sleeepyjack This can be removed since I've moved them to `detail/utils.cuh` in #150
@sleeepyjack to work on breaking this up into smaller PRs to make it easier to review.
Superseded by #515
This is an extension to PR #82 and closes #58.

The following functionality has been added:

- Rebased onto the current `dev` branch.
- Custom reduction operations via the `custom_op` functor. [WIP]
- Tracking of newly inserted (unique) keys in the `insert` bulk operation.

Reduce-by-key benchmark results
In this benchmark scenario, we generate 100'000'000 uniformly distributed key-value pairs, where each distinct key has a multiplicity of m, i.e., each key occurs on average m times in the input data. The task is to sum up all values associated with the same key, where the input data as well as the result reside in the GPU's global memory.

Note that for our hash-based implementation (CUCO) we included two measurements with different target load factors (50% and 80%).
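For orientation, a hypothetical sketch of the benchmarked task, reusing the constructor pattern from the examples earlier in this thread (the pair type and bulk-insert signature are assumptions, as the API is still WIP):

```cpp
#include <thrust/device_vector.h>
#include <cstddef>

// Sum all values associated with the same key, entirely on the device.
void reduce_by_key_sum(thrust::device_vector<cuco::pair_type<int, int>> const& pairs)
{
  auto const capacity = static_cast<std::size_t>(pairs.size() / 0.5);  // ~50% target load factor
  int constexpr empty_key_sentinel = -1;  // must not occur among the input keys
  cuco::static_reduction_map<cuco::reduce_add<int>, int, int> map{capacity, empty_key_sentinel};
  map.insert(pairs.begin(), pairs.end());  // insertion aggregates values per key
  // Retrieving the (key, sum) pairs afterwards is part of the proposed API.
}
```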
NVIDIA Tesla V100 32GB: 4+4 byte and 8+8 byte key/value pairs [benchmark plots]
NVIDIA Tesla A100 40GB: 4+4 byte and 8+8 byte key/value pairs [benchmark plots]