Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

static_reduction_map #98

Closed
wants to merge 75 commits into from
Closed

Conversation

sleeepyjack
Copy link
Collaborator

@sleeepyjack sleeepyjack commented Aug 4, 2021

This is an extension to PR #82 and closes #58

Adds a new class called static_reduction_map.

When inserting a key/value pair, static_reduction_map performs an aggregation operation between the newly inserted payload and the existing value in the map. The slots in the map are initialized such that the identity value of the aggregation is the initial value of a slot's payload.

The following functionality has been added

  1. CUDA stream support
  2. Sync with current dev branch.
  3. Unit tests
  4. Exponential backoff strategy for CAS loop based custom_op functor. [WIP]
  5. Benchmarks for insert bulk operation
  6. Reduce-by-key benchmarks including a comparison against CUB and Thrust.

Reduce-by-key benchmark results

In this benchmark scenario, we generate 100'000'000 uniformly distributed key-value pairs, where each distinct key has a multiplicity of m, i.e. each key occurs on average m times in the input data. The task is to sum up all values associated to the same key, where the input data, as well as the result reside in the GPU's global memory space.
Note that for our hash-based implementation (CUCO) we included two measurements with different target load factors (50% and 80%).

NVIDIA Tesla V100 32GB

4+4 byte key/value pairs

rbk_uniform_distribution_i32_v100

8+8 byte key/value pairs

rbk_uniform_distribution_i64_v100

NVIDIA Tesla A100 40GB

4+4 byte key/value pairs

rbk_uniform_distribution_i32_a100

8+8 byte key/value pairs

rbk_uniform_distribution_i64_a100

Copied existing static_map files and just renamed
all references to static_map to static_reduction_map.
We need to return a bool so we can keep track of
how many unique keys were inserted in a bulk
insert.
The mapped value is updated in the case of a new
insert or updating an existing key, but we need to
track if the insert was the first time that key was
inserted.
benchmarks/utils.hpp Outdated Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! It's a large PR and I just had a quick look over examples, tests and benchmarks. Will look into implementations shortly.

include/cuco/detail/traits.hpp Outdated Show resolved Hide resolved
tests/CMakeLists.txt Outdated Show resolved Hide resolved
@sleeepyjack
Copy link
Collaborator Author

Thanks for the great work! It's a large PR and I just had a quick look over examples, tests and benchmarks. Will look into implementations shortly.

Thanks so much for the review so far! And I have to apologize for the unnecessary large merge commit. I just wanted it done as quickly as possible so you guys don't have to wait for it to get merged. I will incorporate the requested changes in the next couple of days.

include/cuco/detail/reduction_ops.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_reduction_map_kernels.cuh Outdated Show resolved Hide resolved
@sleeepyjack sleeepyjack changed the title [REVIEW] static_reduction_map (extension to PR #82) static_reduction_map (extension to PR #82) Mar 21, 2022
Comment on lines +31 to +34
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11000) && defined(__CUDA_ARCH__) && \
(__CUDA_ARCH__ >= 700)
#define CUCO_HAS_CUDA_BARRIER
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: We should make a detail/__config file for this kind of thing.

@@ -59,10 +59,10 @@ function(ConfigureNVBench BENCH_NAME)
add_executable(${BENCH_NAME} ${ARGN})
set_target_properties(${BENCH_NAME} PROPERTIES
POSITION_INDEPENDENT_CODE ON
RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/nvbenchmarks")
RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/benchmarks"
COMPILE_FLAGS -DNVBENCH_MODULE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the NVBENCH_MODULE definition?

Copy link
Collaborator Author

@sleeepyjack sleeepyjack Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to reuse the key_generator.hppfor both gbench and nvbench setups. See:

#if defined(NVBENCH_MODULE)
#include <nvbench/nvbench.cuh>

NVBENCH_DECLARE_ENUM_TYPE_STRINGS(
  // Enum type:
  dist_type,
  // Callable to generate input strings:
  // Short identifier used for tables, command-line args, etc.
  // Used when context is available to figure out the enum type.
  [](dist_type d) {
    switch (d) {
      case dist_type::GAUSSIAN: return "GAUSSIAN";
      case dist_type::GEOMETRIC: return "GEOMETRIC";
      case dist_type::UNIFORM: return "UNIFORM";
      case dist_type::UNIQUE: return "UNIQUE";
      case dist_type::SAME: return "SAME";
      default: return "ERROR";
    }
  },
  // Callable to generate descriptions:
  // If non-empty, these are used in `--list` to describe values.
  // Used when context may not be available to figure out the type from the
  // input string.
  // Just use `[](auto) { return std::string{}; }` if you don't want these.
  [](auto) { return std::string{}; })
#endif

here: https://github.com/sleeepyjack/cuCollections/blob/5f244292990dbde9d5311d28ede72e74803250ac/benchmarks/key_generator.hpp#L25

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another way of detecting if nvbench is included? I initially thought I could use the include guard definition but nvbench uses #pragma once iirc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. This is fine then. I'd suggest renaming to CUCO_USING_NVBENCH.

Comment on lines +54 to +55
* pairs that reduces the values associated to the same key according to a
* functor.
Copy link
Collaborator

@jrhemstad jrhemstad Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* pairs that reduces the values associated to the same key according to a
* functor.
* pairs where insertion aggregates the values associated to the same key according to a
* binary reduction operator.

* individual threads.
* @tparam Allocator Type of allocator used for device storage
*/
template <typename ReductionOp,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do something to enforce requirements on ReductionOp. Basically, it needs to be one of the operators provided in reduction_ops.cuh or if a custom operation, needs to use custom_op.

Copy link
Collaborator Author

@sleeepyjack sleeepyjack Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a member tag aka an empty struct cuco::tags::reduction_op? See e1361a3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I was thinking. A person has 3 options for the ReductionOp

  1. Use one of the provided cuco::reduce_* types.
    • No additional work should be required. Partial specialization could/should remove the ReductionOp argument from the constructor
  2. Provide a unsynchronized binary callable T F(T, T) and Identity value
  • This needs to be wrapped by custom_op to apply F in a CAS loop
  • Ideally we could detect this kind of callable and implicitly wrap it in custom_op
  1. Provide a synchronized binary callable T F(atomic_ref<T, Scope>, T) and Identity value
  • User responsible for correct synchronization through atomic_ref

Examples:

// 1.
// no need to provide `reduce_add{}` 
// No need to provide identity value
cuco::static_reduction_map<cuco::reduce_add<int>, int, int> add_map{capacity, empty_key, alloc}; 

// 2. Unsynchronized binary callable must be wrapped in `custom_op`
struct unsync_add{ 
   int identity = 0; // Must provide identity value
   int operator()(int a, int b){ return a + b; }
};

// internally should wrap `unsync_add` in `custom_op`
cuco::static_reduction_map<unsync_add, int, int> custom_unsync_add_map(capacity, empty_key, unsync_add{}, alloc);

// 3.
stuct sync_add{
   int identity = 0; // Must provide identity value
   template <thread_scope Scope>
   int operator()(atomic_ref<int, Scope> a, int b){ return a.fetch_add(b, memory_order_relaxed); }
};

cuco::static_reduction_map<sync_add, int, int> custom_sync_add_map(capacity, empty_key, sync_add{}, alloc);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 & 3 could effectively be merged.

One thing that occurred to me is that the identity value need not be known statically. Not sure what kind of binop would have a runtime determined identity value, but who knows?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a member tag aka an empty struct cuco::tags::reduction_op?

@sleeepyjack I'd prefer to create a base class: https://godbolt.org/z/6KqenYenT

Copy link
Collaborator Author

@sleeepyjack sleeepyjack Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad @PointKernel
Re-examining this question again:

I am a bit puzzled about how to distinguish between cases 1) and 3) as it involves extracting the type of the first argument of the operator() and check if it is atomic_ref<T> or just T 🙃. Maybe something like this?:

template <typename> struct first_arg;

template <typename F, typename A, typename... Args>
struct first_arg<F(A, Args...)>
{
   using type = A;
};

template <typename T>
using first_arg_t = typename first_arg<T>::type;

Also, implicitly switching between sync and non-sync implementations may lead to confusion on the user side.

How about defining a common base class for all built-in (synchronizing) functors. If a user passes a functor that doesn't inherit from this base, it is automatically wrapped in e.g. a CAS loop.
This way we put the user in charge of deciding whether the functor needs synchronization or not.
Additionally, we could use CRTP to add some convenient type checks to the base class.

Let me know what you think.

Comment on lines +111 to +131
namespace detail {
template <typename Key, typename Value>
struct slot_to_tuple {
template <typename S>
__device__ thrust::tuple<Key, Value> operator()(S const& s)
{
return thrust::tuple<Key, Value>(s.first, s.second);
}
};

template <typename Key>
struct slot_is_filled {
Key empty_key_sentinel;
template <typename S>
__device__ bool operator()(S const& s)
{
return thrust::get<0>(s) != empty_key_sentinel;
}
};
} // namespace detail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace detail {
template <typename Key, typename Value>
struct slot_to_tuple {
template <typename S>
__device__ thrust::tuple<Key, Value> operator()(S const& s)
{
return thrust::tuple<Key, Value>(s.first, s.second);
}
};
template <typename Key>
struct slot_is_filled {
Key empty_key_sentinel;
template <typename S>
__device__ bool operator()(S const& s)
{
return thrust::get<0>(s) != empty_key_sentinel;
}
};
} // namespace detail

@sleeepyjack This can be removed since I've moved them to detail/utils.cuh in #150

@jrhemstad jrhemstad added the Needs Review Awaiting reviews before merging label May 19, 2022
@jrhemstad jrhemstad changed the title static_reduction_map (extension to PR #82) static_reduction_map May 19, 2022
@jrhemstad
Copy link
Collaborator

@sleeepyjack to work on breaking this up into smaller PRs to make it easier to review.

@sleeepyjack
Copy link
Collaborator Author

Superseeded by #515

@sleeepyjack sleeepyjack closed this Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review Awaiting reviews before merging topic: build CMake build issue topic: performance Performance related issue type: feature request New feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Hash map design that supports reduce by key
7 participants