static_reduction_map #98
Copied the existing static_map files and renamed all references from static_map to static_reduction_map, including the derived types.
We need to return a bool so we can keep track of how many unique keys were inserted during a bulk insert. The mapped value is updated both when inserting a new key and when updating an existing one, but we need to track whether the insert was the first time that key was inserted.
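To make the intent concrete, here is a minimal sketch of counting unique keys during a bulk insert via the returned `bool`. The names are hypothetical: `View` and `Pair` stand in for the map's mutable device view and pair type, which this PR is still shaping.

```cuda
#include <cstddef>

// `View` stands in for a device_mutable_view-like handle whose insert()
// returns true only on the first insertion of a key.
template <typename View, typename Pair>
__global__ void bulk_insert_count(View view,
                                  Pair const* pairs,
                                  std::size_t n,
                                  unsigned long long* num_unique)
{
  std::size_t const tid = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x;
  if (tid < n && view.insert(pairs[tid])) {
    atomicAdd(num_unique, 1ULL);  // count first-time insertions only
  }
}
```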
Thanks for the great work! It's a large PR and I just had a quick look over examples, tests and benchmarks. Will look into implementations shortly.
Thanks so much for the review so far! And I have to apologize for the unnecessarily large merge commit. I just wanted it done as quickly as possible so you guys don't have to wait for it to get merged. I will incorporate the requested changes in the next couple of days.
```cpp
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11000) && defined(__CUDA_ARCH__) && \
  (__CUDA_ARCH__ >= 700)
#define CUCO_HAS_CUDA_BARRIER
#endif
```
Note to self: We should make a `detail/__config` file for this kind of thing.
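For illustration, a minimal sketch of what such a hypothetical header could collect, reusing the macro above (the path and name are placeholders, not an actual cuco file):

```cpp
// cuco/detail/__config (hypothetical): one central place for feature-test macros.
#pragma once

// CUDA barriers require CUDA 11+ and compute capability 7.0+.
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11000) && \
  defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
#define CUCO_HAS_CUDA_BARRIER
#endif
```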
```diff
@@ -59,10 +59,10 @@ function(ConfigureNVBench BENCH_NAME)
   add_executable(${BENCH_NAME} ${ARGN})
   set_target_properties(${BENCH_NAME} PROPERTIES
                         POSITION_INDEPENDENT_CODE ON
-                        RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/nvbenchmarks")
+                        RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/benchmarks"
+                        COMPILE_FLAGS -DNVBENCH_MODULE)
```
What is the `NVBENCH_MODULE` definition?
The idea was to reuse the `key_generator.hpp` for both gbench and nvbench setups. See:
```cpp
#if defined(NVBENCH_MODULE)
#include <nvbench/nvbench.cuh>

NVBENCH_DECLARE_ENUM_TYPE_STRINGS(
  // Enum type:
  dist_type,
  // Callable to generate input strings:
  // Short identifier used for tables, command-line args, etc.
  // Used when context is available to figure out the enum type.
  [](dist_type d) {
    switch (d) {
      case dist_type::GAUSSIAN: return "GAUSSIAN";
      case dist_type::GEOMETRIC: return "GEOMETRIC";
      case dist_type::UNIFORM: return "UNIFORM";
      case dist_type::UNIQUE: return "UNIQUE";
      case dist_type::SAME: return "SAME";
      default: return "ERROR";
    }
  },
  // Callable to generate descriptions:
  // If non-empty, these are used in `--list` to describe values.
  // Used when context may not be available to figure out the type from the
  // input string.
  // Just use `[](auto) { return std::string{}; }` if you don't want these.
  [](auto) { return std::string{}; })
#endif
```
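For context, a hedged sketch of the shared-header layout this enables (the `key_generator.hpp` contents here are illustrative, not the actual file from this PR):

```cpp
// key_generator.hpp: key distribution shared by both benchmark setups.
#pragma once

enum class dist_type { GAUSSIAN, GEOMETRIC, UNIFORM, UNIQUE, SAME };

// nvbench-only registration compiles when the nvbench targets define
// NVBENCH_MODULE (see the CMake change above); the gbench targets include
// this header without that definition and simply skip this block.
#if defined(NVBENCH_MODULE)
#include <nvbench/nvbench.cuh>
// NVBENCH_DECLARE_ENUM_TYPE_STRINGS(dist_type, ...) as shown above
#endif
```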
Is there another way of detecting if nvbench is included? I initially thought I could use the include guard definition, but nvbench uses `#pragma once` iirc.
Ah, I see. This is fine then. I'd suggest renaming to `CUCO_USING_NVBENCH`.
```cpp
 * pairs that reduces the values associated to the same key according to a
 * functor.
```
```diff
- * pairs that reduces the values associated to the same key according to a
- * functor.
+ * pairs where insertion aggregates the values associated to the same key according to a
+ * binary reduction operator.
```
```cpp
 * individual threads.
 * @tparam Allocator Type of allocator used for device storage
 */
template <typename ReductionOp,
```
We should do something to enforce requirements on `ReductionOp`. Basically, it needs to be one of the operators provided in `reduction_ops.cuh` or, if a custom operation, needs to use `custom_op`.
How about a member tag aka an empty struct `cuco::tags::reduction_op`? See e1361a3
Here's what I was thinking. A person has 3 options for the `ReductionOp`:

1. Use one of the provided `cuco::reduce_*` types.
   - No additional work should be required. Partial specialization could/should remove the `ReductionOp` argument from the constructor.
2. Provide an unsynchronized binary callable `T F(T, T)` and `Identity` value.
   - This needs to be wrapped by `custom_op` to apply `F` in a CAS loop.
   - Ideally we could detect this kind of callable and implicitly wrap it in `custom_op`.
3. Provide a synchronized binary callable `T F(atomic_ref<T, Scope>, T)` and `Identity` value.
   - User responsible for correct synchronization through `atomic_ref`.

Examples:
```cpp
// 1.
// No need to provide `reduce_add{}`
// No need to provide identity value
cuco::static_reduction_map<cuco::reduce_add<int>, int, int> add_map{capacity, empty_key, alloc};

// 2. Unsynchronized binary callable must be wrapped in `custom_op`
struct unsync_add {
  int identity = 0;  // Must provide identity value
  int operator()(int a, int b) { return a + b; }
};
// Internally should wrap `unsync_add` in `custom_op`
cuco::static_reduction_map<unsync_add, int, int> custom_unsync_add_map(capacity, empty_key, unsync_add{}, alloc);

// 3. Synchronized binary callable: user handles synchronization via `atomic_ref`
struct sync_add {
  int identity = 0;  // Must provide identity value
  template <thread_scope Scope>
  int operator()(atomic_ref<int, Scope> a, int b) { return a.fetch_add(b, memory_order_relaxed); }
};
cuco::static_reduction_map<sync_add, int, int> custom_sync_add_map(capacity, empty_key, sync_add{}, alloc);
```
1 & 3 could effectively be merged.
One thing that occurred to me is that the identity value need not be known statically. Not sure what kind of binop would have a runtime determined identity value, but who knows?
> How about a member tag aka an empty struct cuco::tags::reduction_op?

@sleeepyjack I'd prefer to create a base class: https://godbolt.org/z/6KqenYenT
@jrhemstad @PointKernel Re-examining this question again:

I am a bit puzzled about how to distinguish between cases 1) and 3), as it involves extracting the type of the first argument of the `operator()` and checking whether it is `atomic_ref<T>` or just `T` 🙃. Maybe something like this?
template <typename> struct first_arg;
template <typename F, typename A, typename... Args>
struct first_arg<F(A, Args...)>
{
using type = A;
};
template <typename T>
using first_arg_t = typename first_arg<T>::type;
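For reference, applied to a plain function type the helper picks out the first parameter type; the catch is that a functor with a templated `operator()` (case 3) exposes no single signature to match, which is exactly the difficulty described here:

```cpp
#include <type_traits>

// Works on function types...
static_assert(std::is_same_v<first_arg_t<int(float, double)>, float>);
// ...but for a functor one would have to go through decltype(&F::operator()),
// and a templated operator() cannot be introspected that way.
```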
Also, implicitly switching between sync and non-sync implementations may lead to confusion on the user side.

How about defining a common base class for all built-in (synchronizing) functors? If a user passes a functor that doesn't inherit from this base, it is automatically wrapped in e.g. a CAS loop. This way we put the user in charge of deciding whether the functor needs synchronization or not. Additionally, we could use CRTP to add some convenient type checks to the base class. Let me know what you think.
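A minimal sketch of that base-class idea, with hypothetical names (`reduction_op_base`, `cas_wrapper`) rather than the final API:

```cpp
#include <type_traits>

// Hypothetical CRTP base for all built-in (synchronizing) functors.
template <typename Derived>
struct reduction_op_base {
  // CRTP gives a natural place for static type checks on Derived.
};

// A built-in op handles its own synchronization, so it derives from the base.
template <typename T>
struct reduce_add : reduction_op_base<reduce_add<T>> {
  // device-side fetch_add-based implementation elided
};

// Hypothetical wrapper that applies an unsynchronized callable in a CAS loop.
template <typename Op>
struct cas_wrapper {
  Op op;  // compare-and-swap loop around op() elided
};

// Wrap only those ops that do NOT derive from the base.
template <typename Op>
using effective_op_t =
  std::conditional_t<std::is_base_of_v<reduction_op_base<Op>, Op>,
                     Op,                // synchronizing op: used as-is
                     cas_wrapper<Op>>;  // plain callable: wrapped in a CAS loop
```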
```cpp
namespace detail {

template <typename Key, typename Value>
struct slot_to_tuple {
  template <typename S>
  __device__ thrust::tuple<Key, Value> operator()(S const& s)
  {
    return thrust::tuple<Key, Value>(s.first, s.second);
  }
};

template <typename Key>
struct slot_is_filled {
  Key empty_key_sentinel;
  template <typename S>
  __device__ bool operator()(S const& s)
  {
    return thrust::get<0>(s) != empty_key_sentinel;
  }
};

}  // namespace detail
```
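Presumably these helpers back a retrieve-all style extraction; a hedged sketch of how they could compose with Thrust (`copy_filled_slots` is a hypothetical name, not part of this PR):

```cpp
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>

// View the slot storage as (key, value) tuples and copy out only the
// filled ones, skipping slots that still hold the empty-key sentinel.
template <typename Key, typename Value, typename SlotIt, typename OutIt>
OutIt copy_filled_slots(SlotIt slots_begin, SlotIt slots_end, OutIt out, Key empty_key_sentinel)
{
  auto begin = thrust::make_transform_iterator(slots_begin, detail::slot_to_tuple<Key, Value>{});
  auto end   = thrust::make_transform_iterator(slots_end, detail::slot_to_tuple<Key, Value>{});
  return thrust::copy_if(begin, end, out, detail::slot_is_filled<Key>{empty_key_sentinel});
}
```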
@sleeepyjack This can be removed since I've moved them to `detail/utils.cuh` in #150
@sleeepyjack to work on breaking this up into smaller PRs to make it easier to review.
Superseded by #515
This is an extension to PR #82 and closes #58.

The following functionality has been added:

- Rebased onto the current `dev` branch.
- Custom reduction operations via the `custom_op` functor. [WIP]
- Tracking of newly inserted (unique) keys in the `insert` bulk operation.

Reduce-by-key benchmark results
In this benchmark scenario, we generate 100'000'000 uniformly distributed key-value pairs, where each distinct key has a multiplicity of m, i.e., each key occurs on average m times in the input data. The task is to sum up all values associated with the same key, where the input data as well as the result reside in the GPU's global memory.

Note that for our hash-based implementation (CUCO) we included two measurements with different target load factors (50% and 80%).
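For orientation, a hypothetical sketch of the benchmarked task, reusing the constructor pattern from the examples earlier in this thread (the pair type and bulk-insert signature are assumptions, as the API is still WIP):

```cpp
#include <thrust/device_vector.h>
#include <cstddef>

// Sum all values associated with the same key, entirely on the device.
void reduce_by_key_sum(thrust::device_vector<cuco::pair_type<int, int>> const& pairs)
{
  auto const capacity = static_cast<std::size_t>(pairs.size() / 0.5);  // ~50% target load factor
  int constexpr empty_key_sentinel = -1;  // must not occur among the input keys
  cuco::static_reduction_map<cuco::reduce_add<int>, int, int> map{capacity, empty_key_sentinel};
  map.insert(pairs.begin(), pairs.end());  // insertion aggregates values per key
  // Retrieving the (key, sum) pairs afterwards is part of the proposed API.
}
```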
NVIDIA Tesla V100 32GB: 4+4 byte and 8+8 byte key/value pairs [benchmark plots]
NVIDIA Tesla A100 40GB: 4+4 byte and 8+8 byte key/value pairs [benchmark plots]