Optimize compaction operations #10030

Merged · 55 commits · Feb 2, 2022
Changes from 26 commits
2f33e04
Rename existing compaction APIs
PointKernel Jan 12, 2022
3212706
Merge remote-tracking branch 'upstream/branch-22.02' into optimize-co…
PointKernel Jan 12, 2022
7ce9549
Update cython code to accommodate renaming
PointKernel Jan 12, 2022
5c5a415
Update copyrights
PointKernel Jan 12, 2022
0a62ade
Refactor unordered_distinct_count with hash-based algorithms
PointKernel Jan 12, 2022
e6acd8f
Merge remote-tracking branch 'upstream/branch-22.02' into optimize-co…
PointKernel Jan 12, 2022
fba851c
Refactor unordered_drop_duplicates with hash-based algorithms
PointKernel Jan 13, 2022
05ee85f
Update cython code
PointKernel Jan 13, 2022
8ab22a4
Optimize distinct count: insert valid rows only if nulls are equal
PointKernel Jan 13, 2022
bba7b57
Fill column via mutable view + update comments
PointKernel Jan 13, 2022
6746f28
Minor corrections
PointKernel Jan 13, 2022
70292bc
Update benchmarks and unit tests
PointKernel Jan 13, 2022
46d83b9
Add reminder for further optimization in distinct count
PointKernel Jan 13, 2022
f07d3d0
Fix transform test failure
PointKernel Jan 13, 2022
8fcfbae
Fix dictionary test failures
PointKernel Jan 13, 2022
a8fe478
Merge remote-tracking branch 'upstream/branch-22.04' into optimize-co…
PointKernel Jan 13, 2022
f2ac25d
Add sort-based implementations back to the repo
PointKernel Jan 14, 2022
0ed5712
Update copyright
PointKernel Jan 14, 2022
e372144
Add consecutive distinct_count
PointKernel Jan 14, 2022
fc57b29
Remove nan control in distinct_count
PointKernel Jan 15, 2022
d810e0b
Update unit tests
PointKernel Jan 15, 2022
40cc410
Add nan handling to distinct_count + update unit tests
PointKernel Jan 17, 2022
dd91e64
Rename drop_duplicates as sort_and_drop_duplicates
PointKernel Jan 17, 2022
d80911c
Add consecutive drop_duplicates
PointKernel Jan 17, 2022
7ced995
Optimize unordered_distinct_count: insert non-null rows only to impro…
PointKernel Jan 17, 2022
2ea5d8e
Update cuco git tag
PointKernel Jan 17, 2022
4bb7b16
Silence unused argument warning via function prototyping
PointKernel Jan 18, 2022
012ca8b
Refactor compaction benchmark with nvbench
PointKernel Jan 18, 2022
d489e2e
Update copyright
PointKernel Jan 18, 2022
3e47ffd
Get rid of nvbench primitive types
PointKernel Jan 19, 2022
a0a10e5
Update docs & comments
PointKernel Jan 19, 2022
5fb92c7
Address review comments
PointKernel Jan 19, 2022
3af4fd0
Address more review comments
PointKernel Jan 19, 2022
a587511
Split tests
PointKernel Jan 19, 2022
b062eb5
Use null masks in tests
PointKernel Jan 19, 2022
a5f881f
Split benchmarks
PointKernel Jan 19, 2022
df36e77
Fix a bug + update tests
PointKernel Jan 21, 2022
20ed6ea
Update docs
PointKernel Jan 21, 2022
e401690
Merge remote-tracking branch 'upstream/branch-22.04' into optimize-co…
PointKernel Jan 21, 2022
ecc1d7e
Add should_check_nan predicate to avoid unnecessary type-dispatching
PointKernel Jan 21, 2022
a151443
Rename benchmark according to benchmarking guide
PointKernel Jan 21, 2022
024d7e0
Remove std::unique-like drop_duplicates
PointKernel Jan 24, 2022
b6c1634
Style fixing
PointKernel Jan 24, 2022
3ad0f76
Fix test failures: sort the output
PointKernel Jan 24, 2022
58f6cb6
Minor cleanups
PointKernel Jan 24, 2022
e381815
Minor cleanup
PointKernel Jan 24, 2022
fa796aa
Address review comments
PointKernel Jan 24, 2022
3915134
Merge remote-tracking branch 'upstream/branch-22.04' into optimize-co…
PointKernel Jan 25, 2022
118468e
Address review comments
PointKernel Jan 27, 2022
d1535d5
Simplify if logic
PointKernel Jan 27, 2022
0b0d015
Minor updates
PointKernel Jan 27, 2022
906f469
Add early exit
PointKernel Jan 27, 2022
c8a3e87
Fix cuco pair issues with the latest cuco tag
PointKernel Jan 28, 2022
070d5ce
Address review comments
PointKernel Feb 2, 2022
a60c128
Address review + update comments
PointKernel Feb 2, 2022
66 changes: 50 additions & 16 deletions cpp/benchmarks/stream_compaction/drop_duplicates_benchmark.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -27,9 +27,14 @@

class Compaction : public cudf::benchmark {
};
class HashCompaction : public cudf::benchmark {
};

enum class algorithm { SORT_BASED, HASH_BASED };

template <typename Type>
void BM_compaction(benchmark::State& state, cudf::duplicate_keep_option keep)
template <typename Type, algorithm Algo>
void BM_compaction(benchmark::State& state,
cudf::duplicate_keep_option keep = cudf::duplicate_keep_option::KEEP_FIRST)
{
auto const n_rows = static_cast<cudf::size_type>(state.range(0));

@@ -45,34 +50,63 @@ void BM_compaction(benchmark::State& state, cudf::duplicate_keep_option keep)

for (auto _ : state) {
cuda_event_timer timer(state, true);
auto result = cudf::drop_duplicates(input_table, {0}, keep);
auto const result = [&]() {
if constexpr (Algo == algorithm::HASH_BASED) {
return cudf::unordered_drop_duplicates(input_table, {0});
} else {
return cudf::sort_and_drop_duplicates(input_table, {0}, keep);
}
}();
}
}

#define concat(a, b, c) a##b##c
#define get_keep(op) cudf::duplicate_keep_option::KEEP_##op

// TYPE, OP
#define RBM_BENCHMARK_DEFINE(name, type, keep) \
BENCHMARK_DEFINE_F(Compaction, name)(::benchmark::State & state) \
{ \
BM_compaction<type>(state, get_keep(keep)); \
} \
BENCHMARK_REGISTER_F(Compaction, name) \
->UseManualTime() \
->Arg(10000) /* 10k */ \
->Arg(100000) /* 100k */ \
->Arg(1000000) /* 1M */ \
#define SORT_BENCHMARK_DEFINE(name, type, keep) \
BENCHMARK_DEFINE_F(Compaction, name)(::benchmark::State & state) \
{ \
BM_compaction<type, algorithm::SORT_BASED>(state, get_keep(keep)); \
} \
BENCHMARK_REGISTER_F(Compaction, name) \
->UseManualTime() \
->Arg(10000) /* 10k */ \
->Arg(100000) /* 100k */ \
->Arg(1000000) /* 1M */ \
->Arg(10000000) /* 10M */

#define COMPACTION_BENCHMARK_DEFINE(type, keep) \
RBM_BENCHMARK_DEFINE(concat(type, _, keep), type, keep)
SORT_BENCHMARK_DEFINE(concat(type, _, keep), type, keep)

// TYPE
#define HASH_BENCHMARK_DEFINE(type) \
BENCHMARK_DEFINE_F(HashCompaction, type)(::benchmark::State & state) \
{ \
BM_compaction<type, algorithm::HASH_BASED>(state); \
} \
BENCHMARK_REGISTER_F(HashCompaction, type) \
->UseManualTime() \
->Arg(10000) /* 10k */ \
->Arg(100000) /* 100k */ \
->Arg(1000000) /* 1M */ \
->Arg(10000000) /* 10M */

#define HASH_COMPACTION_BENCHMARK_DEFINE(type) HASH_BENCHMARK_DEFINE(type)

using cudf::timestamp_ms;

COMPACTION_BENCHMARK_DEFINE(bool, NONE);
COMPACTION_BENCHMARK_DEFINE(int8_t, NONE);
COMPACTION_BENCHMARK_DEFINE(int32_t, NONE);
COMPACTION_BENCHMARK_DEFINE(int32_t, FIRST);
COMPACTION_BENCHMARK_DEFINE(int32_t, LAST);
using cudf::timestamp_ms;
COMPACTION_BENCHMARK_DEFINE(timestamp_ms, NONE);
COMPACTION_BENCHMARK_DEFINE(float, NONE);

HASH_COMPACTION_BENCHMARK_DEFINE(bool);
HASH_COMPACTION_BENCHMARK_DEFINE(int8_t);
HASH_COMPACTION_BENCHMARK_DEFINE(int32_t);
HASH_COMPACTION_BENCHMARK_DEFINE(int64_t);
HASH_COMPACTION_BENCHMARK_DEFINE(timestamp_ms);
HASH_COMPACTION_BENCHMARK_DEFINE(float);
4 changes: 2 additions & 2 deletions cpp/cmake/thirdparty/get_cucollections.cmake
@@ -1,5 +1,5 @@
# =============================================================================
# Copyright (c) 2021, NVIDIA CORPORATION.
# Copyright (c) 2021-2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
@@ -21,7 +21,7 @@ function(find_and_configure_cucollections)
cuco 0.0
GLOBAL_TARGETS cuco::cuco
CPM_ARGS GITHUB_REPOSITORY NVIDIA/cuCollections
GIT_TAG 193de1aa74f5721717f991ca757dc610c852bb17
GIT_TAG 922a87856aac17742fb964eeaf1b9bbc5d7a916e
OPTIONS "BUILD_TESTS OFF" "BUILD_BENCHMARKS OFF" "BUILD_EXAMPLES OFF"
)

46 changes: 45 additions & 1 deletion cpp/include/cudf/detail/stream_compaction.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -67,6 +67,19 @@ std::unique_ptr<table> apply_boolean_mask(
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<table> drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
duplicate_keep_option keep,
null_equality nulls_equal = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::sort_and_drop_duplicates
*
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<table> sort_and_drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
duplicate_keep_option keep,
@@ -75,6 +88,18 @@ std::unique_ptr<table> drop_duplicates(
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::unordered_drop_duplicates
*
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<table> unordered_drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
null_equality nulls_equal = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::distinct_count(column_view const&, null_policy, nan_policy)
*
@@ -94,5 +119,24 @@ cudf::size_type distinct_count(table_view const& input,
null_equality nulls_equal = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default);

/**
* @copydoc cudf::unordered_distinct_count(column_view const&, null_policy, nan_policy)
*
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
*/
cudf::size_type unordered_distinct_count(column_view const& input,
null_policy null_handling,
nan_policy nan_handling,
rmm::cuda_stream_view stream = rmm::cuda_stream_default);

/**
* @copydoc cudf::unordered_distinct_count(table_view const&, null_equality)
*
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
*/
cudf::size_type unordered_distinct_count(table_view const& input,
null_equality nulls_equal = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default);

} // namespace detail
} // namespace cudf
119 changes: 105 additions & 14 deletions cpp/include/cudf/stream_compaction.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -214,7 +214,38 @@ enum class duplicate_keep_option {
};

/**
* @brief Create a new table without duplicate rows
* @brief Eliminates all except the row specified by `keep` from every consecutive group of
* equivalent rows.
*
* Given an `input` table_view, one row from a group of equivalent elements is copied to
* output table depending on the value of @p keep:
* - KEEP_FIRST: only the first of a sequence of duplicate rows is copied
* - KEEP_LAST: only the last of a sequence of duplicate rows is copied
* - KEEP_NONE: no duplicate rows are copied
*
* @throws cudf::logic_error if The `input` row size mismatches with `keys`.
*
* @param[in] input input table_view to copy only unique rows
* @param[in] keys vector of indices representing key columns from `input`
* @param[in] keep keep first entry, last entry, or no entries if duplicates found
* @param[in] nulls_equal flag to denote nulls are equal if null_equality::EQUAL, nulls are not
* equal if null_equality::UNEQUAL
* @param[in] mr Device memory resource used to allocate the returned table's device
* memory
*
* @return Table with unique rows from each sequence of equivalent rows as per specified `keep`.
*/
std::unique_ptr<table> drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
duplicate_keep_option keep,
null_equality nulls_equal = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a new table without duplicate rows.
*
* The output table is sorted according to the lexicographic ordering of the `keys` rows.
*
* Given an `input` table_view, each row is copied to output table if the corresponding
* row of `keys` columns is unique, where the definition of unique depends on the value of @p keep:
@@ -233,9 +264,9 @@ enum class duplicate_keep_option {
* @param[in] mr Device memory resource used to allocate the returned table's device
* memory
*
* @return Table with unique rows as per specified `keep`.
* @return Table with sorted unique rows as per specified `keep`.
*/
std::unique_ptr<table> drop_duplicates(
std::unique_ptr<table> sort_and_drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
duplicate_keep_option keep,
@@ -244,37 +275,97 @@ std::unique_ptr<table> drop_duplicates(
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Count the unique elements in the column_view
* @brief Create a new table without duplicate rows with hash-based algorithms.
*
* Given an input column_view, number of unique elements in this column_view is returned
* Given an `input` table_view, each row is copied to output table if the corresponding
* row of `keys` columns is unique. If duplicate rows are present, it is unspecified which
* row is copied.
*
* Elements in the output table are in a random order.
*
* @throws cudf::logic_error if The `input` row size mismatches with `keys`.
*
* @param[in] input input table_view to copy only unique rows
* @param[in] keys vector of indices representing key columns from `input`
* @param[in] nulls_equal flag to denote nulls are equal if null_equality::EQUAL, nulls are not
* equal if null_equality::UNEQUAL
* @param[in] mr Device memory resource used to allocate the returned table's device
* memory
*
* @return Table with unique rows in an unspecified order.
*/
std::unique_ptr<table> unordered_drop_duplicates(
table_view const& input,
std::vector<size_type> const& keys,
null_equality nulls_equal = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Count the number of consecutive groups of equivalent elements in a column.
*
* If `null_handling` is null_policy::EXCLUDE and `nan_handling` is nan_policy::NAN_IS_NULL, both
* `NaN` and `null` values are ignored. If `null_handling` is null_policy::EXCLUDE and
* `nan_handling` is nan_policy::NAN_IS_VALID, only `null` is ignored, `NaN` is considered in unique
* count.
* `nan_handling` is nan_policy::NAN_IS_VALID, only `null` is ignored, `NaN` is considered in count.
*
* `null`s are handled as equal.
*
* @param[in] input The column_view whose unique elements will be counted.
* @param[in] input View of the input column
* @param[in] null_handling flag to include or ignore `null` while counting
* @param[in] nan_handling flag to consider `NaN==null` or not.
* @param[in] nan_handling flag to consider `NaN==null` or not
*
* @return number of unique elements
* @return number of consecutive groups in the column
*/
cudf::size_type distinct_count(column_view const& input,
null_policy null_handling,
nan_policy nan_handling);

/**
* @brief Count the unique rows in a table.
* @brief Count the number of consecutive groups of equivalent elements in a table.
*
*
* @param[in] input Table whose unique rows will be counted.
* @param[in] input Table whose number of consecutive groups will be counted
* @param[in] nulls_equal flag to denote if null elements should be considered equal
* nulls are not equal if null_equality::UNEQUAL
*
* @return number of unique rows in the table
* @return number of consecutive groups in the table
*/
cudf::size_type distinct_count(table_view const& input,
null_equality nulls_equal = null_equality::EQUAL);

/**
* @brief Count the unique elements in the column_view.
*
* Given an input column_view, number of unique elements in this column_view is returned.
*
* If `null_handling` is null_policy::EXCLUDE and `nan_handling` is nan_policy::NAN_IS_NULL, both
* `NaN` and `null` values are ignored. If `null_handling` is null_policy::EXCLUDE and
* `nan_handling` is nan_policy::NAN_IS_VALID, only `null` is ignored, `NaN` is considered in unique
* count.
*
* `null`s are handled as equal.
*
* @param[in] input The column_view whose unique elements will be counted
* @param[in] null_handling flag to include or ignore `null` while counting
* @param[in] nan_handling flag to consider `NaN==null` or not
*
* @return number of unique elements
*/
cudf::size_type unordered_distinct_count(column_view const& input,
null_policy null_handling,
nan_policy nan_handling);

/**
* @brief Count the unique rows in a table.
*
*
* @param[in] input Table whose unique rows will be counted
* @param[in] nulls_equal flag to denote if null elements should be considered equal
* nulls are not equal if null_equality::UNEQUAL
*
* @return number of unique rows in the table
*/
cudf::size_type unordered_distinct_count(table_view const& input,
null_equality nulls_equal = null_equality::EQUAL);

/** @} */
} // namespace cudf