Optimize compaction operations #10030
Conversation
@PointKernel I started another round of review on Friday. I'm going to submit these comments now, but I haven't completed the second pass of review yet. Some of these comments may be outdated now; I apologize!
Codecov Report

@@            Coverage Diff             @@
##           branch-22.04     #10030   +/- ##
==============================================
  Coverage              ?     10.48%
==============================================
  Files                 ?        122
  Lines                 ?      20496
  Branches              ?          0
==============================================
  Hits                  ?       2148
  Misses                ?      18348
  Partials              ?          0

Continue to review the full report at Codecov.
Another round of feedback attached. Thanks for your persistence, it is a large PR!
rerun tests
I have only a few minor comments for style/clarity left. I am really happy with the progress this has seen through several iterations, and I appreciate your effort on this PR! I will approve this once the minor comments have been resolved.
Suggested edit:

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
--- cpp/src/stream_compaction/distinct_count.cu
+++ cpp/src/stream_compaction/distinct_count.cu
@@ -233,21 +233,24 @@
rmm::cuda_stream_view stream)
{
if (0 == input.size() or input.null_count() == input.size()) { return 0; }
- // Check for NaNs
- // Checking for nulls in input and flag nan_handling, as the count will
- // only get affected if these two conditions are true. NaN will only be
- // double-counted as a null if nan_handling was NAN_IS_NULL and input also
- // had null values. If so, we decrement the count.
+ auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
+
+ // Check for nulls. If the null policy is EXCLUDE and null values were found,
+ // we decrement the count.
+ auto const has_null = input.has_nulls();
+ if (has_null and null_handling == null_policy::EXCLUDE) { --count; }
+
+ // Check for NaNs. There are two cases that can lead to decrementing the
+ // count. The first case is when the input has no nulls, but has NaN values
+ // handled as a null via NAN_IS_NULL and has a policy to EXCLUDE null values
+ // from the count. The second case is when the input has null values and NaN
+ // values handled as nulls via NAN_IS_NULL. Regardless of whether the null
+ // policy is set to EXCLUDE, we decrement the count to avoid double-counting
+ // null and NaN as distinct entities.
auto const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) and
cudf::type_dispatcher(input.type(), has_nans{}, input, stream);
- auto const has_null = input.has_nulls();
-
- auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
-
- // if nan is considered null and there are already null values
- if (null_handling == null_policy::EXCLUDE and has_null) { --count; }
if (has_nan_as_null and (has_null or null_handling == null_policy::EXCLUDE)) { --count; }
return count;
}
} // namespace detail
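For reference, here is a minimal host-side sketch of the counting rule in this suggested edit, with std::unordered_set standing in for the hash-based unordered_distinct_count and simplified stand-in enums for the cudf policies. It is illustrative only, not the library implementation.

// Host-side sketch (not cudf code). Nulls are modeled as std::nullopt and all
// NaNs are collapsed into a single entry, mirroring a hash-based count with
// null_equality::EQUAL; the two decrements then match the suggested edit.
#include <cmath>
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

enum class null_policy { EXCLUDE, INCLUDE };          // simplified stand-ins
enum class nan_policy { NAN_IS_NULL, NAN_IS_VALID };  // for the cudf enums

std::size_t distinct_count_sketch(std::vector<std::optional<double>> const& col,
                                  null_policy null_handling,
                                  nan_policy nan_handling)
{
  std::size_t null_count = 0;
  bool has_nan           = false;
  std::unordered_set<double> seen;  // distinct non-null, non-NaN values
  for (auto const& v : col) {
    if (!v.has_value()) {
      ++null_count;
    } else if (std::isnan(*v)) {
      has_nan = true;  // collapse all NaNs into one entry
    } else {
      seen.insert(*v);
    }
  }
  if (col.empty() || null_count == col.size()) { return 0; }

  bool const has_null = null_count > 0;
  // One entry for all nulls, one entry for NaN, plus the distinct values.
  std::size_t count = seen.size() + (has_nan ? 1 : 0) + (has_null ? 1 : 0);

  // Check for nulls: if the null policy is EXCLUDE and nulls were found,
  // remove the null entry from the count.
  if (has_null && null_handling == null_policy::EXCLUDE) { --count; }

  // Check for NaNs: if NaN is treated as null, it either merges with an
  // existing null entry (avoid double-counting) or is excluded along with
  // nulls under the EXCLUDE policy.
  bool const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) && has_nan;
  if (has_nan_as_null && (has_null || null_handling == null_policy::EXCLUDE)) { --count; }
  return count;
}

The two decrements mirror the comments in the diff: the first removes the null entry when nulls are excluded, and the second avoids counting NaN and null as two distinct entities once NaN is treated as a null.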
@PointKernel I had one final suggestion relating to #10030 (comment). Diff is above. Approving.
CMake changes LGTM
@gpucibot merge
Related to #9413.

This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs by using hash-based algorithms. It doesn't close the original issue since adding a `std::unique`-like `drop_duplicates` is not addressed in this PR. It involves several changes:

- `distinct_count`: counts the number of consecutive groups of equivalent rows instead of the total number of unique rows (illustrated in the sketch after this list).
- `unordered_distinct_count`: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of `cuco` with bug fixes: "Fix an insert count bug" (NVIDIA/cuCollections#132) and "Get rid of `std::move` when using `cuco::make_pair`" (NVIDIA/cuCollections#138).
- `unordered_drop_duplicates`: similar to `drop_duplicates`, but this API doesn't support the `keep` option and its output is in an unspecified order.
- Replaces existing `drop_duplicates`/`distinct_count` use cases with the `unordered_` versions where applicable.
- Adds benchmarks using `nvbench`.
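To illustrate the difference between the two counting semantics above, here is a small host-side sketch in standard C++ (not the cudf APIs); the column is simplified to a vector of ints and the function names are illustrative only.

// Host-side sketch (not cudf code) of the two counting semantics.
#include <cstddef>
#include <unordered_set>
#include <vector>

// Counts consecutive groups of equal values: the std::unique-style semantics
// that distinct_count keeps after this PR.
std::size_t consecutive_group_count(std::vector<int> const& col)
{
  if (col.empty()) { return 0; }
  std::size_t groups = 1;
  for (std::size_t i = 1; i < col.size(); ++i) {
    if (col[i] != col[i - 1]) { ++groups; }
  }
  return groups;
}

// Counts unique values across the whole column: the hash-based semantics of
// the new unordered_distinct_count.
std::size_t unordered_unique_count(std::vector<int> const& col)
{
  return std::unordered_set<int>(col.begin(), col.end()).size();
}

// Example: {1, 1, 2, 1} has three consecutive groups (1, 2, 1) but only two
// unique values (1 and 2). An unordered, hash-based drop-duplicates would
// keep one copy of each unique value, in an unspecified order.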