Optimize compaction operations #10030
Conversation
@PointKernel I started another round of review on Friday. I'm going to submit these comments now, but I haven't completed the second pass of review yet. Some of these comments may be outdated now; I apologize!
Codecov Report

@@            Coverage Diff             @@
##           branch-22.04     #10030   +/- ##
==============================================
  Coverage              ?     10.48%
==============================================
  Files                 ?        122
  Lines                 ?      20496
  Branches              ?          0
==============================================
  Hits                  ?       2148
  Misses                ?      18348
  Partials              ?          0

Continue to review the full report at Codecov.
Another round of feedback attached. Thanks for your persistence, it is a large PR!
rerun tests
I have only a few minor comments for style/clarity left. I am really happy with the progress this has seen through several iterations, and I appreciate your effort on this PR! I will approve this once the minor comments have been resolved.
Suggested edit:

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
--- cpp/src/stream_compaction/distinct_count.cu
+++ cpp/src/stream_compaction/distinct_count.cu
@@ -233,21 +233,24 @@
rmm::cuda_stream_view stream)
{
if (0 == input.size() or input.null_count() == input.size()) { return 0; }
- // Check for NaNs
- // Checking for nulls in input and flag nan_handling, as the count will
- // only get affected if these two conditions are true. NaN will only be
- // double-counted as a null if nan_handling was NAN_IS_NULL and input also
- // had null values. If so, we decrement the count.
+ auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
+
+ // Check for nulls. If the null policy is EXCLUDE and null values were found,
+ // we decrement the count.
+ auto const has_null = input.has_nulls();
+ if (has_null and null_handling == null_policy::EXCLUDE) { --count; }
+
+ // Check for NaNs. There are two cases that can lead to decrementing the
+ // count. The first case is when the input has no nulls, but has NaN values
+ // handled as a null via NAN_IS_NULL and has a policy to EXCLUDE null values
+ // from the count. The second case is when the input has null values and NaN
+ // values handled as nulls via NAN_IS_NULL. Regardless of whether the null
+ // policy is set to EXCLUDE, we decrement the count to avoid double-counting
+ // null and NaN as distinct entities.
auto const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) and
cudf::type_dispatcher(input.type(), has_nans{}, input, stream);
- auto const has_null = input.has_nulls();
-
- auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
-
- // if nan is considered null and there are already null values
- if (null_handling == null_policy::EXCLUDE and has_null) { --count; }
if (has_nan_as_null and (has_null or null_handling == null_policy::EXCLUDE)) { --count; }
return count;
}
} // namespace detail
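For reference, here is a minimal host-side sketch of the counting rule in this suggested edit, with std::unordered_set standing in for the hash-based unordered_distinct_count and simplified stand-in enums for the cudf policies. It is illustrative only, not the library implementation.

// Host-side sketch (not cudf code). Nulls are modeled as std::nullopt and all
// NaNs are collapsed into a single entry, mirroring a hash-based count with
// null_equality::EQUAL; the two decrements then match the suggested edit.
#include <cmath>
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

enum class null_policy { EXCLUDE, INCLUDE };          // simplified stand-ins
enum class nan_policy { NAN_IS_NULL, NAN_IS_VALID };  // for the cudf enums

std::size_t distinct_count_sketch(std::vector<std::optional<double>> const& col,
                                  null_policy null_handling,
                                  nan_policy nan_handling)
{
  std::size_t null_count = 0;
  bool has_nan           = false;
  std::unordered_set<double> seen;  // distinct non-null, non-NaN values
  for (auto const& v : col) {
    if (!v.has_value()) {
      ++null_count;
    } else if (std::isnan(*v)) {
      has_nan = true;  // collapse all NaNs into one entry
    } else {
      seen.insert(*v);
    }
  }
  if (col.empty() || null_count == col.size()) { return 0; }

  bool const has_null = null_count > 0;
  // One entry for all nulls, one entry for NaN, plus the distinct values.
  std::size_t count = seen.size() + (has_nan ? 1 : 0) + (has_null ? 1 : 0);

  // Check for nulls: if the null policy is EXCLUDE and nulls were found,
  // remove the null entry from the count.
  if (has_null && null_handling == null_policy::EXCLUDE) { --count; }

  // Check for NaNs: if NaN is treated as null, it either merges with an
  // existing null entry (avoid double-counting) or is excluded along with
  // nulls under the EXCLUDE policy.
  bool const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) && has_nan;
  if (has_nan_as_null && (has_null || null_handling == null_policy::EXCLUDE)) { --count; }
  return count;
}

The two decrements mirror the comments in the diff: the first removes the null entry when nulls are excluded, and the second avoids counting NaN and null as two distinct entities once NaN is treated as a null.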
@PointKernel I had one final suggestion relating to #10030 (comment). Diff is above. Approving.
CMake changes LGTM
@gpucibot merge
Related to #9413.

This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs by using hash-based algorithms. It doesn't close the original issue since adding a `std::unique`-like `drop_duplicates` is not addressed in this PR. It involves several changes:

- `distinct_count`: counts the number of consecutive groups of equivalent rows instead of the total number of unique rows (illustrated in the sketch after this list).
- `unordered_distinct_count`: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of `cuco` with bug fixes: "Fix an insert count bug" (NVIDIA/cuCollections#132) and "Get rid of `std::move` when using `cuco::make_pair`" (NVIDIA/cuCollections#138).
- `unordered_drop_duplicates`: similar to `drop_duplicates`, but this API doesn't support the `keep` option and its output is in an unspecified order.
- Replaces existing `drop_duplicates`/`distinct_count` use cases with the `unordered_` versions where applicable.
- Adds benchmarks using `nvbench`.
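To illustrate the difference between the two counting semantics above, here is a small host-side sketch in standard C++ (not the cudf APIs); the column is simplified to a vector of ints and the function names are illustrative only.

// Host-side sketch (not cudf code) of the two counting semantics.
#include <cstddef>
#include <unordered_set>
#include <vector>

// Counts consecutive groups of equal values: the std::unique-style semantics
// that distinct_count keeps after this PR.
std::size_t consecutive_group_count(std::vector<int> const& col)
{
  if (col.empty()) { return 0; }
  std::size_t groups = 1;
  for (std::size_t i = 1; i < col.size(); ++i) {
    if (col[i] != col[i - 1]) { ++groups; }
  }
  return groups;
}

// Counts unique values across the whole column: the hash-based semantics of
// the new unordered_distinct_count.
std::size_t unordered_unique_count(std::vector<int> const& col)
{
  return std::unordered_set<int>(col.begin(), col.end()).size();
}

// Example: {1, 1, 2, 1} has three consecutive groups (1, 2, 1) but only two
// unique values (1 and 2). An unordered, hash-based drop-duplicates would
// keep one copy of each unique value, in an unspecified order.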