-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not add nulls to the hash table when null_equality::NOT_EQUAL is passed to left_semi_join and left_anti_join #8277
Do not add nulls to the hash table when null_equality::NOT_EQUAL is passed to left_semi_join and left_anti_join #8277
Conversation
…assed to left_anti_join and left_semi_join. Major performance gains. Added benchmarks.
I noticed as I was doing this that we have no tests for left_anti_join. Will add. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice speedup!
cpp/src/join/semi_join.cu
Outdated
[hash_table] __device__(size_type idx) mutable { | ||
hash_table.insert(thrust::make_pair(idx, true)); | ||
}); | ||
// if compare_nulls == NOT_EQUAL, we can simply ignore any rows that are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also adjust the hash table size based on the number of valid rows? Not sure how important this is for join operations, just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this seems to work. I have to do a count_unset_bits()
on the aggregate bitmask though. This also feels mildly spooky to change as I'm trying to treat lightly in this unfamiliar module :)
auto const row_bitmask = (compare_nulls == null_equality::EQUAL)
? rmm::device_buffer{}
: cudf::detail::bitmask_and(right_flattened_keys, stream);
size_type right_row_ignore_count = compare_nulls == null_equality::EQUAL ? 0 :
count_unset_bits(static_cast<const bitmask_type*>(row_bitmask.data()), 0, right_num_rows);
...
size_t const hash_table_size = compute_hash_table_size(right_num_rows - right_row_ignore_count);
…ith both nulls EQUAL and UNEQUAL.
Codecov Report
@@ Coverage Diff @@
## branch-21.06 #8277 +/- ##
===============================================
Coverage ? 82.86%
===============================================
Files ? 105
Lines ? 17861
Branches ? 0
===============================================
Hits ? 14801
Misses ? 3060
Partials ? 0 Continue to review full report at Codecov.
|
@gpucibot merge |
Fixes #7300
This is fundamentally the same issue and fix as https://github.com/rapidsai/cudf/pull/6943/files from @hyperbolic2346
When nulls are considered not equal (
null_equality::NOT_EQUAL
) there is no point in adding them to the hash table used for the join as they will never compare as true against anything. Adding large numbers of nulls was causing huge performance issues.Includes a fix to doxygen comments for
left_anti_join
Performance gain is tremendous.
Before:
After: