Allow hash_partition to take a seed value #7771
Looks good, but this behavior needs testing. Please add a gtest that exercises the seed parameter.
The code looks good to me, but I wonder whether we should document somewhere that the seed in IdentityHash and MD5Hash is not used.
Seems ok to me. Is there any concern here of old calls to this accidentally getting the seed and stream parameters silently crossed?
I don't think there is. The compiler should raise an error in such cases. (I just tried it locally and got invalid conversion errors.)
I don't think we should commit to that in the documentation, because we may later change them to use the seed value.
I feel like this can happen, right?
// user intended default stream, but is now getting a seed of 0
hash_partition(t, columns_to_hash, num_partitions, hash_function, 0);
Although maybe there aren't many cases where we're passing 0 manually.
That's a possibility, but as you said, a user wouldn't override the default argument just to pass the default stream. Doing so would accidentally do what they wanted, for a different reason (i.e., because the default value of the stream argument is 0). I feel a custom stream being silently crossed is what we should really worry about, and such a case would be rejected by the compiler.
Hmmm. Streams are strongly typed in libcudf: zero is not a valid argument for the stream parameter, so it won't compile. But zero will work for this new seed parameter.
I see. Then there is no case in which the stream argument would be silently crossed with the seed.
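The point about strongly typed streams can be illustrated with a small host-only sketch. Everything here is hypothetical: `sketch_stream_view` is a stand-in for a strongly typed stream handle (not libcudf's real type), and `sketch_hash_partition` only mimics the parameter order implied by the discussion (seed before stream). It shows that a literal 0 binds to the seed, while an integer cannot be converted to the stream type.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a strongly typed stream handle.
// Integers and nullptr are rejected at compile time.
struct sketch_stream_view {
  constexpr sketch_stream_view() = default;
  sketch_stream_view(int)            = delete;
  sketch_stream_view(std::nullptr_t) = delete;
};

// Hypothetical signature: the seed parameter sits before the stream parameter.
// Returns the seed it received so callers can see which parameter a value bound to.
inline std::uint32_t sketch_hash_partition(int num_partitions,
                                           std::uint32_t seed        = 0,
                                           sketch_stream_view stream = sketch_stream_view{})
{
  (void)num_partitions;
  (void)stream;
  return seed;
}

// An int can never silently become a stream in this sketch.
static_assert(!std::is_convertible<int, sketch_stream_view>::value,
              "integers must not convert to the stream type");
```

With this shape, `sketch_hash_partition(4, 0)` compiles and passes 0 as the seed, whereas a call such as `sketch_hash_partition(4, 0, 0)` would be rejected because the third argument cannot convert to the stream type.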
Codecov Report
@@               Coverage Diff               @@
##           branch-0.19    #7771      +/-   ##
===============================================
+ Coverage        81.86%   82.68%   +0.81%
===============================================
  Files              101      103       +2
  Lines            16884    17566     +682
===============================================
+ Hits             13822    14524     +702
+ Misses            3062     3042      -20
On further review, I think this is actually what we want. The seed should be used as the seed for the first hash, and not just combined with the first hash:
// Hash the first column w/ the seed
auto const initial_hash = type_dispatcher(_table.column(0).type(),
                                          element_hasher_with_seed<hash_function, has_nulls>{_seed},
                                          _table.column(0),
                                          row_index);
auto hasher = [=](size_type column_index) {
  return cudf::type_dispatcher(_table.column(column_index).type(),
                               element_hasher<hash_function, has_nulls>{},
                               _table.column(column_index),
                               row_index);
};
// Hash each element and combine all the hash values together.
// Note that the counting iterator starts at 1 and not 0 now,
// since we already hashed the first column.
return thrust::transform_reduce(thrust::seq,
                                thrust::make_counting_iterator(1),
                                thrust::make_counting_iterator(_table.num_columns()),
                                hasher,
                                initial_hash,
                                hash_combiner);
I think you're right. I just pushed the suggested change. Please take a look.
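The idea in the suggestion above (seed the hash of the first column, then fold in the unseeded hashes of the remaining columns) can be sketched on the host without Thrust or libcudf. This is only an illustration under stated assumptions: `sketch_element_hash` uses the MurmurHash3 32-bit finalizer as a stand-in for the real element hasher, `sketch_hash_combine` is a Boost-style combiner chosen for the example, and a row is modeled as a plain vector of 32-bit values.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Stand-in element hash (assumption): MurmurHash3 32-bit finalizer applied to value ^ seed.
inline std::uint32_t sketch_element_hash(std::uint32_t value, std::uint32_t seed = 0)
{
  std::uint32_t h = value ^ seed;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Stand-in combiner (assumption): Boost-style hash_combine.
inline std::uint32_t sketch_hash_combine(std::uint32_t lhs, std::uint32_t rhs)
{
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Hash the first column with the seed, then combine with the unseeded
// hashes of columns 1..n-1, mirroring the structure of the suggestion.
inline std::uint32_t sketch_row_hash(std::vector<std::uint32_t> const& row, std::uint32_t seed)
{
  if (row.empty()) { return seed; }
  std::uint32_t const initial_hash = sketch_element_hash(row[0], seed);
  return std::accumulate(row.begin() + 1,
                         row.end(),
                         initial_hash,
                         [](std::uint32_t acc, std::uint32_t value) {
                           return sketch_hash_combine(acc, sketch_element_hash(value));
                         });
}
```

For a single-column row the result is just the seeded hash of that column, so changing the seed changes the row hash directly instead of merely perturbing a combination step.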
@gpucibot merge
rerun tests
Just pushed a fix to recover the default behavior; the row hasher was originally doing 0 ⊕ hf(col0) ⊕ hf(col1) ⊕ ..., where operator ⊕ is [...] Though I fixed the code to recover the original behavior, I don't think the failing tests were good ones; the tests are checking if [...]
Can you put this in an issue? Thanks!
Here is the issue: #7819
This PR allows hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to massive hash collisions and severely degrades join performance on multiple GPUs. An initial fix (#6726) addressed this problem, but it only added a code path that uses the identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR enables such configurations by making hash_partition accept an optional seed value.
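One way to see why reusing the same hash for partitioning and join hurts is a toy host-side model. The setup is an assumption made for illustration, not libcudf's implementation: keys are partitioned by `hash % 4`, the join table uses `hash % 16` buckets, and `sketch_seeded_hash` uses the MurmurHash3 32-bit finalizer as a stand-in hash. Keys that land in partition 0 all satisfy `hash % 4 == 0`, so with the same seed they can reach at most 4 of the 16 join buckets.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Stand-in seeded hash (assumption): MurmurHash3 32-bit finalizer applied to key ^ seed.
inline std::uint32_t sketch_seeded_hash(std::uint32_t key, std::uint32_t seed)
{
  std::uint32_t h = key ^ seed;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Counts how many join buckets are touched by the keys routed to partition 0.
inline std::size_t sketch_buckets_used(std::uint32_t partition_seed, std::uint32_t join_seed)
{
  constexpr std::uint32_t num_partitions = 4;   // toy value
  constexpr std::uint32_t num_buckets    = 16;  // toy value, shares a factor with num_partitions
  std::set<std::uint32_t> used;
  for (std::uint32_t key = 0; key < 10000; ++key) {
    // Keep only the keys that hash partitioning sends to partition 0.
    if (sketch_seeded_hash(key, partition_seed) % num_partitions != 0) { continue; }
    used.insert(sketch_seeded_hash(key, join_seed) % num_buckets);
  }
  return used.size();
}
```

In this model, `sketch_buckets_used(0, 0)` is bounded by 4 because the partition predicate constrains the low bits that the join also uses, while a different join seed such as `0x9747b28c` spreads the same keys over more buckets.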