Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark #7672
Conversation
Codecov Report
@@               Coverage Diff               @@
##           branch-0.19    #7672      +/-   ##
===============================================
+ Coverage        81.86%   82.09%   +0.22%
===============================================
  Files              101      101
  Lines            16884    17064     +180
===============================================
+ Hits             13822    14008     +186
+ Misses            3062     3056       -6
Continue to review full report at Codecov.
@gpucibot merge
These changes look fine.
Is this intended to work with decimal? There are no tests for it, but it is not explicitly disallowed either.
It feels like we can add that in as a simple specialization. The raw int value of a DECIMAL32 should be cast to a long and then hashed. The raw long value of a DECIMAL64 should just be hashed.
Excellent point. I'll add support for properly hashing decimal32 and decimal64 along with tests.
template <>
hash_value_type CUDA_DEVICE_CALLABLE
SparkMurmurHash3_32<numeric::decimal32>::operator()(numeric::decimal32 const& key) const
{
  // DECIMAL32: widen the raw 32-bit unscaled value to 64 bits before hashing,
  // matching Spark's cast of the int value to a long.
  return this->compute<uint64_t>(key.value());
}
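A decimal64 counterpart along the lines described above would presumably hash the raw long value directly. A minimal sketch, assuming the same compute helper (illustrative, not necessarily the PR's exact code):

template <>
hash_value_type CUDA_DEVICE_CALLABLE
SparkMurmurHash3_32<numeric::decimal64>::operator()(numeric::decimal64 const& key) const
{
  // DECIMAL64: the unscaled value is already 64 bits wide, so hash it as-is,
  // matching Spark's hashing of a long.
  return this->compute<uint64_t>(key.value());
}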
I won't hold up the PR for this, but there's a trick to reducing the number of specializations, by checking if there's a `.value()` method available on the `key` type. :]
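The trick alluded to here could be a detection idiom along these lines; this is a hypothetical sketch (the trait name has_value_method is invented for illustration), not code from the PR:

#include <type_traits>
#include <utility>

// True when T has a const .value() member callable with no arguments.
template <typename T, typename = void>
struct has_value_method : std::false_type {};

template <typename T>
struct has_value_method<T, std::void_t<decltype(std::declval<T const&>().value())>>
  : std::true_type {};

// A single generic operator() could then dispatch with if constexpr instead
// of requiring one explicit specialization per fixed-point type:
//
//   if constexpr (has_value_method<T>::value) {
//     return this->compute<uint64_t>(key.value());
//   } else {
//     return this->compute(key);
//   }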
#7024 added a Spark variant of Murmur3 hashing, but it is inconsistent with Apache Spark's hash calculations in a few areas:

- `-0.0` and `0.0` are not treated the same by Apache Spark for floats and doubles (see the float sketch below)

In addition, libcudf allows hashing of timestamp columns, but the JNI bindings asserted if timestamp columns were passed in, disabling the ability to hash on timestamps directly.
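To illustrate the first point: Java's Float.floatToIntBits, which Spark hashes through, canonicalizes NaN bit patterns but preserves the sign of zero, so -0.0 and 0.0 hash to different values. A minimal sketch of a Spark-compatible float specialization, assuming the compute helper shown earlier (illustrative only, not the PR's exact code):

template <>
hash_value_type CUDA_DEVICE_CALLABLE
SparkMurmurHash3_32<float>::operator()(float const& key) const
{
  // Canonicalize NaNs as Float.floatToIntBits does, but do NOT normalize
  // -0.0f to 0.0f: Spark hashes the raw bits, so the two zeros must differ.
  float const normalized = isnan(key) ? std::numeric_limits<float>::quiet_NaN() : key;
  return this->compute(normalized);
}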