Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark #7672

jlowe · 2021-03-22T20:59:15Z

#7024 added a Spark variant of Murmur3 hashing, but it is inconsistent with Apache Spark's hash calculations in a few areas:

-0.0 and 0.0 are not treated the same by Apache Spark for floats and doubles
byte and short integral values are upcast to a 32-bit signed int (i.e.: sign-extended) before calculating the hash
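
To make the two points above concrete, here is a minimal host-side sketch in plain C++ (not the libcudf device code). The function names are hypothetical, the seed is arbitrary, and NaN canonicalization is deliberately left out. It assumes the standard Murmur3 x86_32 constants and shows how sign extension versus zero fill changes the hashed word for negative bytes, and why hashing raw float bits keeps -0.0 and 0.0 distinct.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rotate a 32-bit word left by r bits (0 < r < 32).
inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Murmur3 x86_32 over a single 4-byte block (standard constants).
// Hypothetical helper name; takes the already-widened 32-bit pattern.
inline std::uint32_t murmur3_hash_int(std::uint32_t bits, std::uint32_t seed)
{
  std::uint32_t k1 = bits;
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;

  std::uint32_t h1 = seed ^ k1;
  h1 = rotl32(h1, 13);
  h1 = h1 * 5u + 0xe6546b64u;

  h1 ^= 4u;  // total input length in bytes
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}

// Spark-style widening: the byte is sign-extended to a 32-bit int first.
inline std::uint32_t spark_hash_byte(std::int8_t v, std::uint32_t seed)
{
  return murmur3_hash_int(static_cast<std::uint32_t>(static_cast<std::int32_t>(v)), seed);
}

// The inconsistent variant: the byte is zero-filled to 32 bits.
inline std::uint32_t zero_filled_hash_byte(std::int8_t v, std::uint32_t seed)
{
  return murmur3_hash_int(static_cast<std::uint32_t>(static_cast<std::uint8_t>(v)), seed);
}

// Floats are hashed by their raw IEEE-754 bits, so -0.0f (0x80000000)
// and 0.0f (0x00000000) produce different hashes. NaN handling is omitted.
inline std::uint32_t spark_hash_float(float v, std::uint32_t seed)
{
  std::uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return murmur3_hash_int(bits, seed);
}
```

Non-negative bytes hash identically under both widenings; only negative values expose the difference, which is why a test column needs negative byte and short values to catch this.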

In addition, libcudf allows hashing of timestamp columns, but the JNI bindings asserted if timestamp columns were passed in, which prevented hashing on timestamps directly.

cpp/include/cudf/detail/utilities/hash_functions.cuh

cpp/tests/hashing/hash_test.cpp

codecov · 2021-03-23T00:01:57Z

Codecov Report

Merging #7672 (50a58e0) into branch-0.19 (7871e7a) will increase coverage by 0.22%.
The diff coverage is n/a.

❗ Current head 50a58e0 differs from pull request most recent head e070822. Consider uploading reports for the commit e070822 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7672      +/-   ##
===============================================
+ Coverage        81.86%   82.09%   +0.22%     
===============================================
  Files              101      101              
  Lines            16884    17064     +180     
===============================================
+ Hits             13822    14008     +186     
+ Misses            3062     3056       -6

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/categorical.py | `91.62% <ø> (+0.23%)` | ⬆️ |
| python/cudf/cudf/core/column/column.py | `87.77% <ø> (+0.01%)` | ⬆️ |
| python/cudf/cudf/core/column/datetime.py | `89.09% <ø> (ø)` | |
| python/cudf/cudf/core/column/decimal.py | `92.75% <ø> (-2.12%)` | ⬇️ |
| python/cudf/cudf/core/column/lists.py | `89.60% <ø> (-1.80%)` | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | `94.83% <ø> (-0.20%)` | ⬇️ |
| python/cudf/cudf/core/column/string.py | `86.58% <ø> (+0.08%)` | ⬆️ |
| python/cudf/cudf/core/column/timedelta.py | `88.23% <ø> (ø)` | |
| python/cudf/cudf/core/column_accessor.py | `95.87% <ø> (+0.55%)` | ⬆️ |
| python/cudf/cudf/core/dataframe.py | `90.78% <ø> (+0.31%)` | ⬆️ |
| ... and 22 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 267d29b...e070822.

jrhemstad · 2021-03-24T01:35:39Z

@gpucibot merge

revans2

These changes look fine.

Is this intended to work with decimal? There are no tests for it, but it is also not explicitly disallowed either.

It feels like we can add that in as a simple specialization. The raw int value of the DECIMAL32 should be cast to a long and then hashed. The raw long value of a DECIMAL64 should just be hashed.
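
A rough host-side sketch of that suggestion, in plain C++ rather than the libcudf device code: the unscaled DECIMAL32 value is widened to a 64-bit long and then both types go through the same 8-byte Murmur3 path. All function names here are hypothetical, the seed is arbitrary, and the low-word-then-high-word block order is an assumption about how an 8-byte value is fed to Murmur3 x86_32.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Standard Murmur3 x86_32 mixing steps.
inline std::uint32_t mix_k1(std::uint32_t k1)
{
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

inline std::uint32_t mix_h1(std::uint32_t h1, std::uint32_t k1)
{
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

inline std::uint32_t fmix(std::uint32_t h1, std::uint32_t length)
{
  h1 ^= length;
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}

// Hash a 64-bit value as two 4-byte blocks: low word first, then high word.
inline std::uint32_t murmur3_hash_long(std::int64_t input, std::uint32_t seed)
{
  auto const bits = static_cast<std::uint64_t>(input);
  auto const low  = static_cast<std::uint32_t>(bits);
  auto const high = static_cast<std::uint32_t>(bits >> 32);
  std::uint32_t h1 = mix_h1(seed, mix_k1(low));
  h1               = mix_h1(h1, mix_k1(high));
  return fmix(h1, 8u);  // 8 bytes were hashed
}

// DECIMAL32: sign-extend the raw 32-bit unscaled value to 64 bits, then hash.
inline std::uint32_t hash_decimal32(std::int32_t unscaled, std::uint32_t seed)
{
  return murmur3_hash_long(static_cast<std::int64_t>(unscaled), seed);
}

// DECIMAL64: the raw 64-bit unscaled value is hashed as-is.
inline std::uint32_t hash_decimal64(std::int64_t unscaled, std::uint32_t seed)
{
  return murmur3_hash_long(unscaled, seed);
}
```

With this shape, a DECIMAL32 and a DECIMAL64 holding the same unscaled value hash identically, including for negative values, because the widening is a sign extension rather than a zero fill.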

jlowe · 2021-03-24T15:12:41Z

> Is this intended to work with decimal? There are no tests for it, but it is also not explicitly disallowed either.

Excellent point. I'll add support for properly hashing decimal32 and decimal64 along with tests.

mythrocks · 2021-03-24T17:46:47Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+hash_value_type CUDA_DEVICE_CALLABLE
+SparkMurmurHash3_32<numeric::decimal32>::operator()(numeric::decimal32 const& key) const
+{
+  return this->compute<uint64_t>(key.value());


I won't hold up the PR for this, but there's a trick to reducing the number of specializations, by checking if there's a .value() method available on the key type. :]
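
One way the trick could look, sketched in standalone C++17 with hypothetical names (`has_value_method`, `toy_decimal32`, `toy_decimal64`, `unscaled_as_long`). This is not the libcudf code, only an illustration of detecting a `.value()` member so that a single template covers every fixed-point key type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>

// Primary template: by default a type is assumed not to expose value().
template <typename T, typename = void>
struct has_value_method : std::false_type {};

// Selected when `key.value()` is a well-formed expression on a const T.
template <typename T>
struct has_value_method<T, std::void_t<decltype(std::declval<T const&>().value())>>
  : std::true_type {};

// Hypothetical stand-ins for fixed-point key types that expose value().
struct toy_decimal32 {
  std::int32_t raw;
  std::int32_t value() const { return raw; }
};

struct toy_decimal64 {
  std::int64_t raw;
  std::int64_t value() const { return raw; }
};

// One function template for every key type with value(): widen the
// unscaled value to 64 bits. Types without value() do not match this
// overload and would be handled elsewhere.
template <typename Key, std::enable_if_t<has_value_method<Key>::value, int> = 0>
std::int64_t unscaled_as_long(Key const& key)
{
  return static_cast<std::int64_t>(key.value());
}
```

The result of `unscaled_as_long` would then be passed to the 64-bit hash path, replacing the separate decimal32 and decimal64 specializations with one constrained template.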

java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java

Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark

d01cc1c

jlowe added the 3 - Ready for Review, libcudf, Java, Spark, 4 - Needs cuDF (Java) Reviewer, and non-breaking labels Mar 22, 2021

jlowe requested a review from a team as a code owner March 22, 2021 20:59

jlowe self-assigned this Mar 22, 2021

jlowe requested a review from a team as a code owner March 22, 2021 20:59

jlowe requested review from mythrocks and ttnghia March 22, 2021 20:59

jlowe added the bug label Mar 22, 2021

ttnghia reviewed Mar 22, 2021

cpp/include/cudf/detail/utilities/hash_functions.cuh (resolved)

cpp/tests/hashing/hash_test.cpp (resolved)

ttnghia approved these changes Mar 24, 2021

jrhemstad approved these changes Mar 24, 2021

revans2 reviewed Mar 24, 2021

Fix Spark hash of decimal32/decimal64

b42b6c5

abellina approved these changes Mar 24, 2021

Merge branch 'branch-0.19' into fix-spark-hash

e070822

mythrocks reviewed Mar 24, 2021

gerashegalov reviewed Mar 24, 2021

java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java (resolved)

mythrocks approved these changes Mar 24, 2021

revans2 approved these changes Mar 24, 2021

rapids-bot bot merged commit aa7ca46 into rapidsai:branch-0.19 Mar 24, 2021

jlowe deleted the fix-spark-hash branch September 10, 2021 15:46

bdice mentioned this pull request Sep 28, 2021

Add SHA-1 and SHA-2 hash functions. #9215

Closed

vyasr added 4 - Needs Review and removed 4 - Needs cuDF (Java) Reviewer labels Feb 23, 2024
