Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark #7672

Merged on Mar 24, 2021 (3 commits)

@jlowe jlowe commented Mar 22, 2021

#7024 added a Spark variant of Murmur3 hashing, but it is inconsistent with Apache Spark's hash calculations in a few areas:

  • Apache Spark does not treat -0.0 and 0.0 as the same value when hashing floats and doubles
  • byte and short integral values are upcast to a 32-bit unsigned int (i.e., zero-filled) before calculating the hash

In addition, libcudf allows hashing of timestamp columns, but the JNI bindings asserted when timestamp columns were passed in, which prevented hashing on timestamps directly.
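
To illustrate the signed-zero point, here is a minimal C++ sketch. It is not libcudf code: `float_bits`, `double_bits`, `normalize_zero`, and `check_signed_zero` are hypothetical helpers, and the sketch assumes IEEE 754 floats and doubles. It shows that 0.0 and -0.0 compare equal as values but differ in their raw bits, so a hash over the raw bits tells them apart unless the value is normalized first.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a float (assumes 32-bit IEEE 754).
inline std::uint32_t float_bits(float f)
{
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// Return the raw bit pattern of a double (assumes 64-bit IEEE 754).
inline std::uint64_t double_bits(double d)
{
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  return bits;
}

// Hypothetical normalization step: maps -0.0 to +0.0 and leaves every other
// value (including NaN) unchanged. A hash that must distinguish the two zeros
// would skip this step and hash the raw bits directly.
template <typename T>
T normalize_zero(T value)
{
  return value == T(0) ? T(0) : value;
}

// Hypothetical self-check of the properties described above.
inline void check_signed_zero()
{
  assert(0.0f == -0.0f);                          // equal as values
  assert(float_bits(0.0f) != float_bits(-0.0f));  // different as raw bits
  assert(double_bits(0.0) != double_bits(-0.0));
  assert(float_bits(normalize_zero(-0.0f)) == float_bits(0.0f));
}
```

Whether the zeros hash to the same value therefore depends only on whether something like `normalize_zero` runs before the bits are fed to the hash.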

@jlowe jlowe added the 3 - Ready for Review, libcudf, Java, Spark, 4 - Needs cuDF (Java) Reviewer, and non-breaking labels Mar 22, 2021
@jlowe jlowe requested a review from a team as a code owner March 22, 2021 20:59
@jlowe jlowe self-assigned this Mar 22, 2021
@jlowe jlowe requested review from mythrocks and ttnghia March 22, 2021 20:59
@jlowe jlowe added the bug Something isn't working label Mar 22, 2021
codecov bot commented Mar 23, 2021

Codecov Report

Merging #7672 (50a58e0) into branch-0.19 (7871e7a) will increase coverage by 0.22%.
The diff coverage is n/a.

❗ Current head 50a58e0 differs from pull request most recent head e070822. Consider uploading reports for the commit e070822 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7672      +/-   ##
===============================================
+ Coverage        81.86%   82.09%   +0.22%     
===============================================
  Files              101      101              
  Lines            16884    17064     +180     
===============================================
+ Hits             13822    14008     +186     
+ Misses            3062     3056       -6     

| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/column/categorical.py | 91.62% <ø> (+0.23%) ⬆️ |
| python/cudf/cudf/core/column/column.py | 87.77% <ø> (+0.01%) ⬆️ |
| python/cudf/cudf/core/column/datetime.py | 89.09% <ø> (ø) |
| python/cudf/cudf/core/column/decimal.py | 92.75% <ø> (-2.12%) ⬇️ |
| python/cudf/cudf/core/column/lists.py | 89.60% <ø> (-1.80%) ⬇️ |
| python/cudf/cudf/core/column/numerical.py | 94.83% <ø> (-0.20%) ⬇️ |
| python/cudf/cudf/core/column/string.py | 86.58% <ø> (+0.08%) ⬆️ |
| python/cudf/cudf/core/column/timedelta.py | 88.23% <ø> (ø) |
| python/cudf/cudf/core/column_accessor.py | 95.87% <ø> (+0.55%) ⬆️ |
| python/cudf/cudf/core/dataframe.py | 90.78% <ø> (+0.31%) ⬆️ |

... and 22 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 267d29b...e070822.

@jrhemstad

@gpucibot merge


@revans2 revans2 left a comment

These changes look fine.

Is this intended to work with decimal? There are no tests for it, but it is also not explicitly disallowed either.

It feels like we can add that in as a simple specialization. The raw int value of a DECIMAL32 should be cast to a long and then hashed. The raw long value of a DECIMAL64 should just be hashed.
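
A minimal sketch of that suggestion follows. It is not libcudf code, and every function name in it is hypothetical. The mixing constants are those of the public MurmurHash3 x86_32 algorithm. Hashing a 64-bit value as two 32-bit words (low word first, finalized with length 8) is an assumption about how Spark hashes a long. The point is only that a DECIMAL32's unscaled value is sign-extended to 64 bits and then takes the same path as a DECIMAL64.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// MurmurHash3 x86_32 block mixing and finalization steps.
inline std::uint32_t mix_k1(std::uint32_t k1)
{
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

inline std::uint32_t mix_h1(std::uint32_t h1, std::uint32_t k1)
{
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

inline std::uint32_t fmix(std::uint32_t h, std::uint32_t len)
{
  h ^= len;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Assumed long hashing: low 32 bits, then high 32 bits, finalized with length 8.
inline std::uint32_t hash_int64(std::int64_t value, std::uint32_t seed)
{
  auto const bits = static_cast<std::uint64_t>(value);
  auto const low  = static_cast<std::uint32_t>(bits);
  auto const high = static_cast<std::uint32_t>(bits >> 32);
  std::uint32_t h1 = mix_h1(seed, mix_k1(low));
  h1               = mix_h1(h1, mix_k1(high));
  return fmix(h1, 8);
}

// DECIMAL32: sign-extend the unscaled 32-bit value to 64 bits, then hash as a long.
inline std::uint32_t hash_decimal32(std::int32_t unscaled, std::uint32_t seed)
{
  return hash_int64(static_cast<std::int64_t>(unscaled), seed);
}

// DECIMAL64: the unscaled value is already 64 bits, so hash it directly.
inline std::uint32_t hash_decimal64(std::int64_t unscaled, std::uint32_t seed)
{
  return hash_int64(unscaled, seed);
}

// Hypothetical self-check: both decimal widths hash the same unscaled value identically.
inline void check_decimal_hashing()
{
  assert(hash_decimal32(-5, 42u) == hash_decimal64(-5, 42u));
}
```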

jlowe commented Mar 24, 2021

> Is this intended to work with decimal? There are no tests for it, but it is also not explicitly disallowed either.

Excellent point. I'll add support for properly hashing decimal32 and decimal64 along with tests.

hash_value_type CUDA_DEVICE_CALLABLE
SparkMurmurHash3_32<numeric::decimal32>::operator()(numeric::decimal32 const& key) const
{
  return this->compute<uint64_t>(key.value());
}

I won't hold up the PR for this, but there's a trick for reducing the number of specializations: check whether a .value() method is available on the key type. :]
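
One way such a check could look is sketched below. This is only an illustration of the detection idiom and may differ from what the reviewer has in mind. `has_value_method`, `hash_input`, and the stand-in type `fake_decimal32` are all hypothetical names, and `void_t` is defined by hand so the sketch does not depend on C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>

// Hand-rolled void_t so the sketch also compiles as C++14.
template <typename... Ts>
struct make_void { using type = void; };
template <typename... Ts>
using void_t = typename make_void<Ts...>::type;

// Primary template: by default a type has no .value() method.
template <typename T, typename = void>
struct has_value_method : std::false_type {};

// Chosen only when `key.value()` is a valid expression for a const T.
template <typename T>
struct has_value_method<T, void_t<decltype(std::declval<T const&>().value())>>
  : std::true_type {};

// Hypothetical stand-in for a fixed-point type that exposes its unscaled value.
struct fake_decimal32 {
  std::int32_t unscaled;
  std::int32_t value() const { return unscaled; }
};

// One overload covers every type with .value(): widen the unscaled value to 64 bits.
template <typename T, std::enable_if_t<has_value_method<T>::value>* = nullptr>
std::int64_t hash_input(T const& key)
{
  return static_cast<std::int64_t>(key.value());
}

// All other types are passed through unchanged.
template <typename T, std::enable_if_t<!has_value_method<T>::value>* = nullptr>
T hash_input(T const& key)
{
  return key;
}

// Hypothetical self-check of the trait and of overload selection.
inline void check_value_detection()
{
  static_assert(has_value_method<fake_decimal32>::value, "decimal-like type is detected");
  static_assert(!has_value_method<std::int32_t>::value, "plain integers are not");
  assert(hash_input(fake_decimal32{-7}) == -7);
  assert(hash_input(std::int16_t{3}) == 3);
}
```

With a trait like this, a single constrained overload can replace one explicit specialization per decimal type.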

@rapids-bot rapids-bot bot merged commit aa7ca46 into rapidsai:branch-0.19 Mar 24, 2021
@jlowe jlowe deleted the fix-spark-hash branch September 10, 2021 15:46
@vyasr vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label Feb 23, 2024
Labels: 3 - Ready for Review, 4 - Needs Review, bug, Java, libcudf, non-breaking, Spark

8 participants