Allow hash_partition to take a seed value #7771

magnatelee · 2021-03-30T23:03:38Z

This PR allows hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to massive hash collisions and severely degrades join performance on multiple GPUs. An initial fix (#6726) addressed this, but it only added a code path that uses the identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR enables such configurations by making hash_partition accept an optional seed value.

harrism

Looks good, but this behavior needs testing. Please add a gtest that exercises the seed parameter.

cpp/include/cudf/detail/utilities/hash_functions.cuh

gaohao95

The code looks good to me, but I wonder whether we should document somewhere that the seed in IdentityHash and MD5Hash is not used.

nvdbaranec

Seems ok to me. Is there any concern here of old calls to this accidentally getting the seed and stream parameters silently crossed?

magnatelee · 2021-03-31T00:12:49Z

> Seems ok to me. Is there any concern here of old calls to this accidentally getting the seed and stream parameters silently crossed?

I don't think there is. The compiler should raise an error in such cases. (I just tried it locally and got invalid conversion errors.)

magnatelee · 2021-03-31T00:15:22Z

> The code looks good to me, but I wonder whether we should document somewhere that the seed in IdentityHash and MD5Hash is not used.

I don't think we should commit to that in the documentation, because we may later change them to use the seed value.

nvdbaranec

I feel like this can happen, right?

// user intended default stream, but is now getting a seed of 0
hash_partition(t, columns_to_hash, num_partitions, hash_function, 0);

Although maybe there aren't many cases where we're passing 0 manually.

magnatelee · 2021-03-31T00:38:53Z

> I feel like this can happen, right?
>
> // user intended default stream, but is now getting a seed of 0
> hash_partition(t, columns_to_hash, num_partitions, hash_function, 0);
>
> Although maybe there aren't many cases where we're passing 0 manually.

That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream. Doing so would accidentally do what they wanted, for a different reason (i.e., because the default value of the stream argument is 0). I feel that a custom stream being silently crossed is what we should really worry about, and such a case would be rejected by the compiler.

harrism · 2021-03-31T00:42:12Z

> That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream.

Hmmm. Streams are strongly typed in libcudf -- zero is not a valid argument for the stream parameter, it won't compile. But zero will work for this new seed parameter.

cpp/include/cudf/partitioning.hpp

magnatelee · 2021-03-31T00:46:02Z

> > That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream.
>
> Hmmm. Streams are strongly typed in libcudf -- zero is not a valid argument for the stream parameter, it won't compile. But zero will work for this new seed parameter.

I see. Then there is no case where the stream argument would be silently crossed with the seed.

codecov · 2021-03-31T04:30:32Z

Codecov Report

Merging #7771 (56d1552) into branch-0.19 (7871e7a) will increase coverage by 0.81%.
The diff coverage is n/a.

❗ Current head 56d1552 differs from pull request most recent head 7113d6a. Consider uploading reports for the commit 7113d6a to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7771      +/-   ##
===============================================
+ Coverage        81.86%   82.68%   +0.81%     
===============================================
  Files              101      103       +2     
  Lines            16884    17566     +682     
===============================================
+ Hits             13822    14524     +702     
+ Misses            3062     3042      -20

Impacted Files	Coverage Δ
python/cudf/cudf/utils/dtypes.py	`83.44% <0.00%> (-6.08%)`	⬇️
python/cudf/cudf/core/column/lists.py	`87.32% <0.00%> (-4.08%)`	⬇️
python/cudf/cudf/core/column/decimal.py	`92.68% <0.00%> (-2.19%)`	⬇️
python/dask_cudf/dask_cudf/backends.py	`87.50% <0.00%> (-2.13%)`	⬇️
python/cudf/cudf/core/groupby/groupby.py	`92.77% <0.00%> (-0.68%)`	⬇️
python/cudf/cudf/core/column/column.py	`87.61% <0.00%> (-0.15%)`	⬇️
python/cudf/cudf/utils/utils.py	`85.36% <0.00%> (-0.07%)`	⬇️
python/cudf/cudf/io/feather.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`78.71% <0.00%> (ø)`
python/cudf/cudf/comm/serialize.py	`0.00% <0.00%> (ø)`
... and 46 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad9212b...7113d6a.

cpp/include/cudf/table/row_operators.cuh

jrhemstad

On further review, I think this is actually what we want. The seed should be used as the seed for the first hash, and not just combined with the first hash:

    // Hash the first column w/ the seed
    auto const initial_hash = type_dispatcher(_table.column(0).type(),
                                              element_hasher_with_seed<hash_function, has_nulls>{_seed},
                                              _table.column(0),
                                              row_index);

    auto hasher = [=](size_type column_index) {
      return cudf::type_dispatcher(_table.column(column_index).type(),
                                   element_hasher<hash_function, has_nulls>{},
                                   _table.column(column_index),
                                   row_index);
    };

    // Hash each element and combine all the hash values together.
    // Note that the counting iterator starts at 1 and not 0 now, since we already hashed the first column.
    return thrust::transform_reduce(thrust::seq,
                                    thrust::make_counting_iterator(1),
                                    thrust::make_counting_iterator(_table.num_columns()),
                                    hasher,
                                    initial_hash,
                                    hash_combiner);

magnatelee · 2021-03-31T18:07:45Z

> On further review, I think this is actually what we want. The seed should be used as the seed for the first hash, and not just combined with the first hash:
> ...

I think you're right. I just pushed the suggested change. Please take a look.

Addressed.

kkraus14 · 2021-03-31T18:57:06Z

@gpucibot merge

kkraus14 · 2021-03-31T22:03:30Z

rerun tests

magnatelee · 2021-04-01T06:57:49Z

Just pushed a fix to recover the default behavior; the row hasher was originally doing 0 ⊕ hf(col0) ⊕ hf(col1) ⊕ ..., where operator ⊕ is hash_combine, and the new code was doing hf(col0) ⊕ hf(col1) ⊕ ..., which yields a different result because 0 is not the identity of ⊕. I confirmed the fix locally, so hopefully it will pass the CI.

Though I fixed the code to recover the original behavior, I don't think the failing tests were good ones. They check whether hash_partition on a test input produces an output of a known shape, which is bad for two reasons. First, they pass only for a specific implementation of the row hasher instantiated with Murmur3; if any of this changes in the future, the tests will start to fail again. Second, they are not really testing properties of hash_partition. I think what the tests should really check is that the produced partitions are disjoint and that rows with the same key always fall into the same partition, even when they are in different dataframes.

harrism · 2021-04-01T08:43:10Z

Can you put this in an issue? Thanks!

magnatelee · 2021-04-01T18:07:31Z

> Can you put this in an issue? Thanks!

Here is the issue: #7819

magnatelee added 2 commits March 30, 2021 14:30

Add a seed argument to cudf::hash_partition

74a8d14

Fix for the style

66b868f

magnatelee requested review from kkraus14 and gaohao95 March 30, 2021 23:03

magnatelee requested a review from a team as a code owner March 30, 2021 23:03

magnatelee requested review from cwharris and nvdbaranec March 30, 2021 23:03

github-actions bot added the libcudf label Mar 30, 2021

harrism requested changes Mar 30, 2021

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated

harrism added the breaking, improvement, 3 - Ready for Review, and 4 - Needs Review labels Mar 30, 2021

gaohao95 approved these changes Mar 30, 2021

nvdbaranec reviewed Mar 30, 2021

magnatelee added 2 commits March 30, 2021 16:55

Tets with a custom seed value

49a4df9

Change CUDA_HOST_DEVICE_CALLABLE to constexpr

b7020b0

magnatelee requested a review from harrism March 30, 2021 23:57

nvdbaranec requested changes Mar 31, 2021

harrism approved these changes Mar 31, 2021

magnatelee requested a review from nvdbaranec March 31, 2021 00:39

harrism previously requested changes Mar 31, 2021

cpp/include/cudf/partitioning.hpp Outdated

nvdbaranec approved these changes Mar 31, 2021

Jake's comments

901f8c2

magnatelee requested a review from jrhemstad March 31, 2021 02:30

jrhemstad reviewed Mar 31, 2021

cpp/include/cudf/table/row_operators.cuh Outdated

jrhemstad reviewed Mar 31, 2021

Jake's fix to get the hash seed applied correctly in row_hasher

429658c

magnatelee requested a review from jrhemstad March 31, 2021 18:07

jrhemstad approved these changes Mar 31, 2021

kkraus14 requested review from harrism and removed request for harrism, cwharris and kkraus14 March 31, 2021 18:35

jrhemstad added the 5 - Ready to Merge label and removed the 3 - Ready for Review and 4 - Needs Review labels Mar 31, 2021

harrism approved these changes Mar 31, 2021

kkraus14 added the 0 - Waiting on Author label and removed the 5 - Ready to Merge label Apr 1, 2021

Make sure to combine hash values in the same way as the other row_has…

7113d6a

…her does

rapids-bot bot merged commit 299f6cc into rapidsai:branch-0.19 Apr 1, 2021

magnatelee mentioned this pull request Apr 1, 2021

[FEA] Improve cudf tests for hash_partition #7819

Open
