Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow hash_partition to take a seed value #7771

Merged
merged 8 commits into from
Apr 1, 2021

Conversation

magnatelee
Copy link
Contributor

This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making hash_partition accept an optional seed value.

@magnatelee magnatelee requested a review from a team as a code owner March 30, 2021 23:03
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 30, 2021
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but this behavior needs testing. Please add a gtest that exercises the seed parameter.

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
@harrism harrism added breaking Breaking change improvement Improvement / enhancement to an existing function 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Mar 30, 2021
Copy link
Contributor

@gaohao95 gaohao95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me, but I wonder whether we should document somewhere that the seed in IdentityHash and MD5Hash is not used.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me. Is there any concern here of old calls to this accidentally getting the seed and stream parameters silently crossed?

@magnatelee magnatelee requested a review from harrism March 30, 2021 23:57
@magnatelee
Copy link
Contributor Author

Seems ok to me. Is there any concern here of old calls to this accidentally getting the seed and stream parameters silently crossed?

I don't think there is. The compiler should raise a compiler error in such cases. (just tried it locally and got invalid conversion errors.)

@magnatelee
Copy link
Contributor Author

The code looks good to me, but I wonder whether we should document somewhere that the seed in IdentityHash and MD5Hash is not used.

I think we shouldn't nail that down to the documentation, because we can later change them to use the seed value.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this can happen, right?

// user intended default stream, but is now getting a seed of 0
hash_partition(t, columns_to_hash, num_partitions, hash_function, 0);

Although maybe there aren't many cases where we're passing 0 manually.

@magnatelee
Copy link
Contributor Author

I feel like this can happen, right?

// user intended default stream, but is now getting a seed of 0
hash_partition(t, columns_to_hash, num_partitions, hash_function, 0);

Although maybe there aren't many cases where we're passing 0 manually.

That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream. Doing so would accidentally do what he wanted, for a different reason (i.e., because the default value to the stream argument is 0). I feel a custom stream being silently crossed is what we should really worry about, and such a case would be rejected by the compiler.

@harrism
Copy link
Member

harrism commented Mar 31, 2021

That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream.

Hmmm. Streams are strongly typed in libcudf -- zero is not a valid argument for the stream parameter, it won't compile. But zero will work for this new seed parameter.

cpp/include/cudf/partitioning.hpp Outdated Show resolved Hide resolved
@magnatelee
Copy link
Contributor Author

That's a possibility, but like you said, the user wouldn't override the default argument just to pass the default stream.

Hmmm. Streams are strongly typed in libcudf -- zero is not a valid argument for the stream parameter, it won't compile. But zero will work for this new seed parameter.

I see. then there is no case that the stream argument would be silently crossed with the seed.

@codecov
Copy link

codecov bot commented Mar 31, 2021

Codecov Report

Merging #7771 (56d1552) into branch-0.19 (7871e7a) will increase coverage by 0.81%.
The diff coverage is n/a.

❗ Current head 56d1552 differs from pull request most recent head 7113d6a. Consider uploading reports for the commit 7113d6a to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7771      +/-   ##
===============================================
+ Coverage        81.86%   82.68%   +0.81%     
===============================================
  Files              101      103       +2     
  Lines            16884    17566     +682     
===============================================
+ Hits             13822    14524     +702     
+ Misses            3062     3042      -20     
Impacted Files Coverage Δ
python/cudf/cudf/utils/dtypes.py 83.44% <0.00%> (-6.08%) ⬇️
python/cudf/cudf/core/column/lists.py 87.32% <0.00%> (-4.08%) ⬇️
python/cudf/cudf/core/column/decimal.py 92.68% <0.00%> (-2.19%) ⬇️
python/dask_cudf/dask_cudf/backends.py 87.50% <0.00%> (-2.13%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 92.77% <0.00%> (-0.68%) ⬇️
python/cudf/cudf/core/column/column.py 87.61% <0.00%> (-0.15%) ⬇️
python/cudf/cudf/utils/utils.py 85.36% <0.00%> (-0.07%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 78.71% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad9212b...7113d6a. Read the comment docs.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further review, I think this is actually what we want. The seed should be used as the seed for the first hash, and not just combined with the first hash:

   // Hash the first column w/ the seed
   auto const initial_hash = type_dispatcher(_table.column(0).type(), element_hasher_with_seed<hash_function, has_nulls>{_seed}, _table.column(0), row_index);

    auto hasher = [=](size_type column_index) {
      return cudf::type_dispatcher(_table.column(column_index).type(),
                                   element_hasher<hash_function, has_nulls>{},
                                   _table.column(column_index),
                                   row_index);
    };

    // Hash each element and combine all the hash values together
    return thrust::transform_reduce(thrust::seq,
                                    thrust::make_counting_iterator(1), // note that this starts at 1 and not 0 now since we already hashed the first column 
                                    thrust::make_counting_iterator(_table.num_columns()),
                                    hasher,
                                    initial_hash,
                                    hash_combiner);

@magnatelee
Copy link
Contributor Author

magnatelee commented Mar 31, 2021

On further review, I think this is actually what we want. The seed should be used as the seed for the first hash, and not just combined with the first hash:
...

I think you're right. I just pushed the suggested change. Please take a look.

@kkraus14 kkraus14 requested review from harrism and removed request for harrism, cwharris and kkraus14 March 31, 2021 18:35
@jrhemstad jrhemstad added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Mar 31, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14 kkraus14 added 0 - Waiting on Author Waiting for author to respond to review and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Apr 1, 2021
@magnatelee
Copy link
Contributor Author

Just pushed a fix to recover the default behavior; the row hasher was originally doing 0 ⊕ hf(col0) ⊕ hf(col1) ⊕ ..., where operator ⊕ is hash_combine, and the new code was doing hf(col0) ⊕ hf(col1) ⊕ ..., which yields a different result because 0 is not the identity of ⊕. I confirmed the fix locally, so hopefully it will pass the CI.

Though I fixed the code to recover the original behavior, I don't think the failing tests were good ones; the tests are checking if hash_partition on a test input produces an output of a known shape, which is bad for two reasons: First, they would pass only for a specific implementation of the row hasher instantiated with Murmur3. If any of this changes in the future, the tests will start to fail again. Second, they are not really testing properties of hash_partition. I think what the tests should really be testing are that the produced partitions are disjoint and that rows with the same key always fall under the same partition even when they are in different dataframes.

@harrism
Copy link
Member

harrism commented Apr 1, 2021

Can you put this in an issue? Thanks!

@rapids-bot rapids-bot bot merged commit 299f6cc into rapidsai:branch-0.19 Apr 1, 2021
@magnatelee
Copy link
Contributor Author

Can you put this in an issue? Thanks!

Here is the issue: #7819

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Waiting on Author Waiting for author to respond to review breaking Breaking change improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants