Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support selecting different hash functions in hash_partition #6726

Merged
merged 9 commits into from
Dec 3, 2020

Conversation

gaohao95
Copy link
Contributor

@gaohao95 gaohao95 commented Nov 10, 2020

This PR intends to

Restrictions:

  • MD5 is not supported.

@gaohao95 gaohao95 requested a review from a team as a code owner November 10, 2020 18:07
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

1 similar comment
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

Copy link
Collaborator

@nsakharnykh nsakharnykh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add a unit test with identity hash and unsupported data type, otherwise LGTM

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test for fatal assertion when the hash function is not compatible with the datatype. For example, using identity hash function on string column. Need to wait until #6696 is merged.

We should not rely on release_assert for communicating errors to the user. release_assert is a last resort as it is an unrecoverable error that requires restarting the process.

cpp/src/partitioning/partitioning.cu Outdated Show resolved Hide resolved
@gaohao95 gaohao95 changed the title [WIP] Support selecting different hash functions in hash_partition [REVIEW] Support selecting different hash functions in hash_partition Nov 11, 2020
@gaohao95
Copy link
Contributor Author

gaohao95 commented Dec 1, 2020

Any updates on reviewing this PR?

@rgsl888prabhu rgsl888prabhu changed the title [REVIEW] Support selecting different hash functions in hash_partition Support selecting different hash functions in hash_partition Dec 1, 2020
@rgsl888prabhu
Copy link
Contributor

@gaohao95 Can you please merge the conflicts, then we are ready to merge it.

@rgsl888prabhu rgsl888prabhu added 5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code. labels Dec 1, 2020
@rgsl888prabhu rgsl888prabhu added non-breaking Non-breaking change feature request New feature or request labels Dec 1, 2020
@gaohao95
Copy link
Contributor Author

gaohao95 commented Dec 1, 2020

@gaohao95 Can you please merge the conflicts, then we are ready to merge it.

The conflicts should be addressed.

@@ -775,11 +777,25 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
table_view const& input,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this function templated? Then you can remove the runtime failures and switch statement below, right? That would let errors show up at compile-time instead of run-time, which is convenient for developers.

So for the caller, it would be:

hash_partition<IndentityHash>(input, columns_to_hash, num_paritions, steam, mr);

instead of:

hash_partition(input, columns_to_hash, num_paritions, hash_id::HASH_IDENTITY, steam, mr);

Also, you have written your check for data type twice, once here and once above with the if_enable_t. That is repeating yourself. Instead of doing the test twice, with this change, now you will only need to do it once. Also this function will remain the same length instead of growing by 12 lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be a template because most libcudf users are dynamic/interpreted languages like Python/Spark where the relevant information isn't known until runtime. Making the hash function a template parameter would just force the caller to do the switch statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay!

Comment on lines +582 to +588
template <typename return_type = result_type>
CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
operator()(const Key& key) const
{
release_assert(false && "IdentityHash does not support this data type");
return 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? I think that this code turns a compile time error into a runtime error, which is not good for developers because it will cause them to find their coding errors later.

If you remove this, does the code still compile? If so then this code is never generating anything so it can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function object is invoked via the type_dispatcher which will instantiate it for all possible libcudf types. We need to provide a valid instantiation for all types. This includes types that should never actually be invoked (as seen above).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so if these lines of code are removed then cudf will no longer compile successfully?

Copy link
Contributor Author

@gaohao95 gaohao95 Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case yes. static_cast a string to an integer should fail at compile time.

@gaohao95
Copy link
Contributor Author

gaohao95 commented Dec 2, 2020

@jrhemstad Do you have any other suggestions for this PR? Can you approve it?

@raydouglass
Copy link
Member

ok to test

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 2, 2020

add to allowlist

@harrism harrism dismissed jrhemstad’s stale review December 3, 2020 01:19

Jake on vacation and review addressed.

@harrism
Copy link
Member

harrism commented Dec 3, 2020

@gaohao95 please merge the latest from branch-0.17 and then add the missing include in partitioning.hpp.

cpp/include/cudf/partitioning.hpp Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 3, 2020

Codecov Report

Merging #6726 (bac6457) into branch-0.17 (a2d2726) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.17    #6726   +/-   ##
============================================
  Coverage        81.94%   81.94%           
============================================
  Files               96       96           
  Lines            16166    16166           
============================================
  Hits             13247    13247           
  Misses            2919     2919           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a2d2726...bac6457. Read the comment docs.

@rapids-bot rapids-bot bot merged commit f137ed1 into rapidsai:branch-0.17 Dec 3, 2020
@gaohao95 gaohao95 deleted the select-hash-partition branch February 17, 2021 23:58
rapids-bot bot pushed a commit that referenced this pull request Apr 1, 2021
This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value.

Authors:
  - Wonchan Lee (https://github.com/magnatelee)

Approvers:
  - https://github.com/gaohao95
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #7771
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support selecting different hash functions in hash_partition
9 participants