Support selecting different hash functions in hash_partition #6726

gaohao95 · 2020-11-10T18:07:02Z

This PR intends to

Allow hash_partition to select a different hash function (e.g. identity hash function) in addition to MurmurHash3_32. (Close [FEA] Support selecting different hash functions in hash_partition #6307)
Remove redundant identical hash_partition implementation in src/hash/hashing.cu.

Restrictions:

MD5 is not supported.

GPUtester · 2020-11-10T18:07:04Z

Can one of the admins verify this patch?

nsakharnykh

Would be good to add a unit test with identity hash and unsupported data type, otherwise LGTM

jrhemstad

There is no test for fatal assertion when the hash function is not compatible with the datatype. For example, using identity hash function on string column. Need to wait until #6696 is merged.

We should not rely on release_assert for communicating errors to the user. release_assert is a last resort as it is an unrecoverable error that requires restarting the process.
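As a rough illustration of the point above, the incompatibility can be checked on the host before any kernel is launched, so the caller gets a catchable exception instead of an unrecoverable device-side assertion. This is only a sketch: `type_id`, `hash_id`, `is_numeric`, and `validate_hash_partition_args` are hypothetical stand-ins, and `std::invalid_argument` stands in for whatever error type libcudf actually throws.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the column type and hash function identifiers.
enum class type_id { INT32, FLOAT64, STRING };
enum class hash_id { MURMUR3, IDENTITY };

// Assumption for this sketch: the identity hash only makes sense for numeric types.
bool is_numeric(type_id t) { return t == type_id::INT32 || t == type_id::FLOAT64; }

// Host-side check that runs before any device work is scheduled.
// Throws a recoverable exception instead of asserting on the device.
void validate_hash_partition_args(std::vector<type_id> const& column_types, hash_id hash)
{
  if (hash != hash_id::IDENTITY) { return; }  // the general-purpose hash accepts every type here
  for (auto t : column_types) {
    if (!is_numeric(t)) {
      throw std::invalid_argument("IdentityHash does not support this data type");
    }
  }
}
```

A caller can wrap the call in `try`/`catch` and continue running, which is not possible once a device-side assertion has fired.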

cpp/src/partitioning/partitioning.cu

gaohao95 · 2020-12-01T18:26:18Z

Any updates on reviewing this PR?

rgsl888prabhu · 2020-12-01T18:34:46Z

@gaohao95 Can you please merge the conflicts, then we are ready to merge it.

gaohao95 · 2020-12-01T21:36:55Z

> @gaohao95 Can you please merge the conflicts, then we are ready to merge it.

The conflicts should be addressed.

esoha-nvidia · 2020-12-01T22:24:56Z

cpp/src/partitioning/partitioning.cu

@@ -775,11 +777,25 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
  table_view const& input,


Why not make this function templated? Then you can remove the runtime failures and switch statement below, right? That would let errors show up at compile-time instead of run-time, which is convenient for developers.

So for the caller, it would be:

hash_partition<IdentityHash>(input, columns_to_hash, num_partitions, stream, mr);

instead of:

hash_partition(input, columns_to_hash, num_partitions, hash_id::HASH_IDENTITY, stream, mr);

Also, the data type check is written twice: once here and once above with the enable_if_t. That is repeating yourself. With this change, you would only need to do the check once, and this function would remain the same length instead of growing by 12 lines.

It cannot be a template because most libcudf users are dynamic/interpreted languages like Python/Spark where the relevant information isn't known until runtime. Making the hash function a template parameter would just force the caller to do the switch statement.
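The trade-off described in this reply can be sketched as follows: the public entry point takes a runtime enum and performs the switch once, and only the internal implementation is templated on the hasher. All names here (`hash_id`, `identity_hash`, `mixing_hash`, `partition_impl`, `partition`) are hypothetical and operate on a plain `std::vector` rather than a cudf table. `mixing_hash` uses the 32-bit finalizer step of MurmurHash3 purely as a toy stand-in for the real hash.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class hash_id { MURMUR3, IDENTITY };  // hypothetical runtime selector

struct identity_hash {
  std::uint32_t operator()(std::uint32_t k) const { return k; }
};

// Toy stand-in for a general-purpose hash (MurmurHash3's 32-bit finalizer only).
struct mixing_hash {
  std::uint32_t operator()(std::uint32_t k) const
  {
    k ^= k >> 16;
    k *= 0x85ebca6bu;
    k ^= k >> 13;
    k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k;
  }
};

// Templated implementation: the hasher is known at compile time here.
template <typename Hasher>
std::vector<int> partition_impl(std::vector<std::uint32_t> const& keys, int num_partitions)
{
  std::vector<int> out;
  out.reserve(keys.size());
  Hasher hasher{};
  for (auto k : keys) {
    out.push_back(static_cast<int>(hasher(k) % static_cast<std::uint32_t>(num_partitions)));
  }
  return out;
}

// Runtime entry point: callers such as Python or Spark bindings only know the
// hash function at run time, so the switch lives here instead of in every caller.
std::vector<int> partition(std::vector<std::uint32_t> const& keys, int num_partitions, hash_id hash)
{
  switch (hash) {
    case hash_id::IDENTITY: return partition_impl<identity_hash>(keys, num_partitions);
    case hash_id::MURMUR3: return partition_impl<mixing_hash>(keys, num_partitions);
  }
  throw std::invalid_argument("unsupported hash function");
}
```

If the public function were the template, each binding layer would need its own copy of this switch.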

esoha-nvidia · 2020-12-01T22:26:48Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+  template <typename return_type = result_type>
+  CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
+  operator()(const Key& key) const
+  {
+    release_assert(false && "IdentityHash does not support this data type");
+    return 0;
+  }


Why is this necessary? I think that this code turns a compile time error into a runtime error, which is not good for developers because it will cause them to find their coding errors later.

If you remove this, does the code still compile? If so then this code is never generating anything so it can be removed.

This function object is invoked via the type_dispatcher which will instantiate it for all possible libcudf types. We need to provide a valid instantiation for all types. This includes types that should never actually be invoked (as seen above).

Okay, so if these lines of code are removed then cudf will no longer compile successfully?

For this case, yes. A static_cast from a string to an integer should fail at compile time.
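The exchange above can be reproduced with a toy dispatcher. Because the dispatcher's switch names every supported type, the functor must compile for each of them, including types it should never be called with at run time. This is a simplified sketch, not cudf's `type_dispatcher`: `type_id`, `identity_hash_fn`, and `dispatch` are hypothetical, and the fallback throws a host exception rather than calling `release_assert`.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>

enum class type_id { INT32, FLOAT64, STRING };  // hypothetical type identifiers

struct identity_hash_fn {
  // Enabled for arithmetic types: the key itself is the hash value.
  template <typename Key, std::enable_if_t<std::is_arithmetic<Key>::value, int> = 0>
  std::uint32_t operator()(Key const& key) const
  {
    return static_cast<std::uint32_t>(key);
  }

  // Enabled for everything else. Deleting this overload breaks the build,
  // because dispatch() below still instantiates the functor for std::string.
  template <typename Key, std::enable_if_t<!std::is_arithmetic<Key>::value, int> = 0>
  std::uint32_t operator()(Key const&) const
  {
    throw std::logic_error("IdentityHash does not support this data type");
  }
};

// Toy dispatcher: maps a runtime type id to a compile-time type.
// Every case is compiled, whether or not it is ever taken at run time.
template <typename Functor>
std::uint32_t dispatch(type_id id, Functor f, void const* data)
{
  switch (id) {
    case type_id::INT32: return f(*static_cast<std::int32_t const*>(data));
    case type_id::FLOAT64: return f(*static_cast<double const*>(data));
    case type_id::STRING: return f(*static_cast<std::string const*>(data));
  }
  throw std::logic_error("unknown type id");
}
```

Removing the second overload turns the `STRING` case into a compile error, which matches the answer given in the thread.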

gaohao95 · 2020-12-02T21:46:02Z

@jrhemstad Do you have any other suggestions for this PR? Can you approve it?

raydouglass · 2020-12-02T21:51:43Z

ok to test

GPUtester · 2020-12-02T21:52:14Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

kkraus14 · 2020-12-02T21:52:42Z

add to allowlist

Jake on vacation and review addressed.

cpp/include/cudf/partitioning.hpp

harrism · 2020-12-03T01:22:40Z

@gaohao95 please merge the latest from branch-0.17 and then add the missing include in partitioning.hpp.

codecov · 2020-12-03T05:30:07Z

Codecov Report

Merging #6726 (bac6457) into branch-0.17 (a2d2726) will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff              @@
##           branch-0.17    #6726   +/-   ##
============================================
  Coverage        81.94%   81.94%           
============================================
  Files               96       96           
  Lines            16166    16166           
============================================
  Hits             13247    13247           
  Misses            2919     2919

This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value.

Authors:
- Wonchan Lee (https://github.com/magnatelee)

Approvers:
- https://github.com/gaohao95
- Mark Harris (https://github.com/harrism)
- https://github.com/nvdbaranec
- Jake Hemstad (https://github.com/jrhemstad)

URL: #7771
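One way to picture the collision problem described in this message is sketched below. It assumes the join places keys into buckets by `hash % num_buckets`, which is an assumption of the sketch rather than something the PR text confirms. After partitioning, every key in a partition has the same `hash % num_partitions`, so if the join reuses the same hash those keys can only reach a fraction of the buckets. `murmur3_32` follows the MurmurHash3 x86 32-bit steps for a single 4-byte key and is not cudf's implementation. `join_buckets_used` and the sizes 4 and 8 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Seeded 32-bit hash of one 4-byte key (MurmurHash3 x86_32 steps).
std::uint32_t murmur3_32(std::uint32_t key, std::uint32_t seed)
{
  std::uint32_t k = key;
  k *= 0xcc9e2d51u;
  k = rotl32(k, 15);
  k *= 0x1b873593u;
  std::uint32_t h = seed ^ k;
  h = rotl32(h, 13);
  h = h * 5u + 0xe6546b64u;
  h ^= 4u;  // key length in bytes
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Counts how many join buckets are reachable by the keys that land in partition 0.
std::size_t join_buckets_used(std::uint32_t partition_seed, std::uint32_t join_seed)
{
  constexpr std::uint32_t num_partitions = 4;  // hypothetical partition (GPU) count
  constexpr std::uint32_t num_buckets    = 8;  // hypothetical join table size
  std::set<std::uint32_t> used;
  for (std::uint32_t key = 0; key < 10000; ++key) {
    // Keep only the keys that hash partitioning sends to partition 0.
    if (murmur3_32(key, partition_seed) % num_partitions != 0) { continue; }
    used.insert(murmur3_32(key, join_seed) % num_buckets);
  }
  return used.size();
}
```

With equal seeds, the keys in partition 0 all satisfy `hash % 4 == 0`, so at most two of the eight buckets (0 and 4) can ever be used. Giving the two stages different seeds removes that constraint.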

Select hash functions in hash_partition

6bb99f0

gaohao95 requested a review from a team as a code owner November 10, 2020 18:07

gaohao95 requested review from harrism and rgsl888prabhu November 10, 2020 18:07

nsakharnykh approved these changes Nov 10, 2020

jrhemstad previously requested changes Nov 10, 2020

harrism approved these changes Nov 11, 2020

gaohao95 added 2 commits November 11, 2020 07:19

Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into s…

e5b380a

…elect-hash-partition

Throw hash partition error on the host

f9fcc29

gaohao95 changed the title ~~[WIP] Support selecting different hash functions in hash_partition~~ [REVIEW] Support selecting different hash functions in hash_partition Nov 11, 2020

rgsl888prabhu approved these changes Dec 1, 2020

rgsl888prabhu changed the title ~~[REVIEW] Support selecting different hash functions in hash_partition~~ Support selecting different hash functions in hash_partition Dec 1, 2020

rgsl888prabhu added the labels 5 - Ready to Merge (Testing and reviews complete, ready to merge) and libcudf (Affects libcudf (C++/CUDA) code.) Dec 1, 2020

rgsl888prabhu added the labels non-breaking (Non-breaking change) and feature request (New feature or request) Dec 1, 2020

Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into s…

461049c

…elect-hash-partition

esoha-nvidia reviewed Dec 1, 2020

gaohao95 added 2 commits December 2, 2020 14:01

Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into s…

89bcabc

…elect-hash-partition

CHANGELOG

a7acd21

harrism reviewed Dec 3, 2020

gaohao95 added 2 commits December 2, 2020 17:24

Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into s…

effa95d

…elect-hash-partition

Add missing header

a2571ba

harrism requested changes Dec 3, 2020

Group headers by library

bac6457

harrism approved these changes Dec 3, 2020

harrism added the 6 - Okay to Auto-Merge label Dec 3, 2020

rapids-bot bot merged commit f137ed1 into rapidsai:branch-0.17 Dec 3, 2020

gaohao95 mentioned this pull request Dec 7, 2020

Update Dockerfile rapidsai/distributed-join#48

Merged

gaohao95 deleted the select-hash-partition branch February 17, 2021 23:58

magnatelee mentioned this pull request Mar 30, 2021

Allow hash_partition to take a seed value #7771

Merged
