Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct hashing support for SerialMurmur3 and SparkMurmur3 #7714

Merged
merged 5 commits into from
Mar 31, 2021

Conversation

jlowe
Copy link
Member

@jlowe jlowe commented Mar 24, 2021

Adding struct column support for serial Murmur3 and Spark-compatible Murmur3 hashing. This explodes the struct column into the leaf columns before passing it to the existing hash support. The validity of the parent struct columns can be ignored because hashing a null ends up as a no-op that returns the hash seed, so only the leaf columns within the struct column need to be considered for the hash computation.

@jlowe jlowe added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS 4 - Needs cuDF (Java) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 24, 2021
@jlowe jlowe self-assigned this Mar 24, 2021
@jlowe jlowe requested review from a team as code owners March 24, 2021 23:12
@codecov
Copy link

codecov bot commented Mar 25, 2021

Codecov Report

Merging #7714 (1fdcdf5) into branch-0.19 (7871e7a) will increase coverage by 0.84%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7714      +/-   ##
===============================================
+ Coverage        81.86%   82.70%   +0.84%     
===============================================
  Files              101      101              
  Lines            16884    17419     +535     
===============================================
+ Hits             13822    14407     +585     
+ Misses            3062     3012      -50     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/lists.py 87.68% <0.00%> (-3.72%) ⬇️
python/cudf/cudf/core/column/decimal.py 93.84% <0.00%> (-1.03%) ⬇️
python/cudf/cudf/core/column/column.py 87.53% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/utils/utils.py 85.36% <0.00%> (-0.07%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 78.71% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
... and 44 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d49f75...1fdcdf5. Read the comment docs.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the java side things look good to me.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small comments.

cpp/tests/hashing/hash_test.cpp Show resolved Hide resolved
cpp/src/hash/hashing.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but a couple of questions.

hash_value_type CUDA_DEVICE_CALLABLE
MurmurHash3_32<cudf::list_view>::operator()(cudf::list_view const& key) const
{
cudf_assert(false && "List column hashing is not supported");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be a static_assert? cudf_assert will only fail in a debug build, and only at runtime. This path should result in a compile-time error, shouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, static_assert would be ideal. I originally tried a static assert, but I believe the type dispatcher causes it to be referenced in the switch statement and fails the build. I think that's why cudf_assert is used in some other hash functions in this header file for unsupported types.

cudf_assert will only fail in a debug build

That's unfortunate. Is there a separate assertion utility that should be used here that will fire in a release build as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDF_FAIL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is device code. CUDF_FAIL won't work.

That's unfortunate. Is there a separate assertion utility that should be used here that will fire in a release build as well?

release_assert used to do that, but then we found out it massively increased our binary size #7583

There's no good way to do error checking from device code like this that doesn't have a significant impact on performance. We just need to make sure to catch these kinds of things on the host beforehand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a macro that simply asserts on the device in a release build without any lengthy string stuff such as function, file and line number? Not super useful for debugging the problem, but it would prevent the code from just running and returning some bogus result without an error in case the host-side checks aren't there or working properly. Does that still bloat the binary or harm performance in some way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that still bloat the binary or harm performance in some way?

Yeah, the mere presence of an assert, even unexecuted, impacts performance.

__global__ void benchmark_kernel(value_type* a, value_type* b, int n, bool do_asserts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i >= n)
        return;
    if(do_asserts)
    {
        assert(b[i] <= static_cast<value_type>(n));
    }
    value_type x = tanh(a[i]*sinf(b[i]));
    x = cosf(x);
    if(do_asserts)
    {
        assert(x <= 1.f);
    }
    a[i] = x;
}

With do_asserts = false:

without -DNDEBUG:
repetitions:100 median: 3.02846ms min: 2.95219ms p20: 2.99827ms p80: 3.05152ms max: 5.3207ms

with -DNDEBUG:
repetitions:100 median: 2.816ms min: 2.7904ms p20: 2.80781ms p80: 2.82726ms max: 2.84467ms

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have debug builds getting tested under CI at least on a nightly/weekly basis in 0.20 which will help flag any code going down a path it shouldn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what's the action here @jlowe @jrhemstad ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, there's nothing more to do. The host code is already checking the types before attempting to dispatch, and these specializations using cudf_assert are only in place just in case somehow they get invoked errantly in practice. The performance hit from using an actual device assert in release code is too great, so cudf_assert is as close as we can get, IIUC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I thought Jake was saying just having a cudf_assert was expensive. Misunderstood.

auto child_columns = to_leaf_columns(col.child_begin(), col.child_end());
leaf_columns.insert(leaf_columns.end(), child_columns.begin(), child_columns.end());
} else {
leaf_columns.emplace_back(col);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be worried that the else uses emplacement while the if uses insertion? I'm not sure how to understand this. Is col being moved while child_columns are being copied?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col is being used as a copy-constructor argument and child_columns are being copied as they are inserted. These are all dealing with column_view which owns no resources, so there shouldn't be any move semantics here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I was hoping that somehow the two branches could be made more symmetric to avoid confusion. But I looked at how to do it and I couldn't find a way.

cpp/src/hash/hashing.cu Outdated Show resolved Hide resolved
@harrism harrism added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 31, 2021
@harrism
Copy link
Member

harrism commented Mar 31, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9970f1d into rapidsai:branch-0.19 Mar 31, 2021
@jlowe jlowe deleted the spark-hash-struct branch September 10, 2021 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants