Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose conditional join size calculation #8928

Merged
merged 27 commits into from
Aug 13, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Aug 2, 2021

Resolves #8918 by providing a new API for getting the output size for conditional joins (except full joins). This PR removes the unnecessary conditional_join.cuh header and inlines the logic into the conditional_join.cu file where it is used, and adds the new logic into that file as well. The public APIs are now exposed in conditional_join.hpp.

Adding the join size calculation also revealed a couple of bugs in the conditional join tests that were hiding a real bug in the conditional join implementation. The main test bug was the use of std::equal with the actual result as the first iterator, so if the actual result was empty it was never compared against the expected result (even if it was nonempty). This bug masked a couple of minor errors in the expected outputs encoded in the test. These are now fixed. This bug was also hiding a deeper issue where the AST device code was always using the left row index to pull data for the left hand operand to binary operations, even when the lhs was actually a column from the right table. That bug is now fixed as well.

Contributes to #8145.

@vyasr vyasr added feature request New feature or request 3 - Ready for Review Ready for review by team code quality libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function labels Aug 2, 2021
@vyasr vyasr added this to the Conditional Joins milestone Aug 2, 2021
@vyasr vyasr requested a review from a team as a code owner August 2, 2021 21:16
@vyasr vyasr self-assigned this Aug 2, 2021
@vyasr vyasr requested a review from karthikeyann August 2, 2021 21:16
@vyasr vyasr added non-breaking Non-breaking change and removed improvement Improvement / enhancement to an existing function labels Aug 2, 2021
@vyasr
Copy link
Contributor Author

vyasr commented Aug 2, 2021

@jlowe it looks like the hash join APIs for the output size are all detail APIs, so I assume that the same is fine here but let me know if that's not the case and I can plumb this up higher in the hierarchy.

@jlowe
Copy link
Member

jlowe commented Aug 2, 2021

it looks like the hash join APIs for the output size are all detail APIs

The APIs are public, part of the hash_join class. We will be calling this from an application and providing bindings to it from Java, so it should not be hidden in a detail header.

@vyasr vyasr force-pushed the feature/conditional_join_size branch from 5890929 to 87701fe Compare August 2, 2021 21:38
@vyasr
Copy link
Contributor Author

vyasr commented Aug 2, 2021

it looks like the hash join APIs for the output size are all detail APIs

The APIs are public, part of the hash_join class. We will be calling this from an application and providing bindings to it from Java, so it should not be hidden in a detail header.

I forgot that the hash join class itself was public, and none of free functions in cuDF for joins had this parameter added so I was confused (such as inner_join vs hash_join::inner_join). This is fixed now.

@codecov
Copy link

codecov bot commented Aug 2, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@2b92220). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8928   +/-   ##
===============================================
  Coverage                ?   10.65%           
===============================================
  Files                   ?      114           
  Lines                   ?    19080           
  Branches                ?        0           
===============================================
  Hits                    ?     2033           
  Misses                  ?    17047           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b92220...2e4e94a. Read the comment docs.

…tional_join_size

# Conflicts:
#	cpp/src/join/conditional_join.cuh
#	cpp/src/transform/compute_column.cu
@jlowe
Copy link
Member

jlowe commented Aug 9, 2021

The failing test is intended to test the cudf behavior when an invalid table reference is specified. It's asking for a RIGHT table column on an expression that only takes one table. In the past it would silently ignore table references and act like it was referencing the LEFT table, but that doesn't seem to be the case anymore. If the behavior is explicitly undefined for this case we can remove the test, but it would be interesting to know why the behavior changed.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Aug 9, 2021
@vyasr
Copy link
Contributor Author

vyasr commented Aug 9, 2021

Discussed offline, I'm changing the behavior to always throw an exception in compute_column. This also necessitates updating the test, but the new (defined) behavior is preferable to the old UB.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally +1. A couple of clarification questions, really.

This was an enjoyable, and informative read. I'd like to look more closely at the kernels themselves, later on.

@vyasr vyasr requested a review from a team as a code owner August 13, 2021 16:08
@vyasr vyasr requested a review from mythrocks August 13, 2021 16:35
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vyasr
Copy link
Contributor Author

vyasr commented Aug 13, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fb29071 into rapidsai:branch-21.10 Aug 13, 2021
rapids-bot bot pushed a commit that referenced this pull request Aug 16, 2021
This adds Java bindings to the conditional join output size APIs added in #8928.  In preparation for adding Java APIs for hash-join output sizing, it also refactors the names of the conditional join APIs to avoid overloading by adding "conditional" to the method names.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #9002
@vyasr vyasr deleted the feature/conditional_join_size branch January 14, 2022 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Expose size estimation for conditional joins
4 participants