[BUG] hash join gather maps limited by size_type #8121

Closed
revans2 opened this issue Apr 30, 2021 · 1 comment · Fixed by #8139
Labels: bug, libcudf, Spark

Comments


revans2 commented Apr 30, 2021

Describe the bug
Hash join gather maps were implemented to get around issues when a join explodes (#7454).

But there is still a limit in place on the maximum result size that can be returned by a join.

    CUDF_EXPECTS(h_size_estimate <
                   static_cast<estimate_size_type>(std::numeric_limits<cudf::size_type>::max()),
                 "Maximum join output size exceeded");
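
To make the limit concrete, here is a minimal standalone sketch of the same comparison. It does not use libcudf: `size_type` below is a stand-in for `cudf::size_type`, assumed to be a 32-bit signed integer, and `below_size_type_max` is a hypothetical helper name, not a libcudf function.

```cpp
#include <cstdint>
#include <limits>

// Stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper mirroring the quoted check: returns true only when the
// estimated number of output rows is strictly below the size_type maximum.
constexpr bool below_size_type_max(std::uint64_t estimated_rows)
{
  return estimated_rows <
         static_cast<std::uint64_t>(std::numeric_limits<size_type>::max());
}

// The largest value a 32-bit signed size_type can hold.
static_assert(std::numeric_limits<size_type>::max() == 2147483647,
              "size_type is assumed to be 32-bit signed");
```

Under that assumption, any join whose estimated output reaches 2,147,483,647 rows fails the check, even if the result would be returned as gather maps rather than as a materialized table.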

Steps/Code to reproduce bug
I was probing the limits of the new Spark out-of-core join implementation and hit this when joining a lot by a lot: 10 billion rows on the left-hand side and 10 million on the right-hand side. Because of how joins work in Spark, this was a shuffled hash join, so there were 4 separate joins happening. The keys were evenly distributed and there were 100,000 of them, so you could probably divide all the numbers by 4 and still get the error. It should not be hard to reproduce.
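
As a rough check of those numbers, the sketch below estimates the output size. It assumes an inner equi-join with perfectly even key distribution, and that each of the 4 shuffle partitions receives a quarter of the rows and a quarter of the keys. `estimated_join_rows` is a hypothetical helper for illustration only.

```cpp
#include <cstdint>

// Estimated inner-join output rows when every key appears
// left_rows / num_keys times on the left and right_rows / num_keys
// times on the right (perfectly even distribution assumed).
constexpr std::uint64_t estimated_join_rows(std::uint64_t left_rows,
                                            std::uint64_t right_rows,
                                            std::uint64_t num_keys)
{
  return (left_rows / num_keys) * (right_rows / num_keys) * num_keys;
}

// Whole join: 100,000 left rows per key times 100 right rows per key,
// over 100,000 keys.
constexpr std::uint64_t whole_join =
  estimated_join_rows(10'000'000'000ULL, 10'000'000ULL, 100'000ULL);

// One of the 4 shuffle partitions: all inputs divided by 4.
constexpr std::uint64_t one_partition =
  estimated_join_rows(2'500'000'000ULL, 2'500'000ULL, 25'000ULL);

static_assert(whole_join == 1'000'000'000'000ULL, "10^12 rows in total");
static_assert(one_partition == 250'000'000'000ULL, "2.5 * 10^11 rows per partition");
```

Under these assumptions a single partition already produces about 2.5 × 10^11 rows, roughly two orders of magnitude above the 32-bit signed maximum of 2,147,483,647.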

Expected behavior
We can produce a gather map that is larger than what can fit in a column.

revans2 added the bug and Needs Triage labels Apr 30, 2021

shwina commented Apr 30, 2021

Good catch. Looks like we need to use size_t here instead of size_type. Fix coming up.

harrism added the libcudf and Spark labels and removed the Needs Triage label May 4, 2021
rapids-bot closed this as completed in #8139 May 5, 2021
rapids-bot pushed a commit that referenced this issue May 5, 2021