[BUG] hash join gather maps limited by size_type #8121

revans2 · 2021-04-30T11:33:27Z

Describe the bug
Hash join gather maps were implemented to get around issues when a join explodes. #7454

But there is still a limit in place on the maximum result size that can be returned by a join.

Lines 152 to 154 in cea6c20

    
    CUDF_EXPECTS(h_size_estimate <
                   static_cast<estimate_size_type>(std::numeric_limits<cudf::size_type>::max()),
                 "Maximum join output size exceeded");
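To illustrate what this check rejects, here is a minimal standalone sketch that does not depend on libcudf. It assumes `cudf::size_type` is a 32-bit signed integer and uses a 64-bit integer as a stand-in for `estimate_size_type`. The helper name `fits_in_size_type` is hypothetical and is not part of libcudf.

```cpp
#include <cstdint>
#include <limits>

// Assumption: mirrors cudf::size_type, a 32-bit signed integer.
using size_type = std::int32_t;
// Assumption: a 64-bit stand-in for the estimate_size_type used in the check.
using estimate_size_type = std::int64_t;

// Hypothetical helper: true when an estimated join output size would pass
// the comparison in the quoted CUDF_EXPECTS.
constexpr bool fits_in_size_type(estimate_size_type estimate)
{
  return estimate < static_cast<estimate_size_type>(std::numeric_limits<size_type>::max());
}

// A small join passes the check.
static_assert(fits_in_size_type(1'000'000), "small estimate should pass");
// The comparison is strict, so an estimate equal to the maximum is rejected.
static_assert(!fits_in_size_type(std::numeric_limits<size_type>::max()),
              "estimate equal to the maximum should fail");
// Any estimate above roughly 2.1 billion rows is rejected.
static_assert(!fits_in_size_type(3'000'000'000LL), "large estimate should fail");
```

The `static_assert` lines are evaluated at compile time, so the snippet checks itself when compiled with C++14 or later.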

Steps/Code to reproduce bug
I was trying to probe the limits of what the new Spark out-of-core join implementation could do, and I hit this when trying to join a lot by a lot: 10 billion rows on the left-hand side and 10 million on the right-hand side. Because of how joins work in Spark, it was a shuffled hash join, so there were 4 separate joins happening. The keys were evenly distributed and there were 100,000 of them, so you could probably divide all the numbers by 4 to get the error. It should not be hard to reproduce.
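As a rough check of the numbers above, the following sketch estimates the output row count. It assumes an inner equi-join, perfectly even key distribution, and a shuffle that sends each key to exactly one of the 4 partitions; none of these details are confirmed beyond what the report states. The function name `join_output_rows` is hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Figures taken from the report.
constexpr std::int64_t left_rows     = 10'000'000'000LL;  // 10 billion
constexpr std::int64_t right_rows    = 10'000'000LL;      // 10 million
constexpr std::int64_t distinct_keys = 100'000LL;
constexpr std::int64_t partitions    = 4;

// Hypothetical helper: with evenly distributed keys, each key produces
// (left rows per key) * (right rows per key) output rows.
constexpr std::int64_t join_output_rows(std::int64_t left, std::int64_t right, std::int64_t keys)
{
  return (left / keys) * (right / keys) * keys;
}

constexpr std::int64_t total_rows = join_output_rows(left_rows, right_rows, distinct_keys);

// Assumption: each partition receives a quarter of the keys and of both inputs.
constexpr std::int64_t rows_per_partition = join_output_rows(
  left_rows / partitions, right_rows / partitions, distinct_keys / partitions);

// Assumption: cudf::size_type is a 32-bit signed integer.
constexpr std::int64_t size_type_max = std::numeric_limits<std::int32_t>::max();

static_assert(total_rows == 1'000'000'000'000LL, "10^12 output rows overall");
static_assert(rows_per_partition == 250'000'000'000LL, "2.5 * 10^11 rows per partition");
static_assert(rows_per_partition > size_type_max, "each partition exceeds the size_type limit");
```

Under these assumptions, each of the 4 joins would produce about 250 billion rows, more than 100 times the 2,147,483,647-row limit enforced by the check.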

Expected behavior
We should be able to produce a gather map that is larger than can fit in a column.

shwina · 2021-04-30T13:58:41Z

Good catch. Looks like we need to use size_t here instead of size_type. Fix coming up.

Closes #8121

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Robert (Bobby) Evans (https://github.com/revans2)

URL: #8139

revans2 added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Apr 30, 2021

revans2 mentioned this issue Apr 30, 2021

Allow batching the output of a join NVIDIA/spark-rapids#2310

Merged

shwina mentioned this issue May 3, 2021

Enable join results with size > INT32_MAX #8139

Merged

harrism added the libcudf (Affects libcudf (C++/CUDA) code.) and Spark (Functionality that helps Spark RAPIDS) labels and removed the Needs Triage (Need team to review and classify) label May 4, 2021

harrism assigned shwina May 4, 2021

rapids-bot bot closed this as completed in #8139 May 5, 2021
