Describe the bug
Hash join gather maps were implemented in #7454 to get around issues when a join explodes, but there is still a limit in place on the maximum result size that can be returned by a join:
cudf/cpp/src/join/hash_join.cuh, lines 152 to 154 at commit cea6c20
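For context, the check in question is roughly of the following shape. This is only a sketch (the helper name and error message are mine, not the actual libcudf source); the point is that the computed join output size gets compared against the maximum value of cudf::size_type, a 32-bit int, so any result over ~2.1 billion rows is rejected:

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>

#include <cstddef>
#include <limits>

// Hypothetical helper illustrating the kind of limit enforced in hash_join.cuh:
// the join output size must fit in a cudf::size_type, i.e. in a single column.
void check_join_output_size(std::size_t join_size)
{
  CUDF_EXPECTS(
    join_size <= static_cast<std::size_t>(std::numeric_limits<cudf::size_type>::max()),
    "Join result exceeds the maximum number of rows a gather map column can hold");
}
```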
Steps/Code to reproduce bug
I was trying to probe the limits of what the new Spark out-of-core join implementation could do, and I hit this when trying to join a lot by a lot: 10 billion rows on the left hand side and 10 million on the right hand side. Because of how joins work in Spark it was a shuffled hash join, so there were 4 separate joins happening. The keys were evenly distributed and there were 100,000 of them, so you could probably divide all of the numbers by 4 and still hit the error; it should not be hard to reproduce (rough math below).
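For reference, here is my back-of-envelope math on why even one of the four partitioned joins blows past the 2^31 - 1 row cap of cudf::size_type; the perfectly even key distribution is an assumption on my part:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  // Repro numbers from above, assuming a perfectly even key distribution.
  constexpr std::int64_t left_rows     = 10'000'000'000;  // 10 billion
  constexpr std::int64_t right_rows    = 10'000'000;      // 10 million
  constexpr std::int64_t distinct_keys = 100'000;

  constexpr std::int64_t left_per_key  = left_rows / distinct_keys;   // 100,000
  constexpr std::int64_t right_per_key = right_rows / distinct_keys;  // 100

  // Each key pairs every matching left row with every matching right row.
  constexpr std::int64_t total_output  = distinct_keys * left_per_key * right_per_key;  // 1e12
  constexpr std::int64_t per_partition = total_output / 4;  // ~2.5e11, still >> 2^31 - 1 (~2.1e9)

  std::printf("total=%lld per_partition=%lld\n",
              static_cast<long long>(total_output),
              static_cast<long long>(per_partition));
}
```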
Expected behavior
We can produce a gather map that is larger than can fit in a column.