GH-34474: [C++] Detect and raise an error if a join will need too much key data #35087
Conversation
…which is not supported by the RowArray. We return an invalid status. We cannot support a join that large without spilling.
This LGTM and makes sense. I agree with the need for a specialized test suite for large joins. After CI passes, I'm fine with this getting merged.
Benchmark runs are scheduled for baseline = 6432a23 and contender = a1d1373. a1d1373 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
The ['Python', 'R'] benchmarks show a high level of regressions.
[PR 35087](apache#35087) introduced an explicit failure in large Acero joins when the key data is larger than 4 GB (solving the problem reported in [issue 34474](apache#34474)). However, I think (though I'm not sure) that this quick fix is too restrictive, because the size check is applied to the total size of the tables being joined rather than to the size of the keys. As a consequence, Acero fails when trying to merge large tables even when the key data is well below 4 GB. This PR modifies the source code so that the check only verifies whether the total size of the _key_ data is below 4 GB.
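To make the distinction concrete, here is a minimal sketch (not the code touched by this PR; the helper names and the buffer traversal are assumptions for illustration) of how an estimated key-data size could be computed from only the key columns, rather than from every column of the table:

```cpp
#include <cstdint>
#include <vector>

#include <arrow/api.h>

// Rough byte count for one column: sum the sizes of all buffers in all of
// its chunks. (Child buffers of nested types are ignored in this sketch.)
int64_t ColumnByteSize(const arrow::ChunkedArray& column) {
  int64_t total = 0;
  for (const auto& chunk : column.chunks()) {
    for (const auto& buffer : chunk->data()->buffers) {
      if (buffer != nullptr) total += buffer->size();
    }
  }
  return total;
}

// The quantity the 4 GB limit should be compared against: only the key
// columns' data, not the whole table.
int64_t KeyDataByteSize(const arrow::Table& table,
                        const std::vector<int>& key_column_indices) {
  int64_t total = 0;
  for (int index : key_column_indices) {
    total += ColumnByteSize(*table.column(index));
  }
  return total;
}
```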
Rationale for this change
This fixes the test in #34474, though there are likely still other bad scenarios with large joins. I've fixed this one because the behavior (producing invalid data) is particularly bad; most of the time, if there is too much data, I'm guessing we probably just crash. Still, a test suite of some kind that stresses large joins would be good to have. Perhaps it could be added if someone finds time to work on join spilling.
What changes are included in this PR?
If the join will require more than 4 GiB of key data, it now returns an invalid status instead of producing invalid data.
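As a minimal illustration of the behavior described above (this is a sketch, not Acero's actual implementation; the helper name, the way the estimate is obtained, and the guess that the limit stems from how the RowArray addresses key data are assumptions), the check amounts to something like:

```cpp
#include <cstdint>

#include <arrow/status.h>

// 4 GiB: assumed here to be the largest key-data size the RowArray can handle.
constexpr int64_t kMaxKeyDataBytes = int64_t{1} << 32;

// Hypothetical guard: fail the join up front with an invalid status
// instead of silently producing invalid results.
arrow::Status CheckKeyDataSize(int64_t estimated_key_bytes) {
  if (estimated_key_bytes > kMaxKeyDataBytes) {
    return arrow::Status::Invalid(
        "Hash join would require ", estimated_key_bytes,
        " bytes of key data, which exceeds the 4 GiB limit; "
        "joins this large are not supported without spilling");
  }
  return arrow::Status::OK();
}
```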
Are these changes tested?
No. I created a unit test, but it requires over 16 GiB of RAM (besides the input data itself (4 GiB), by the time you reach 4 GiB of key data various other join state buffers have also grown), and it took nearly a minute to run. I think investigating and creating a test suite for large joins is probably a standalone effort.
Are there any user-facing changes?
No.