
GH-34474: [C++] Detect and raise an error if a join will need too much key data #35087


@westonpace (Member) commented Apr 12, 2023

Rationale for this change

This fixes the test in #34474, though there are likely still other bad scenarios with large joins. I've fixed this one because the behavior (invalid data) is particularly bad; most of the time, if there is too much data, I'm guessing we simply crash. Still, a test suite of some kind stressing large joins would be good to have. Perhaps it could be added if someone finds time to work on join spilling.

What changes are included in this PR?

If the join would require more than 4GiB of key data, it now returns an invalid status instead of producing invalid data.
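
The guard itself is conceptually small. Below is a minimal sketch of the idea; `CheckKeyDataSize` and its parameters are hypothetical names for illustration, and the actual check in Arrow's hash-join code may be structured differently:

```cpp
#include <cstdint>
#include <limits>

#include "arrow/status.h"

// Hypothetical sketch: before accumulating more key rows, verify that the
// total key data will still fit in the 32-bit offsets used internally,
// and fail cleanly instead of overflowing.
arrow::Status CheckKeyDataSize(int64_t current_bytes, int64_t bytes_to_add) {
  // uint32_t max is ~4GiB, the largest offset a 32-bit index can represent.
  constexpr int64_t kMaxKeyData = std::numeric_limits<uint32_t>::max();
  if (current_bytes + bytes_to_add > kMaxKeyData) {
    return arrow::Status::Invalid(
        "There are more than 4GiB of key data in this join; "
        "join spilling is not yet supported");
  }
  return arrow::Status::OK();
}
```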

Are these changes tested?

No. I created a unit test, but it requires over 16GiB of RAM: besides the input data itself (4GiB), by the time you reach 4GiB of key data, various other join state buffers have also grown. The test also took nearly a minute to run. I think investigating and creating a test suite for large joins is probably a standalone effort.

Are there any user-facing changes?

No.

…which is not supported by the RowArray. We return an invalid status. We cannot support that large of a join without spilling.
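
For context on why unchecked growth produced invalid data rather than an error, here is a self-contained illustration (not Arrow code), assuming key bytes are addressed with 32-bit offsets:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // The first key that starts past the 4GiB boundary gets a logical
  // offset that no longer fits in 32 bits...
  uint64_t logical_offset = (uint64_t{1} << 32) + 128;
  // ...so storing it in a uint32_t silently wraps around, and the row
  // table points at the wrong bytes: wrong join results, no error raised.
  uint32_t stored_offset = static_cast<uint32_t>(logical_offset);
  std::cout << stored_offset << "\n";  // prints 128, not 4294967424
  return 0;
}
```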
@github-actions

⚠️ GitHub issue #34474 has been automatically assigned in GitHub to PR creator.

@zeroshade (Member) left a comment

This LGTM and makes sense. I agree with the need for a specialized test suite for large joins. After CI passes, I'm fine with this getting merged.

The github-actions bot added the awaiting merge label and removed the awaiting committer review label on Apr 12, 2023.
@jorisvandenbossche merged commit a1d1373 into apache:main on Apr 13, 2023.
@ursabot commented Apr 16, 2023

Benchmark runs are scheduled for baseline = 6432a23 and contender = a1d1373. a1d1373 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️7.14% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.77% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a1d1373c ec2-t3-xlarge-us-east-2
[Failed] a1d1373c test-mac-arm
[Finished] a1d1373c ursa-i9-9960x
[Finished] a1d1373c ursa-thinkcentre-m75q
[Finished] 6432a238 ec2-t3-xlarge-us-east-2
[Failed] 6432a238 test-mac-arm
[Finished] 6432a238 ursa-i9-9960x
[Finished] 6432a238 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Apr 16, 2023

['Python', 'R'] benchmarks show a high level of regressions on ursa-i9-9960x.

raulcd pushed a commit that referenced this pull request Apr 17, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
oliviermeslin added a commit to oliviermeslin/arrow that referenced this pull request Sep 13, 2023
[PR 35087](apache#35087) introduced an explicit failure in large joins with Acero when the key data is larger than 4GiB (solving the problem reported in [issue 34474](apache#34474)). However, I think (but I'm not sure) that this quick fix is too restrictive, because the size condition is applied to the total size of the tables being joined rather than to the size of the keys. As a consequence, Acero fails when merging large tables even when the key data is well below 4GiB.

This PR modifies the source code so that the check only verifies whether the total size of the _key data_ is below 4GiB.
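
A minimal sketch of the distinction being drawn, with illustrative names rather than Arrow's actual API: the cap should be computed from the key columns' bytes alone, not from every column that flows through the join.

```cpp
#include <cstdint>

constexpr int64_t kMaxKeyData = int64_t{1} << 32;  // the 4GiB cap

// Too strict (the behavior being reported): counting every column means a
// join with small keys but wide payload rows is rejected unnecessarily.
bool ExceedsCapAllColumns(int64_t bytes_per_row, int64_t num_rows) {
  return bytes_per_row * num_rows > kMaxKeyData;
}

// Proposed: only the bytes belonging to the key columns count.
bool ExceedsCapKeysOnly(int64_t key_bytes_per_row, int64_t num_rows) {
  return key_bytes_per_row * num_rows > kMaxKeyData;
}
```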
Successfully merging this pull request may close these issues:

[Python] Table.join() produces incorrect results for large inputs (#34474)