
GH-34474: [C++] Detect and raise an error if a join will need too much key data #35087


@westonpace (Member) commented Apr 12, 2023

Rationale for this change

This fixes the test in #34474, though there are likely still other bad scenarios with large joins. I've fixed this one because the behavior (invalid data) is particularly bad; most of the time, if there is too much data, I'm guessing we simply crash. Still, a test suite of some kind stressing large joins would be good to have. Perhaps it could be added if someone finds time to work on join spilling.

What changes are included in this PR?

If the join would require more than 4GiB of key data, it now returns an invalid status instead of producing invalid data.
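
The guard itself is conceptually small. Below is a minimal sketch of the idea; `CheckKeyDataSize` and its parameters are hypothetical names for illustration, and the actual check in Arrow's hash-join code may be structured differently:

```cpp
#include <cstdint>
#include <limits>

#include "arrow/status.h"

// Hypothetical sketch: before accumulating more key rows, verify that the
// total key data will still fit in the 32-bit offsets used internally,
// and fail cleanly instead of overflowing.
arrow::Status CheckKeyDataSize(int64_t current_bytes, int64_t bytes_to_add) {
  // uint32_t max is ~4GiB, the largest offset a 32-bit index can represent.
  constexpr int64_t kMaxKeyData = std::numeric_limits<uint32_t>::max();
  if (current_bytes + bytes_to_add > kMaxKeyData) {
    return arrow::Status::Invalid(
        "There are more than 4GiB of key data in this join; "
        "join spilling is not yet supported");
  }
  return arrow::Status::OK();
}
```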

Are these changes tested?

No. I created a unit test, but it requires over 16GiB of RAM: besides the input data itself (4GiB), by the time you reach 4GiB of key data, various other join state buffers have also grown. The test also took nearly a minute to run. I think investigating and creating a test suite for large joins is probably a standalone effort.

Are there any user-facing changes?

No.

…which is not supported by the RowArray. We return an invalid status. We cannot support that large of a join without spilling.
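
For context on why unchecked growth produced invalid data rather than an error, here is a self-contained illustration (not Arrow code), assuming key bytes are addressed with 32-bit offsets:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // The first key that starts past the 4GiB boundary gets a logical
  // offset that no longer fits in 32 bits...
  uint64_t logical_offset = (uint64_t{1} << 32) + 128;
  // ...so storing it in a uint32_t silently wraps around, and the row
  // table points at the wrong bytes: wrong join results, no error raised.
  uint32_t stored_offset = static_cast<uint32_t>(logical_offset);
  std::cout << stored_offset << "\n";  // prints 128, not 4294967424
  return 0;
}
```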
@github-actions

⚠️ GitHub issue #34474 has been automatically assigned in GitHub to PR creator.

@zeroshade (Member) left a comment

This LGTM and makes sense. I agree with the need for a specialized test suite for large joins. After CI passes, I'm fine with this getting merged.

The github-actions bot added the awaiting merge label and removed the awaiting committer review label on Apr 12, 2023.
@jorisvandenbossche merged commit a1d1373 into apache:main on Apr 13, 2023.
@ursabot commented Apr 16, 2023

Benchmark runs are scheduled for baseline = 6432a23 and contender = a1d1373. a1d1373 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️7.14% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.77% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a1d1373c ec2-t3-xlarge-us-east-2
[Failed] a1d1373c test-mac-arm
[Finished] a1d1373c ursa-i9-9960x
[Finished] a1d1373c ursa-thinkcentre-m75q
[Finished] 6432a238 ec2-t3-xlarge-us-east-2
[Failed] 6432a238 test-mac-arm
[Finished] 6432a238 ursa-i9-9960x
[Finished] 6432a238 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Apr 16, 2023

['Python', 'R'] benchmarks show a high level of regressions on ursa-i9-9960x.

raulcd pushed a commit that referenced this pull request Apr 17, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
oliviermeslin added a commit to oliviermeslin/arrow that referenced this pull request Sep 13, 2023
[PR 35087](apache#35087) introduced an explicit failure in large joins with Acero when the key data is larger than 4GiB (solving the problem reported in [issue 34474](apache#34474)). However, I think (but I'm not sure) that this quick fix is too restrictive, because the size condition is applied to the total size of the tables being joined rather than to the size of the keys. As a consequence, Acero fails when merging large tables even when the key data is well below 4GiB.

This PR modifies the source code so that the check only verifies whether the total size of the _key data_ is below 4GiB.
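
A minimal sketch of the distinction being drawn, with illustrative names rather than Arrow's actual API: the cap should be computed from the key columns' bytes alone, not from every column that flows through the join.

```cpp
#include <cstdint>

constexpr int64_t kMaxKeyData = int64_t{1} << 32;  // the 4GiB cap

// Too strict (the behavior being reported): counting every column means a
// join with small keys but wide payload rows is rejected unnecessarily.
bool ExceedsCapAllColumns(int64_t bytes_per_row, int64_t num_rows) {
  return bytes_per_row * num_rows > kMaxKeyData;
}

// Proposed: only the bytes belonging to the key columns count.
bool ExceedsCapKeysOnly(int64_t key_bytes_per_row, int64_t num_rows) {
  return key_bytes_per_row * num_rows > kMaxKeyData;
}
```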
Successfully merging this pull request may close these issues:

[Python] Table.join() produces incorrect results for large inputs (#34474)