[C++] Acero cannot join large tables because of a misspecified test #37655
[PR 35087](apache#35087) introduced an explicit failure in large joins with Acero when key data is larger than 4 GB (solving the problem reported by [issue 34474](apache#34474)). However, I think (but I'm not sure) that this quick fix is too restrictive, because the size condition is applied to the total size of the tables to be joined rather than to the size of the keys. As a consequence, Acero fails when trying to merge large tables even when the size of the key data is well below 4 GB. This PR modifies the source code so that the logical test only verifies whether the total size of the _key variables_ is below 4 GB.
I wonder if it's possible to add some heuristics to improve this even further. Say, you have a column with two or more long strings. Mapping it from …
@vkhodygo: I'm not sure what you mean by "improving this further".
It's the latter. I know how you feel; dealing with TBs of data can be pretty annoying. However, resolving this issue might take some time, whereas many people would benefit from a fix right now. I did have another workaround for some of my data:

This is a very crude version of what the devs suggested, and it seems to be working nicely.
@pitrou @jorisvandenbossche should this be part of 14.0.0?
@raulcd Only if the fix is ready.
@vkhodygo: thanks for your quick reply. I'm not sure we are talking about the same thing. In my opinion there are actually two separate problems:

1. Acero genuinely cannot perform joins when the key data itself exceeds 4 GB (the limitation reported in issue 34474);
2. the test introduced by PR 35087 is applied to the total size of the tables to be joined rather than to the size of the key data, so large joins fail even when the keys are well below 4 GB.

I argue that solving this second problem would be a significant improvement over the current situation (even if the first problem remains), because I suspect that there are many use cases where tables are larger than 4 GB but key data is not.
@pitrou @raulcd @jorisvandenbossche: Is there anything I could do to help? I can try to test this fix using artificial data, would that help?
After some additional tests, I discovered that this bug is actually related to the size of the right table, but insensitive to the size of the left table. So the bug is that the 4 GB key-size test is applied to the size of the complete right table. Here are some additional tests to show this asymmetrical behavior.
Hey all! I'm facing the same issue. Looking forward to seeing this issue fixed. Just wanted to share that, in the meantime, I'm converting the data to …
This issue should be resolved by #43389 so I'm closing it. Feel free to try it and give us your feedback. Thanks.
TL;DR: PR 35087 introduced an explicit failure in large joins with Acero when key data is larger than 4 GB (solving the problem reported by issue 34474). However, I think (but I'm not sure) that this quick fix is too restrictive, because the size condition is applied to the total size of the tables to be joined rather than to the size of the keys. As a consequence, Acero fails when trying to merge large tables, even when the size of the key data is well below 4 GB.
EDIT: It looks like this bug is actually related to the size of the right table, but unrelated to the size of the left table (see here).
In this issue I proceed in four steps. First, I show that the test introduced by PR 35087 erroneously applies to the total size of the data processed in the join, rather than to the size of the key data. Second, I try to find the root cause of this behavior. Third, I discuss whether this is really a bug or the expected behavior of Acero. Fourth, I sketch a solution.
Description of the problem
PR 35087 introduced an explicit failure in large joins with Acero when key data is larger than 4 GB (here). However, I discovered that this error message disappears when I reduce the number of columns in the Arrow Tables I'm trying to merge.
In the following reproducible example, I generate a large Arrow Table and merge it with itself, increasing the number of columns in the right table. The merge works fine for a limited number of columns, then the error message pops up when the number of columns reaches a threshold (8 in my case).
As a consequence, Acero throws an error whenever I try to merge large Arrow Tables, even for tables with key data significantly smaller than 4 GB.
Reproducible example in R (collapsed).
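Since the R example above is collapsed, here is a rough C++ sketch of an equivalent reproduction, assuming a recent Arrow release with the `arrow::acero` API. The helper function, row counts, and cell sizes are my own illustrative choices (not from the original issue), and the exact table size needed to trip the guard will differ from the raw byte counts shown here:

```cpp
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/api.h>

#include <iostream>
#include <string>
#include <vector>

namespace ac = arrow::acero;

// Build a table with one int64 "key" column and `num_payload` string
// payload columns of roughly 1 GB each, so the total table size grows
// with num_payload while the key column stays ~32 MB.
arrow::Result<std::shared_ptr<arrow::Table>> MakeTable(int64_t num_rows,
                                                       int num_payload) {
  arrow::Int64Builder keys;
  ARROW_RETURN_NOT_OK(keys.Reserve(num_rows));
  for (int64_t i = 0; i < num_rows; ++i) keys.UnsafeAppend(i);
  ARROW_ASSIGN_OR_RAISE(auto key_array, keys.Finish());

  const std::string cell(256, 'x');  // 256 bytes per cell
  arrow::StringBuilder strings;
  for (int64_t i = 0; i < num_rows; ++i) {
    ARROW_RETURN_NOT_OK(strings.Append(cell));
  }
  ARROW_ASSIGN_OR_RAISE(auto payload_array, strings.Finish());

  arrow::FieldVector fields{arrow::field("key", arrow::int64())};
  arrow::ArrayVector columns{key_array};
  for (int i = 0; i < num_payload; ++i) {
    fields.push_back(arrow::field("p" + std::to_string(i), arrow::utf8()));
    columns.push_back(payload_array);  // payload columns share one array
  }
  return arrow::Table::Make(arrow::schema(fields), columns, num_rows);
}

// Self-join the table on "key". As num_payload grows, the build side
// crosses 4 GB and the join starts failing with the "more than 2^32
// bytes of key data" error, although the key column itself is tiny.
arrow::Status RunJoin(int64_t num_rows, int num_payload) {
  ARROW_ASSIGN_OR_RAISE(auto table, MakeTable(num_rows, num_payload));
  ac::Declaration left{"table_source", ac::TableSourceNodeOptions{table}};
  ac::Declaration right{"table_source", ac::TableSourceNodeOptions{table}};
  std::vector<arrow::FieldRef> key_refs{arrow::FieldRef("key")};
  ac::HashJoinNodeOptions join_opts{ac::JoinType::INNER, key_refs, key_refs};
  ac::Declaration join{"hashjoin",
                       {std::move(left), std::move(right)},
                       std::move(join_opts)};
  ARROW_ASSIGN_OR_RAISE(auto joined, ac::DeclarationToTable(std::move(join)));
  std::cout << num_payload << " payload column(s): joined "
            << joined->num_rows() << " rows" << std::endl;
  return arrow::Status::OK();
}

int main() {
  for (int num_payload : {1, 2, 4, 8}) {
    arrow::Status st = RunJoin(/*num_rows=*/4000000, num_payload);
    if (!st.ok()) std::cerr << st.ToString() << std::endl;
  }
  return 0;
}
```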
Cause of the problem
I dug into the C++ source code of Acero to understand the problem. Disclaimer: I do not know anything about C++, so my report might be messy from here on.
PR 35087 introduced a logical test in `Status RowArrayMerge::PrepareForMerge()`. This test computes what I understand to be the size of the sources (here). I think the problem comes from the fact that `RowArrayMerge::PrepareForMerge()` is called twice in `SwissTableForJoinBuild::PreparePrtnMerge()`: once for the keys (here) and once for the payload variables (here). My intuition is that, when applied to the payload variables, the logical test actually computes the size of the payload, so more or less the size of the tables to be joined.
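For orientation, my reading of that guard, paraphrased rather than copied from `cpp/src/arrow/acero/swiss_join.cc` (member names are approximate), is roughly:

```cpp
// Paraphrase of the guard added by PR 35087 inside
// RowArrayMerge::PrepareForMerge(); not a verbatim copy of the source.
int64_t num_bytes = 0;
for (const RowArray* source : sources) {
  // Accumulate the total byte size of this source's row-encoded data.
  num_bytes += source->rows_.offsets()[source->rows_.length()];
}
if (num_bytes > std::numeric_limits<uint32_t>::max()) {
  return arrow::Status::Invalid(
      "There are more than 2^32 bytes of key data. Acero cannot "
      "process a join of this magnitude");
}
```

Because `PreparePrtnMerge()` runs this once over the key rows and once over the payload rows, the second run effectively applies the "key data" limit to the payload, i.e. to roughly the whole build-side table.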
EDIT: It looks like this bug is actually related to the size of the right table, but unrelated to the size of the left table (see here).

Is this really a bug?
Given that I don't know how Acero performs joins, I'm not entirely sure whether the 4 GB size limit mentioned by @westonpace in this message applies to the size of the keys or to the size of the tables to be joined. My understanding of the discussion in the issue and of the error message is that the size limit applies to keys, so the behavior I describe should be considered a bug. But maybe I misunderstood, and the size limit applies to table size, in which case the behavior I describe is the expected one.
In other words: what exactly is the limitation of Acero? That it cannot join large tables, or that it cannot join tables with large keys?
Suggested solution
If the behavior I describe is an actual bug, a potential solution could look like this:
- Adding a Boolean argument to `RowArrayMerge::PrepareForMerge()`. If this argument is TRUE, then the logical test (here) would be performed; if FALSE, it would not be performed;
- setting this argument to TRUE in the keys-related call to `RowArrayMerge::PrepareForMerge()` (here);
- setting this argument to FALSE in the payload-related call to `RowArrayMerge::PrepareForMerge()` (here).

With the help of ChatGPT, I opened a PR suggesting an (untested) solution (here).
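For concreteness, here is a rough sketch of the change described in the list above, with a hypothetical argument name (`check_key_data_size`) and a simplified signature (the real function takes more parameters):

```cpp
// Sketch only: add a flag so the 4 GB guard runs for key data but not
// for payload data. Names other than PrepareForMerge/PreparePrtnMerge
// are approximate or invented for illustration.
static arrow::Status PrepareForMerge(RowArray* target,
                                     const std::vector<RowArray*>& sources,
                                     std::vector<int64_t>* first_target_row_id,
                                     bool check_key_data_size) {  // new argument
  int64_t num_bytes = 0;
  for (const RowArray* source : sources) {
    num_bytes += source->rows_.offsets()[source->rows_.length()];
  }
  // Only enforce the 4 GB limit when merging key data.
  if (check_key_data_size &&
      num_bytes > std::numeric_limits<uint32_t>::max()) {
    return arrow::Status::Invalid(
        "There are more than 2^32 bytes of key data. Acero cannot "
        "process a join of this magnitude");
  }
  // ... rest of the merge preparation unchanged ...
  return arrow::Status::OK();
}

// In SwissTableForJoinBuild::PreparePrtnMerge(), the two call sites become:
//   RowArrayMerge::PrepareForMerge(&keys, key_sources, &ids,
//                                  /*check_key_data_size=*/true);
//   RowArrayMerge::PrepareForMerge(&payloads, payload_sources, &ids,
//                                  /*check_key_data_size=*/false);
```

Note that this keeps the genuine 4 GB limit on key data in place; it only stops the guard from firing on the payload size.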
Component(s)
C++