[Python] Table.join() produces incorrect results for large inputs #34474
Comments
Thank you very much for the report!
Hmm, there isn't much that is non-deterministic in a hash-join. So my guess would be that this is some sort of race condition. Perhaps we are scheduling more tasks at the higher size and that is leading to the issue. I was able to come up with a reproducer that runs in under a minute and should be runnable with 32GB of RAM:
This will be a tricky one to get to the bottom of, I think.
Ok, so it seems the non-determinism is from garbage memory and not threading. This code triggers a segmentation fault when run in debug mode. The error is somewhere in the hash-table and that code is pretty complex. That's about as far as I can get today, but I'll try and find a day to really dive into this before the release. This needs to be fixed. For future reference, I'm attaching the stack trace I am getting.
I managed to look into this today. The bad news is that this join isn't supported. There are 9 key columns. The date and int columns are 8 bytes each. The string columns are variable but at least 4 bytes, and average out close enough to 8 bytes that we can just use 8. 72,000,000 rows * 8 bytes * 9 columns ≈ 5GB of key data. We store key data in a structure that we index with uint32_t, which means we can have at most 4GiB of key data. The current behavior is that we trigger an overflow and clobber existing data in our keys array, which is leading to the results you are seeing (incorrect data). I'm working on a fix that will detect this condition and fail the join when it encounters more than 4GiB of key data. My guess is that by implementing hash join spilling (e.g. #13669) we would naturally increase this limit. Until then, the best we can do is fail.
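A quick back-of-the-envelope check of the arithmetic above, using the numbers quoted in the comment:

```python
# Sanity-check the overflow condition described in the comment.
rows = 72_000_000
bytes_per_value = 8   # date/int columns are 8 bytes; strings average ~8
key_columns = 9

total_key_bytes = rows * bytes_per_value * key_columns
uint32_limit = 2**32  # key data is indexed with uint32_t -> 4 GiB addressable

# ~5.2 GB of key data exceeds the 4 GiB index space, so offsets wrap
# around and clobber earlier key data.
assert total_key_bytes > uint32_limit
```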
…h key data (#35087)

### Rationale for this change
This fixes the test in #34474, though there are likely still other bad scenarios with large joins. I've fixed this one since the behavior (invalid data) is particularly bad. Most of the time, if there is too much data, I'm guessing we probably just crash. Still, a test suite of some kind stressing large joins would be good to have. Perhaps this could be added if someone finds time to work on join spilling.

### What changes are included in this PR?
If the join will require more than 4GiB of key data, it should now return an invalid status instead of invalid data.

### Are these changes tested?
No. I created a unit test, but it requires over 16GiB of RAM (besides the input data itself (4GiB), by the time you get to 4GiB of key data there are various other join state buffers that also grow). The test also took nearly a minute to run. I think investigation and creation of a test suite for large joins is probably a standalone effort.

### Are there any user-facing changes?
No.

* Closes: #34474

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
[PR 35087](apache#35087) introduced an explicit failure in large joins with Acero when key data is larger than 4GB (solving the problem reported by [issue 34474](apache#34474)). However, I think (though I'm not sure) that this quick fix is too restrictive, because the size check is applied to the total size of the tables being joined rather than to the size of the keys alone. As a consequence, Acero fails when trying to merge large tables even when the size of the key data is well below 4GB. This PR modifies the source code so that the check only verifies whether the total size of the _key_ data is below 4GB.
Describe the bug, including details regarding any error messages, version, and platform.
Pyarrow's join does not produce the same results as Pandas when the input tables are large. I am observing this in industry data that I am working with, and I have a reproducible example below that mimics this data.
In this example, we have 72 million unique rows in each table with 9 join key columns of various types. The tables are identical except for the 'val' column in the second table.
Pyarrow's join creates null values for 'val' where there should be actual values from the second table. The join performed in Pandas produces the expected result.
I can produce the same result as Pandas by splitting each table into pieces, joining each left piece to each right piece, coalescing 'val', and concatenating the outputs (e.g., `pa.concat_tables([tbl1_a.join(tbl2_a).join(tbl2_b), tbl1_b.join(tbl2_a).join(tbl2_b)])`).
Apologies for the long-running example. The first section that generates the join key data takes about an hour on my machine (AWS r5.24xlarge EC2 instance) with the rest taking about 30 minutes. Around 100GB of memory is necessary to run the code.
Component(s)
Python