[C++][Acero] Incorrect results in inner join #38074
@llama90 Thank you for investigating this and working to solve it. There was a recent conversation on the Arrow developer mailing list that might be relevant (but I'm not sure). Could this explain the incorrect results you are seeing?
@ianmcook Hello. It seems like a different issue from the Swiss join discussion. Even in the smallest case, when `table_1` and `table_2` each have 1025 identical records, the result still shows only 1024 join rows. Only when I increased the number of right-side records to 1537 did I finally get 1025 rows. Based on the information you mentioned, I became curious and ran a test with the record count increased to 100,000.
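For context, the expected behavior at that scale can be sketched with a trivial reference join in plain Python (hypothetical synthetic keys, not the actual Parquet data): joining two tables that share 1025 unique keys should yield 1025 rows, not 1024.

```python
# Reference inner join on synthetic data (a hypothetical stand-in for
# the actual tables): 1025 unique keys present on both sides.
left_keys = [("2023-09-14", i, "foo") for i in range(1025)]
right_keys = list(left_keys)  # identical records on the right side

# A correct inner join on unique keys returns one row per shared key.
right_set = set(right_keys)
matches = [k for k in left_keys if k in right_set]
assert len(matches) == 1025  # the reported bug produced 1024 instead
```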
Furthermore, I tested with the record count increased to 10 million and the Bloom filter disabled, and I confirmed that the results were accurate in that scenario. In this case, the results came out in batches of 32,768 rows each; this may correspond to the morsel size.
In my opinion, it seems clear that there is an issue with the Bloom filter. The cases where accurate results were obtained, whether using the Bloom filter or not, were as follows:
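As an aside on why missing rows implicate the filter: a correctly built Bloom filter may produce false positives (non-matching rows let through, later discarded by the exact join) but never false negatives, so a genuinely matching row that disappears indicates a bug. A toy sketch of that property (not Acero's SIMD implementation):

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit integer."""
    def __init__(self, m=8192, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = ToyBloomFilter()
keys = [f"key-{i}" for i in range(1025)]
for key in keys:
    bf.add(key)

# Every inserted key must pass the filter: no false negatives allowed.
assert all(bf.might_contain(k) for k in keys)
```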
Below is an example of the tables.

table_1

| col_1 | col_2 | col_3 |
|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo |
| 2023-09-14 21:43:18.678917 | 2 | foo |
| 2023-09-14 21:43:18.678917 | 3 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |
table_2

| col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo | bar |
| 2023-09-14 21:43:18.678917 | 2 | foo | bar |
| 2023-09-14 21:43:18.678917 | 3 | foo | bar |
| 2023-09-14 21:43:18.678917 | 4 | foo | bar |
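For these example tables, an inner join on (`col_1`, `col_2`, `col_3`) should produce exactly 3 rows: keys 1, 2, and 3 match, while 4 and the two -1 rows do not. A plain-Python reference join over the rows above:

```python
ts = "2023-09-14 21:43:18.678917"
table_1 = [(ts, 1, "foo"), (ts, 2, "foo"), (ts, 3, "foo"),
           (ts, -1, "foo"), (ts, -1, "foo")]
table_2 = [(ts, 1, "foo", "bar"), (ts, 2, "foo", "bar"),
           (ts, 3, "foo", "bar"), (ts, 4, "foo", "bar")]

# Inner join on (col_1, col_2, col_3): emit one output row per match.
result = [l + (r[3],) for l in table_1 for r in table_2 if l == r[:3]]
assert len(result) == 3  # keys 1, 2, 3 match; 4 and -1 do not
```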
I haven't managed to find the exact issue yet (I'm not too familiar with this code in particular), but this section is fairly suspicious: arrow/cpp/src/arrow/acero/hash_join_node.cc, lines 1131 to 1146 in 3697bcd.

Note that this function is exclusive to the Bloom filter path, and the value of
It seems correct. I suspect there might be an issue with handling the remainder when the input exceeds the mini-batch size, so I'm looking into that part. Thank you for checking!
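The remainder hypothesis can be illustrated with a generic batched loop (a sketch of the general pattern, not Acero's actual code): if the final partial batch is dropped or truncated, an input of 1025 rows with a batch size of 1024 loses exactly one row, matching the observed 1024-row result.

```python
MINI_BATCH_SIZE = 1024  # illustrative; Acero's actual constant may differ

def process_in_batches(n_rows, batch_size=MINI_BATCH_SIZE):
    """Return how many rows a correctly written batched loop touches."""
    processed = 0
    for start in range(0, n_rows, batch_size):
        # min() is the crucial part: the last batch is usually shorter,
        # and forgetting it silently drops the remainder rows.
        length = min(batch_size, n_rows - start)
        processed += length
    return processed

assert process_in_batches(1025) == 1025    # remainder batch of 1 row
assert process_in_batches(32768) == 32768  # exact multiple, no remainder
```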
It seems that the issue has been fixed. I will clean up the code, write unit tests, and aim to submit a PR as soon as possible. Thanks to your review, I was able to reproduce the symptoms at a smaller scale for testing. I appreciate your review once again. @benibus
…types in slice function
…e conversion in uint32_t * int64_t
…e_utf8 and large_binary
…r the slice function with binary type
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the wrong results in inner joins during hash join operations were caused by a problem with how large string and binary types were handled: the `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate sizes correctly based on the data type, for large string and binary.

* Issue raised: #37729

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The random tests for hash join (`TEST(HashJoin, Random)`) were modified to allow creating Large String values in key columns.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Acero might not have a large user base, as it is an experimental feature, but I deemed the issue of incorrect join results critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
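The root cause described in this PR can be sketched generically: slicing a variable-length column's offsets buffer must scale the element index by the width of the offset type, 4 bytes for `string`/`binary` (`int32` offsets) versus 8 bytes for `large_string`/`large_binary` (`int64` offsets). Using the 4-byte stride on a large type reads from the middle of an entry. A Python sketch of the arithmetic (illustrative only, not the actual `Slice` implementation):

```python
import struct

# Offsets for the values ["aa", "bbb", "c", "dddd"], stored with the
# "large" layout: 8 bytes (int64) per offset entry.
offsets = [0, 2, 5, 6, 10]
buf = struct.pack("<5q", *offsets)

def read_offset(buf, index, offset_width):
    """Read offsets[index], assuming `offset_width` bytes per entry."""
    fmt = "<i" if offset_width == 4 else "<q"
    byte_pos = index * offset_width  # the stride must match the type
    return struct.unpack_from(fmt, buf, byte_pos)[0]

# Correct: an 8-byte stride for a large type recovers the real offset.
assert read_offset(buf, 2, 8) == 5
# Buggy: a 4-byte stride lands inside a neighboring int64 entry.
assert read_offset(buf, 2, 4) != 5
```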
Describe the bug, including details regarding any error messages, version, and platform.
Overview
In the previous issue, the user discussed getting incorrect results when performing an inner join.

Although it was well explained in the previous issue, to reiterate: this issue was created because of incorrect results when performing an inner join between two tables, `table_1` and `table_2`. Both tables have columns `col_1`, `col_2`, and `col_3`, and `table_2` has an additional `col_4`. Upon my investigation, it appears that `table_2.parquet` has 7 more records, but for `col_1`, `col_2`, and `col_3` it contains the same values as `table_1`. The number of records in the two tables is 6282 and 6289, respectively.

So, when performing an inner join using `col_1`, `col_2`, and `col_3` as the join keys, the result should be 6282 rows, regardless of the order of the tables.

Reason
To start with the cause: there is an issue with the Bloom filter logic. When testing in C++, if you set the Bloom filter option (`disable_bloom_filter`) to `true`, the join operation is performed without any issues.

Additionally, the findings from further investigation are as follows.
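Independent of the detailed findings, the diagnostic value of that flag can be simulated abstractly: with the filter pre-pass disabled (or working correctly) the join count is right, whereas a filter that wrongly rejects even one genuinely matching key shrinks the result. A toy simulation (not the Acero API; the single rejected key below is hypothetical):

```python
def inner_join_count(left, right, prefilter=None):
    """Count inner-join matches; optionally drop left-side rows that a
    prefilter rejects, mimicking a Bloom filter pre-pass."""
    right_set = set(right)
    if prefilter is not None:
        left = [k for k in left if prefilter(k)]
    return sum(k in right_set for k in left)

left = list(range(6282))                    # 6282 rows
right = left + list(range(9000, 9007))      # 6289 rows, 7 extra keys

# Filter disabled (or correct): all 6282 matches survive.
assert inner_join_count(left, right) == 6282
# A filter with one false negative (hypothetical) loses exactly one row.
buggy = lambda k: k != 1024
assert inner_join_count(left, right, buggy) == 6281
```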
Up to this point, this is what I have gathered about the issue, and I am working hard to fix the bug.
However, I am a novice with Arrow, which is causing this to take longer than expected. Nonetheless, I will continue making efforts to resolve the bug.

Any advice or insights from those who are more experienced would be greatly appreciated, and it would also be great if someone with expertise could tackle this issue first. So, I'm sharing it here.
Component(s)
C++