-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inner joins are incorrect #37729
Comments
Hi all! Following up here -- this is a very significant bug. We've stopped using inner joins (using left joins and drop nulls instead) as inner joins return incorrect rows. Recommend anyone coming across this GitHub issue do the same until this is remediated. |
…and Binary Types in Hash Join (#38147) ### Rationale for this change We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. * Issue raised: #37729 ### What changes are included in this PR? * The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability. * Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`)for the Slice function have been added. * During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values. ### Are these changes tested? Yes ### Are there any user-facing changes? Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug. * Closes: #38074 Authored-by: Hyunseok Seo <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…and Binary Types in Hash Join (#38147) ### Rationale for this change We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. * Issue raised: #37729 ### What changes are included in this PR? * The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability. * Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`)for the Slice function have been added. * During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values. ### Are these changes tested? Yes ### Are there any user-facing changes? Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug. * Closes: #38074 Authored-by: Hyunseok Seo <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…tring and Binary Types in Hash Join (apache#38147) ### Rationale for this change We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. * Issue raised: apache#37729 ### What changes are included in this PR? * The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability. * Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`)for the Slice function have been added. * During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values. ### Are these changes tested? Yes ### Are there any user-facing changes? Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug. * Closes: apache#38074 Authored-by: Hyunseok Seo <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…tring and Binary Types in Hash Join (apache#38147) ### Rationale for this change We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. * Issue raised: apache#37729 ### What changes are included in this PR? * The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability. * Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`)for the Slice function have been added. * During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values. ### Are these changes tested? Yes ### Are there any user-facing changes? Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug. * Closes: apache#38074 Authored-by: Hyunseok Seo <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
I tried the code locally on dev ( In [1]: import pyarrow.parquet as pq
...:
...: table_1 = pq.read_table("python/table_1.parquet")
...: table_2 = pq.read_table("python/table_2.parquet")
...:
...: inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
...: print(len(inner_joined))
...:
...: inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
...: print(len(inner_joined))
...:
...: outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
...: print(len(outer_joined))
...:
...: print(outer_joined.column("col_4").null_count)
6289
6289
6289
0 but when running the same code on pyarrow version >>> import pyarrow as pa
>>> pa.__version__
'13.0.0'
>>> import pyarrow.parquet as pq
>>> table_1 = pq.read_table("../repos/arrow/python/table_1.parquet")
>>> table_2 = pq.read_table("../repos/arrow/python/table_2.parquet")
>>> inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
>>> print(len(inner_joined))
3596
>>> inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
>>> print(len(inner_joined))
3601
>>> outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
>>> print(len(outer_joined))
6289
>>> print(outer_joined.column("col_4").null_count)
0 I will try to see which PR fixed this issue. |
Missed the issue and the PR already linked in this report. The fix for the bug is:
and was added in the 14.0.0 release. |
…tring and Binary Types in Hash Join (apache#38147) ### Rationale for this change We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. * Issue raised: apache#37729 ### What changes are included in this PR? * The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability. * Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`)for the Slice function have been added. * During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values. ### Are these changes tested? Yes ### Are there any user-facing changes? Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug. * Closes: apache#38074 Authored-by: Hyunseok Seo <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
I am running into an issue where inner joins between two tables are incorrectly computed. To reproduce this issue, here is a zip file containing two example tables (
table_1.parquet
andtable_2.parquet
). I have encountered this problem on pyarrow 10.0.0 - 13.0.0:The schema for
table_1
is:And for
table_2
:Below is an example to reproduce the issue I am running in to with inner joins:
I believe this is a bug but please let me know if I am missing anything.
Component(s)
Python
The text was updated successfully, but these errors were encountered: