
Inner joins are incorrect #37729

Closed
ravirahman opened this issue Sep 15, 2023 · 3 comments

Comments

@ravirahman

Describe the bug, including details regarding any error messages, version, and platform.

I am running into an issue where inner joins between two tables are computed incorrectly. To reproduce the issue, here is a zip file containing two example tables (table_1.parquet and table_2.parquet). I have encountered this problem on pyarrow 10.0.0 through 13.0.0:

The schema for table_1 is:

col_1: timestamp[us, tz=UTC]  # All values are the same timestamp ("2023-09-14 21:43:18.678917")
col_2: large_string  
col_3: large_string  # All values are the same ("foo")

And for table_2:

col_1: timestamp[us, tz=UTC]  # All values are the same timestamp ("2023-09-14 21:43:18.678917")
col_2: large_string
col_3: large_string  # All values are the same ("foo")
col_4: large_string  # All values are the same ("bar")

Below is an example that reproduces the issue I am running into with inner joins:

import pyarrow.parquet as pq

table_1 = pq.read_table("table_1.parquet")
table_2 = pq.read_table("table_2.parquet")

inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
print(len(inner_joined))  # prints 3596 (incorrect)

inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
print(len(inner_joined))  # prints 3601 (incorrect); "A inner join B" should produce the same rows as "B inner join A"

outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
print(len(outer_joined))  # prints 6289 (correct)

# Table 2 is a superset of table 1. Thus, the null count for `col_4` when doing a left join should be zero
print(outer_joined.column("col_4").null_count)  # prints 0 (correct)

I believe this is a bug but please let me know if I am missing anything.

Component(s)

Python

@emarx

emarx commented Sep 28, 2023

Hi all! Following up here -- this is a very significant bug. We've stopped using inner joins (using left joins and dropping nulls instead) because inner joins return incorrect rows. I recommend that anyone coming across this GitHub issue do the same until it is fixed.
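
A minimal sketch of that workaround, using the tables and key columns from this report (the choice of `col_4` as the column to null-check is an assumption based on the schemas above; adjust it for your own data):

import pyarrow.compute as pc
import pyarrow.parquet as pq

table_1 = pq.read_table("table_1.parquet")
table_2 = pq.read_table("table_2.parquet")
keys = ["col_1", "col_2", "col_3"]

# Left outer join keeps every row of table_1; unmatched rows get nulls
# in the right-side payload column (col_4).
left_joined = table_1.join(table_2, keys, join_type="left outer")

# Keep only the matched rows, emulating the inner join.
inner_like = left_joined.filter(pc.is_valid(left_joined["col_4"]))
print(len(inner_like))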

pitrou pushed a commit that referenced this issue Oct 16, 2023
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the incorrect inner join results during hash join operations were caused by how large string and large binary types were handled: the `Slice` function was not calculating their sizes correctly.

To fix this, I changed the `Slice` function to calculate the sizes correctly based on the data type for large string and large binary.

* Issue raised: #37729 
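
For background, the regular and large layouts differ in offset width: `string`/`binary` arrays store 32-bit offsets while `large_string`/`large_binary` store 64-bit offsets, so size arithmetic has to account for the concrete offset width. A short pyarrow sketch of that difference (pyarrow is used here only for illustration; the PR itself changes the C++ `Slice` code):

import pyarrow as pa

# string arrays store 32-bit offsets; large_string arrays store 64-bit
# offsets, so the offsets buffer of the large variant is twice as large
# per element.
s = pa.array(["foo", "bar", "baz"], type=pa.string())
ls = pa.array(["foo", "bar", "baz"], type=pa.large_string())

# buffers() returns [validity, offsets, data] for variable-size binary arrays.
print(s.buffers()[1].size, ls.buffers()[1].size)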

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The random hash join test (`TEST(HashJoin, Random)`) was modified so that Large String key column values can also be generated.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this issue Oct 16, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
@AlenkaF
Member

AlenkaF commented Dec 13, 2023

I tried the code locally on dev (15.0.0.dev) and wasn't able to reproduce:

In [1]: import pyarrow.parquet as pq
   ...: 
   ...: table_1 = pq.read_table("python/table_1.parquet")
   ...: table_2 = pq.read_table("python/table_2.parquet")
   ...: 
   ...: inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
   ...: print(len(inner_joined))
   ...: 
   ...: inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
   ...: print(len(inner_joined))
   ...: 
   ...: outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
   ...: print(len(outer_joined))
   ...: 
   ...: print(outer_joined.column("col_4").null_count)

6289
6289
6289
0

but when running the same code on pyarrow version 13.0.0 I get the reported behaviour:

>>> import pyarrow as pa
>>> pa.__version__
'13.0.0'

>>> import pyarrow.parquet as pq
>>> table_1 = pq.read_table("../repos/arrow/python/table_1.parquet")
>>> table_2 = pq.read_table("../repos/arrow/python/table_2.parquet")

>>> inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
>>> print(len(inner_joined))
3596

>>> inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
>>> print(len(inner_joined))
3601

>>> outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
>>> print(len(outer_joined))
6289

>>> print(outer_joined.column("col_4").null_count)
0

I will try to see which PR fixed this issue.

@AlenkaF
Member

AlenkaF commented Dec 18, 2023

I missed the issue and the PR already linked in this report. The fix for the bug is the PR referenced above (#38147), and it was added in the 14.0.0 release.
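
For anyone checking whether an installed pyarrow is affected, a quick sketch (the simple major-version comparison is an assumption; it ignores pre-release suffixes):

import pyarrow as pa

# The fix shipped in pyarrow 14.0.0; earlier releases need an upgrade
# or the left-join workaround described above.
if int(pa.__version__.split(".")[0]) < 14:
    print("Affected version:", pa.__version__)
else:
    print("pyarrow", pa.__version__, "includes the fix.")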

@AlenkaF AlenkaF closed this as completed Dec 18, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024