
Inner joins are incorrect #37729

Closed
ravirahman opened this issue Sep 15, 2023 · 3 comments

Comments

@ravirahman

Describe the bug, including details regarding any error messages, version, and platform.

I am running into an issue where inner joins between two tables are computed incorrectly. To reproduce the issue, here is a zip file containing two example tables (table_1.parquet and table_2.parquet). I have encountered this problem on pyarrow 10.0.0 through 13.0.0:

The schema for table_1 is:

col_1: timestamp[us, tz=UTC]  # All values are the same timestamp ("2023-09-14 21:43:18.678917")
col_2: large_string  
col_3: large_string  # All values are the same ("foo")

And for table_2:

col_1: timestamp[us, tz=UTC]  # All values are the same timestamp ("2023-09-14 21:43:18.678917")
col_2: large_string
col_3: large_string  # All values are the same ("foo")
col_4: large_string  # All values are the same ("bar")

Below is an example that reproduces the issue I am running into with inner joins:

import pyarrow.parquet as pq

table_1 = pq.read_table("table_1.parquet")
table_2 = pq.read_table("table_2.parquet")

inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
print(len(inner_joined))  # prints 3596 (incorrect)

inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
print(len(inner_joined))  # prints 3601 (incorrect); "A inner join B" should produce the same rows as "B inner join A"

outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
print(len(outer_joined))  # prints 6289 (correct)

# Table 2 is a superset of table 1. Thus, the null count for `col_4` when doing a left join should be zero
print(outer_joined.column("col_4").null_count)  # prints 0 (correct)

I believe this is a bug but please let me know if I am missing anything.

Component(s)

Python

@emarx

emarx commented Sep 28, 2023

Hi all! Following up here -- this is a very significant bug. We've stopped using inner joins (using left joins and dropping nulls instead) because inner joins return incorrect rows. I recommend that anyone coming across this GitHub issue do the same until it is fixed.
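
A minimal sketch of that workaround, using the tables and key columns from this report (the choice of `col_4` as the column to null-check is an assumption based on the schemas above; adjust it for your own data):

import pyarrow.compute as pc
import pyarrow.parquet as pq

table_1 = pq.read_table("table_1.parquet")
table_2 = pq.read_table("table_2.parquet")
keys = ["col_1", "col_2", "col_3"]

# Left outer join keeps every row of table_1; unmatched rows get nulls
# in the right-side payload column (col_4).
left_joined = table_1.join(table_2, keys, join_type="left outer")

# Keep only the matched rows, emulating the inner join.
inner_like = left_joined.filter(pc.is_valid(left_joined["col_4"]))
print(len(inner_like))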

pitrou pushed a commit that referenced this issue Oct 16, 2023
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the incorrect inner join results during hash join operations were caused by how large string and large binary types were handled: the `Slice` function was not calculating their sizes correctly.

To fix this, I changed the `Slice` function to calculate the sizes correctly based on the data type for large string and large binary.

* Issue raised: #37729 
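
For background, the regular and large layouts differ in offset width: `string`/`binary` arrays store 32-bit offsets while `large_string`/`large_binary` store 64-bit offsets, so size arithmetic has to account for the concrete offset width. A short pyarrow sketch of that difference (pyarrow is used here only for illustration; the PR itself changes the C++ `Slice` code):

import pyarrow as pa

# string arrays store 32-bit offsets; large_string arrays store 64-bit
# offsets, so the offsets buffer of the large variant is twice as large
# per element.
s = pa.array(["foo", "bar", "baz"], type=pa.string())
ls = pa.array(["foo", "bar", "baz"], type=pa.large_string())

# buffers() returns [validity, offsets, data] for variable-size binary arrays.
print(s.buffers()[1].size, ls.buffers()[1].size)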

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The random hash join test (`TEST(HashJoin, Random)`) was modified so that Large String key column values can also be generated.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this issue Oct 16, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
@AlenkaF
Member

AlenkaF commented Dec 13, 2023

I tried the code locally on dev (15.0.0.dev) and wasn't able to reproduce:

In [1]: import pyarrow.parquet as pq
   ...: 
   ...: table_1 = pq.read_table("python/table_1.parquet")
   ...: table_2 = pq.read_table("python/table_2.parquet")
   ...: 
   ...: inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
   ...: print(len(inner_joined))
   ...: 
   ...: inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
   ...: print(len(inner_joined))
   ...: 
   ...: outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
   ...: print(len(outer_joined))
   ...: 
   ...: print(outer_joined.column("col_4").null_count)

6289
6289
6289
0

but when running the same code on pyarrow version 13.0.0 I get the reported behaviour:

>>> import pyarrow as pa
>>> pa.__version__
'13.0.0'

>>> import pyarrow.parquet as pq
>>> table_1 = pq.read_table("../repos/arrow/python/table_1.parquet")
>>> table_2 = pq.read_table("../repos/arrow/python/table_2.parquet")

>>> inner_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="inner",)
>>> print(len(inner_joined))
3596

>>> inner_joined = table_2.join(table_1, ["col_1", "col_2", "col_3"], join_type="inner", use_threads=False)
>>> print(len(inner_joined))
3601

>>> outer_joined = table_1.join(table_2, ["col_1", "col_2", "col_3"], join_type="left outer")
>>> print(len(outer_joined))
6289

>>> print(outer_joined.column("col_4").null_count)
0

I will try to see which PR fixed this issue.

@AlenkaF
Member

AlenkaF commented Dec 18, 2023

I missed the issue and the PR already linked in this report. The fix for the bug is the PR referenced above (#38147), and it was added in the 14.0.0 release.
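
For anyone checking whether an installed pyarrow is affected, a quick sketch (the simple major-version comparison is an assumption; it ignores pre-release suffixes):

import pyarrow as pa

# The fix shipped in pyarrow 14.0.0; earlier releases need an upgrade
# or the left-join workaround described above.
if int(pa.__version__.split(".")[0]) < 14:
    print("Affected version:", pa.__version__)
else:
    print("pyarrow", pa.__version__, "includes the fix.")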

@AlenkaF AlenkaF closed this as completed Dec 18, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024