
[C++][Acero] Incorrect results in inner join #38074

Closed
llama90 opened this issue Oct 6, 2023 · 5 comments · Fixed by #38147
Labels: Component: C++ · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Priority: Blocker (marks a blocker for the release) · Type: bug

Comments

llama90 (Contributor) commented Oct 6, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Before explaining: if re-filing a previously submitted issue like this is considered a mistake, I apologize.

Overview

In the previous issue, a user reported incorrect results when performing an inner join.

Although it was well explained there, to reiterate: the user was getting incorrect results when performing an inner join between two tables, table_1 and table_2.

Both tables have columns col_1, col_2, and col_3, and table_2 has an additional col_4. Upon investigation, it appears that table_2.parquet has 7 more records, but for col_1, col_2, and col_3 it contains the same values as table_1.

The tables contain 6282 and 6289 records, respectively.

So, when performing an inner join using col_1, col_2, and col_3 as the join keys, the result should contain 6282 rows, regardless of the order of the tables.

Reason

To start with the cause: there is an issue in the Bloom filter logic.

When testing in C++, if you set the Bloom filter option (disable_bloom_filter) to true, the join is performed without any issues.

// with bloom_filter (default)
HashJoinNodeOptions join_opts{
    JoinType::INNER, 
    {"col_1", "col_2", "col_3"}, 
    {"col_1", "col_2", "col_3"}, 
    literal(true), 
    "_l", 
    "_r"
};

// without bloom_filter
HashJoinNodeOptions join_opts{
    JoinType::INNER, 
    {"col_1", "col_2", "col_3"}, 
    {"col_1", "col_2", "col_3"}, 
    literal(true), 
    "_l", 
    "_r",
    true // disable_bloom_filter
};
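
For reference, here is a minimal sketch of how such a join can be executed end to end with Acero, assuming `left_table` and `right_table` already hold the data from the two parquet files (the helper name `RunInnerJoin` is illustrative, not part of the original report):

```cpp
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/table.h>

#include <memory>

using arrow::acero::Declaration;
using arrow::acero::HashJoinNodeOptions;
using arrow::acero::TableSourceNodeOptions;

// Wires two in-memory tables into a "hashjoin" node and collects the result.
arrow::Result<std::shared_ptr<arrow::Table>> RunInnerJoin(
    std::shared_ptr<arrow::Table> left_table,
    std::shared_ptr<arrow::Table> right_table, HashJoinNodeOptions join_opts) {
  Declaration left{"table_source", TableSourceNodeOptions{std::move(left_table)}};
  Declaration right{"table_source", TableSourceNodeOptions{std::move(right_table)}};
  Declaration join{"hashjoin", {std::move(left), std::move(right)},
                   std::move(join_opts)};
  // With the default options (Bloom filter enabled) the result has fewer rows
  // than expected; with disable_bloom_filter = true it is correct.
  return arrow::acero::DeclarationToTable(std::move(join));
}
```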

Additionally, the findings from further investigation are as follows.

  1. The number of matching records between the left and right tables must exceed 1024.
  2. Errors occur when the number of matching records is close to the number of records in the right table.
  3. The number of matching records does not appear to be correlated with the number of records in the left table.

That is what I have gathered about the issue so far, and I am working hard to fix the bug.

However, I am new to Arrow, which is causing this to take longer than expected. Nonetheless, I will continue working to resolve the bug.

Any advice or insights from those with more experience would be greatly appreciated, and it would also be great if someone with expertise could tackle this issue first. That is why I am sharing it here.

Component(s)

C++

ianmcook (Member) commented Oct 6, 2023

@llama90 thank you for investigating this and working to solve it.

There was a recent conversation on the Arrow developer mailing list that might be relevant (but I'm not sure):
https://lists.apache.org/thread/pq10zd4sr494hd09t93o3hb6zswn5wll
When batch sizes are too large, silent overflows can occur, producing incorrect hash join results.

Could this explain the incorrect results you are seeing?

llama90 (Contributor, Author) commented Oct 6, 2023

@ianmcook Hello.

It seems to be a different issue from the Swiss join one. Even in the smallest case, where both table_1 and table_2 have 1025 identical records each, the result still shows only 1024 joined rows. When I increase the number of right-side records to 1537, I finally get 1025 rows.

Based on the information you mentioned, I became curious and ran a test with the record count increased to 100,000.

  • Left: 100,000
  • Right: 100,000
  • Match: 100,000
    • without BloomFilter: 100,000
    • with BloomFilter: 84,334

Furthermore, I tested increasing the record count to 10 million without BloomFilter, and I confirmed that the results were accurate in this scenario.

In this case, I could see the results coming out in batches of 32,768 rows each. Perhaps that is the morsel unit.

  • Left: 32,768
  • Right: 32,768
  • Match: 32,768
    • without BloomFilter: 32,768
    • with BloomFilter: 17,010

In my opinion, it seems clear that there is an issue with the Bloom filter. The cases where accurate results were obtained, whether using the Bloom filter or not, were as follows:

  • When the number of matching records is 1024 or less, regardless of the number of records in the left and right tables.
  • When the number of matching records exceeds 1024, but the number of records in the right table is greater than the number of matching records.
    • For example, R = 5000, M = 5000, L = 10000 (in this case, reducing L to 9000 records yields 4904 output rows).

Below is an example of the tables.

table_1 (left)

  • schema ({timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8()})
    • col_1: All values are 2023-09-14 21:43:18.678917
    • col_2: 1, 2, 3, 4, 5 ... N, -1, -1, -1, -1, -1, ... -1
    • col_3: All values are foo
  • N is the number of matching records between table 1 and table 2

table_2 (right)

  • schema ({timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8(), large_utf8()})
    • col_1: All values are 2023-09-14 21:43:18.678917
    • col_2: 1, 2, 3, 4, 5 ... M
    • col_3: All values are foo
    • col_4: All values are bar
  • M is the number of records for table 2

Example

  • The number of records for table 1: 5
  • N: 3
  • M: 4

table 1

| col_1 | col_2 | col_3 |
|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo |
| 2023-09-14 21:43:18.678917 | 2 | foo |
| 2023-09-14 21:43:18.678917 | 3 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |

table 2

| col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo | bar |
| 2023-09-14 21:43:18.678917 | 2 | foo | bar |
| 2023-09-14 21:43:18.678917 | 3 | foo | bar |
| 2023-09-14 21:43:18.678917 | 4 | foo | bar |
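
To make this setup concrete, here is a minimal sketch of building table 1 with Arrow builders (the helper name `MakeLeftTable` and the epoch-microsecond constant are my own; table 2 is analogous, with the extra all-"bar" `col_4`):

```cpp
#include <arrow/api.h>

#include <memory>
#include <string>

// Builds table_1 as described above: N counting rows followed by "-1" rows.
arrow::Result<std::shared_ptr<arrow::Table>> MakeLeftTable(int64_t num_rows,
                                                           int64_t n_match) {
  auto ts_type = arrow::timestamp(arrow::TimeUnit::MICRO, "UTC");
  auto schema = arrow::schema({arrow::field("col_1", ts_type),
                               arrow::field("col_2", arrow::large_utf8()),
                               arrow::field("col_3", arrow::large_utf8())});

  arrow::TimestampBuilder ts_builder(ts_type, arrow::default_memory_pool());
  arrow::LargeStringBuilder col2_builder;
  arrow::LargeStringBuilder col3_builder;
  // 2023-09-14 21:43:18.678917 UTC as microseconds since the Unix epoch.
  constexpr int64_t kTimestamp = 1694727798678917;
  for (int64_t i = 0; i < num_rows; ++i) {
    ARROW_RETURN_NOT_OK(ts_builder.Append(kTimestamp));
    // First n_match rows count up from 1 (matching table_2); the rest are -1.
    ARROW_RETURN_NOT_OK(
        col2_builder.Append(i < n_match ? std::to_string(i + 1) : "-1"));
    ARROW_RETURN_NOT_OK(col3_builder.Append("foo"));
  }
  ARROW_ASSIGN_OR_RAISE(auto col1, ts_builder.Finish());
  ARROW_ASSIGN_OR_RAISE(auto col2, col2_builder.Finish());
  ARROW_ASSIGN_OR_RAISE(auto col3, col3_builder.Finish());
  return arrow::Table::Make(schema, {col1, col2, col3});
}
```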

benibus (Collaborator) commented Oct 6, 2023

I haven't managed to find the exact issue yet (I'm not too familiar with this code in particular), but this section is fairly suspicious:

arrow::util::TempVectorHolder<uint32_t> hash_holder(
    stack, arrow::util::MiniBatch::kMiniBatchLength);
uint32_t* hashes = hash_holder.mutable_data();
for (int64_t i = 0; i < key_batch.length;
     i += arrow::util::MiniBatch::kMiniBatchLength) {
  int64_t length =
      std::min(static_cast<int64_t>(key_batch.length - i),
               static_cast<int64_t>(arrow::util::MiniBatch::kMiniBatchLength));
  std::vector<KeyColumnArray> temp_column_arrays;
  RETURN_NOT_OK(Hashing32::HashBatch(key_batch, hashes, temp_column_arrays,
                                     ctx_->cpu_info()->hardware_flags(), stack,
                                     i, length));
  RETURN_NOT_OK(build_.builder_->PushNextBatch(thread_index, length, hashes));
}
return Status::OK();

Note that this function is exclusive to the bloom filter path, and the value of kMiniBatchLength happens to be 1024...
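
To spell out why that constant lines up with the 1024-row boundary observed above (an illustration of the loop's partitioning, not Arrow code): `HashBatch` is called once per mini-batch, with a nonzero start row from the second iteration onward, so a bug in slicing at a nonzero start row would leave the first 1024 rows intact and corrupt only the rest.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustration only: how the loop above partitions a 1025-row batch.
int main() {
  constexpr int64_t kMiniBatchLength = 1024;
  const int64_t batch_length = 1025;
  for (int64_t i = 0; i < batch_length; i += kMiniBatchLength) {
    const int64_t length = std::min(batch_length - i, kMiniBatchLength);
    // Prints "rows [0, 1024)" then "rows [1024, 1025)"; only the second
    // mini-batch is hashed with a nonzero start row.
    std::printf("rows [%lld, %lld)\n", static_cast<long long>(i),
                static_cast<long long>(i + length));
  }
  return 0;
}
```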

llama90 (Contributor, Author) commented Oct 7, 2023

That seems right.

I've noticed that when I arbitrarily increase the kMiniBatchLength value to be larger than the number of matching records, the results come out fine.

I suspect there might be an issue with handling the remainder when the input exceeds the mini-batch size, so I'm looking into that part.

Thank you for checking!

llama90 (Contributor, Author) commented Oct 8, 2023

It seems that I have fixed the issue.

I will clean up the code, write unit tests, and aim to submit a PR as soon as possible. Thanks to your review, I was able to reproduce the symptom at a smaller scale for testing. I appreciate your review once again. @benibus

llama90 added a commit to llama90/arrow that referenced this issue Oct 9, 2023
llama90 added a commit to llama90/arrow that referenced this issue Oct 10, 2023
ianmcook added this to the 14.0.0 milestone Oct 10, 2023
ianmcook added the Priority: Blocker label Oct 10, 2023
llama90 added a commit to llama90/arrow that referenced this issue Oct 11, 2023
pitrou pushed a commit that referenced this issue Oct 16, 2023
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the wrong results in inner joins during hash join operations were caused by a problem with how large string and binary types were handled: the `Slice` function was not calculating their sizes correctly.

To fix this, I changed the `Slice` function to calculate the sizes correctly based on the data type, for large string and large binary.
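
As a conceptual sketch of the problem (illustrative code, not the actual Arrow internals): slicing the offsets buffer of a variable-length column must use the offset width of the type, 4 bytes for `utf8`/`binary` but 8 bytes for `large_utf8`/`large_binary`, so a hard-coded 4-byte stride points every slice with a nonzero start row at the wrong offsets.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: byte position of row `start_row` in an offsets buffer.
// Using sizeof(int32_t) for a large type would land in the middle of an
// 8-byte offset for every start_row > 0 (i.e. every mini-batch after the
// first), which matches the symptom described in the issue.
const uint8_t* SliceOffsets(const uint8_t* offsets_buffer, int64_t start_row,
                            bool is_large_type) {
  const size_t offset_width = is_large_type ? sizeof(int64_t) : sizeof(int32_t);
  return offsets_buffer + start_row * offset_width;
}
```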

* Issue raised: #37729 

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The Hash Join random test (`TEST(HashJoin, Random)`) was modified to allow the creation of Large String key column values.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Acero might not have a large user base, as it is an experimental feature, but I deemed incorrect join results critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
pitrou modified the milestones: 14.0.0, 15.0.0 Oct 16, 2023
raulcd modified the milestones: 15.0.0, 14.0.0 Oct 16, 2023
raulcd pushed a commit that referenced this issue Oct 16, 2023
kou changed the title from [C++] [Acero] Incorrect results in inner join to [C++][Acero] Incorrect results in inner join Oct 17, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
amoeba added the Critical Fix label Nov 17, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024