
[C++][Acero] Incorrect results in inner join #38074

Closed
llama90 opened this issue Oct 6, 2023 · 5 comments · Fixed by #38147
Labels: Component: C++ · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Priority: Blocker (marks a blocker for the release) · Type: bug

Comments

llama90 (Contributor) commented Oct 6, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Before explaining: if re-filing a previously submitted issue like this is considered a mistake, I apologize.

Overview

In the previous issue, a user reported incorrect results when performing an inner join.

Although it was well explained there, to reiterate: the user was getting incorrect results when performing an inner join between two tables, table_1 and table_2.

Both tables have columns col_1, col_2, and col_3, and table_2 has an additional col_4. Upon investigation, it appears that table_2.parquet has 7 more records, but for col_1, col_2, and col_3 it contains the same values as table_1.

The tables contain 6282 and 6289 records, respectively.

So, when performing an inner join using col_1, col_2, and col_3 as the join keys, the result should contain 6282 rows, regardless of the order of the tables.

Reason

To start with the cause: there is an issue in the Bloom filter logic.

When testing in C++, if you set the Bloom filter option (disable_bloom_filter) to true, the join is performed without any issues.

// with bloom_filter (default)
HashJoinNodeOptions join_opts{
    JoinType::INNER, 
    {"col_1", "col_2", "col_3"}, 
    {"col_1", "col_2", "col_3"}, 
    literal(true), 
    "_l", 
    "_r"
};

// without bloom_filter
HashJoinNodeOptions join_opts{
    JoinType::INNER, 
    {"col_1", "col_2", "col_3"}, 
    {"col_1", "col_2", "col_3"}, 
    literal(true), 
    "_l", 
    "_r",
    true // disable_bloom_filter
};
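
For reference, here is a minimal sketch of how such a join can be executed end to end with Acero, assuming `left_table` and `right_table` already hold the data from the two parquet files (the helper name `RunInnerJoin` is illustrative, not part of the original report):

```cpp
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/table.h>

#include <memory>

using arrow::acero::Declaration;
using arrow::acero::HashJoinNodeOptions;
using arrow::acero::TableSourceNodeOptions;

// Wires two in-memory tables into a "hashjoin" node and collects the result.
arrow::Result<std::shared_ptr<arrow::Table>> RunInnerJoin(
    std::shared_ptr<arrow::Table> left_table,
    std::shared_ptr<arrow::Table> right_table, HashJoinNodeOptions join_opts) {
  Declaration left{"table_source", TableSourceNodeOptions{std::move(left_table)}};
  Declaration right{"table_source", TableSourceNodeOptions{std::move(right_table)}};
  Declaration join{"hashjoin", {std::move(left), std::move(right)},
                   std::move(join_opts)};
  // With the default options (Bloom filter enabled) the result has fewer rows
  // than expected; with disable_bloom_filter = true it is correct.
  return arrow::acero::DeclarationToTable(std::move(join));
}
```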

Additionally, the findings from further investigation are as follows.

  1. The number of matching records between the left and right tables must exceed 1024.
  2. Errors occur when the number of matching records is close to the number of records in the right table.
  3. The number of matching records does not appear to be correlated with the number of records in the left table.

That is what I have gathered about the issue so far, and I am working hard to fix the bug.

However, I am new to Arrow, which is causing this to take longer than expected. Nonetheless, I will continue working to resolve the bug.

Any advice or insights from those with more experience would be greatly appreciated, and it would also be great if someone with expertise could tackle this issue first. That is why I am sharing it here.

Component(s)

C++

ianmcook (Member) commented Oct 6, 2023

@llama90 thank you for investigating this and working to solve it.

There was a recent conversation on the Arrow developer mailing list that might be relevant (but I'm not sure):
https://lists.apache.org/thread/pq10zd4sr494hd09t93o3hb6zswn5wll
When batch sizes are too large, silent overflows can occur, producing incorrect hash join results.

Could this explain the incorrect results you are seeing?

llama90 (Contributor, Author) commented Oct 6, 2023

@ianmcook Hello.

It seems to be a different issue from the Swiss join one. Even in the smallest case, where both table_1 and table_2 have 1025 identical records each, the result still shows only 1024 joined rows. When I increase the number of right-side records to 1537, I finally get 1025 rows.

Based on the information you mentioned, I became curious and ran a test with the record count increased to 100,000.

  • Left: 100,000
  • Right: 100,000
  • Match: 100,000
    • without BloomFilter: 100,000
    • with BloomFilter: 84,334

Furthermore, I tested increasing the record count to 10 million without BloomFilter, and I confirmed that the results were accurate in this scenario.

In this case, I could see the results coming out in batches of 32,768 rows each. Perhaps that is the morsel unit.

  • Left: 32,768
  • Right: 32,768
  • Match: 32,768
    • without BloomFilter: 32,768
    • with BloomFilter: 17,010

In my opinion, it seems clear that there is an issue with the Bloom filter. The cases where accurate results were obtained, whether using the Bloom filter or not, were as follows:

  • When the number of matching records is 1024 or less, regardless of the number of records in the left and right tables.
  • When the number of matching records exceeds 1024, but the number of records in the right table is greater than the number of matching records.
    • For example, R = 5000, M = 5000, L = 10000 (in this case, reducing L to 9000 records yields 4904 output rows).

Below is an example of the tables.

table_1 (left)

  • schema ({timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8()})
    • col_1: All values are 2023-09-14 21:43:18.678917
    • col_2: 1, 2, 3, 4, 5 ... N, -1, -1, -1, -1, -1, ... -1
    • col_3: All values are foo
  • N is the number of matching records between table 1 and table 2

table_2 (right)

  • schema ({timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8(), large_utf8()})
    • col_1: All values are 2023-09-14 21:43:18.678917
    • col_2: 1, 2, 3, 4, 5 ... M
    • col_3: All values are foo
    • col_4: All values are bar
  • M is the number of records for table 2

Example

  • The number of records for table 1: 5
  • N: 3
  • M: 4

table 1

| col_1 | col_2 | col_3 |
|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo |
| 2023-09-14 21:43:18.678917 | 2 | foo |
| 2023-09-14 21:43:18.678917 | 3 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |
| 2023-09-14 21:43:18.678917 | -1 | foo |

table 2

| col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|
| 2023-09-14 21:43:18.678917 | 1 | foo | bar |
| 2023-09-14 21:43:18.678917 | 2 | foo | bar |
| 2023-09-14 21:43:18.678917 | 3 | foo | bar |
| 2023-09-14 21:43:18.678917 | 4 | foo | bar |
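
To make this setup concrete, here is a minimal sketch of building table 1 with Arrow builders (the helper name `MakeLeftTable` and the epoch-microsecond constant are my own; table 2 is analogous, with the extra all-"bar" `col_4`):

```cpp
#include <arrow/api.h>

#include <memory>
#include <string>

// Builds table_1 as described above: N counting rows followed by "-1" rows.
arrow::Result<std::shared_ptr<arrow::Table>> MakeLeftTable(int64_t num_rows,
                                                           int64_t n_match) {
  auto ts_type = arrow::timestamp(arrow::TimeUnit::MICRO, "UTC");
  auto schema = arrow::schema({arrow::field("col_1", ts_type),
                               arrow::field("col_2", arrow::large_utf8()),
                               arrow::field("col_3", arrow::large_utf8())});

  arrow::TimestampBuilder ts_builder(ts_type, arrow::default_memory_pool());
  arrow::LargeStringBuilder col2_builder;
  arrow::LargeStringBuilder col3_builder;
  // 2023-09-14 21:43:18.678917 UTC as microseconds since the Unix epoch.
  constexpr int64_t kTimestamp = 1694727798678917;
  for (int64_t i = 0; i < num_rows; ++i) {
    ARROW_RETURN_NOT_OK(ts_builder.Append(kTimestamp));
    // First n_match rows count up from 1 (matching table_2); the rest are -1.
    ARROW_RETURN_NOT_OK(
        col2_builder.Append(i < n_match ? std::to_string(i + 1) : "-1"));
    ARROW_RETURN_NOT_OK(col3_builder.Append("foo"));
  }
  ARROW_ASSIGN_OR_RAISE(auto col1, ts_builder.Finish());
  ARROW_ASSIGN_OR_RAISE(auto col2, col2_builder.Finish());
  ARROW_ASSIGN_OR_RAISE(auto col3, col3_builder.Finish());
  return arrow::Table::Make(schema, {col1, col2, col3});
}
```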

benibus (Collaborator) commented Oct 6, 2023

I haven't managed to find the exact issue yet (I'm not too familiar with this code in particular), but this section is fairly suspicious:

arrow::util::TempVectorHolder<uint32_t> hash_holder(
    stack, arrow::util::MiniBatch::kMiniBatchLength);
uint32_t* hashes = hash_holder.mutable_data();
for (int64_t i = 0; i < key_batch.length;
     i += arrow::util::MiniBatch::kMiniBatchLength) {
  int64_t length =
      std::min(static_cast<int64_t>(key_batch.length - i),
               static_cast<int64_t>(arrow::util::MiniBatch::kMiniBatchLength));
  std::vector<KeyColumnArray> temp_column_arrays;
  RETURN_NOT_OK(Hashing32::HashBatch(key_batch, hashes, temp_column_arrays,
                                     ctx_->cpu_info()->hardware_flags(), stack,
                                     i, length));
  RETURN_NOT_OK(build_.builder_->PushNextBatch(thread_index, length, hashes));
}
return Status::OK();

Note that this function is exclusive to the bloom filter path, and the value of kMiniBatchLength happens to be 1024...
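
To spell out why that constant lines up with the 1024-row boundary observed above (an illustration of the loop's partitioning, not Arrow code): `HashBatch` is called once per mini-batch, with a nonzero start row from the second iteration onward, so a bug in slicing at a nonzero start row would leave the first 1024 rows intact and corrupt only the rest.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustration only: how the loop above partitions a 1025-row batch.
int main() {
  constexpr int64_t kMiniBatchLength = 1024;
  const int64_t batch_length = 1025;
  for (int64_t i = 0; i < batch_length; i += kMiniBatchLength) {
    const int64_t length = std::min(batch_length - i, kMiniBatchLength);
    // Prints "rows [0, 1024)" then "rows [1024, 1025)"; only the second
    // mini-batch is hashed with a nonzero start row.
    std::printf("rows [%lld, %lld)\n", static_cast<long long>(i),
                static_cast<long long>(i + length));
  }
  return 0;
}
```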

llama90 (Contributor, Author) commented Oct 7, 2023

That seems right.

I've noticed that when I arbitrarily increase the kMiniBatchLength value to be larger than the number of matching records, the results come out fine.

I suspect there might be an issue with handling the remainder when the input exceeds the mini-batch size, so I'm looking into that part.

Thank you for checking!

llama90 (Contributor, Author) commented Oct 8, 2023

It seems that I have fixed the issue.

I will clean up the code, write unit tests, and aim to submit a PR as soon as possible. Thanks to your review, I was able to reproduce the symptom at a smaller scale for testing. I appreciate your review once again. @benibus

llama90 added a commit to llama90/arrow that referenced this issue Oct 9, 2023
llama90 added a commit to llama90/arrow that referenced this issue Oct 10, 2023
ianmcook added this to the 14.0.0 milestone Oct 10, 2023
ianmcook added the Priority: Blocker label Oct 10, 2023
llama90 added a commit to llama90/arrow that referenced this issue Oct 11, 2023
pitrou pushed a commit that referenced this issue Oct 16, 2023
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the wrong results in inner joins during hash join operations were caused by a problem with how large string and binary types were handled: the `Slice` function was not calculating their sizes correctly.

To fix this, I changed the `Slice` function to calculate the sizes correctly based on the data type, for large string and large binary.
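
As a conceptual sketch of the problem (illustrative code, not the actual Arrow internals): slicing the offsets buffer of a variable-length column must use the offset width of the type, 4 bytes for `utf8`/`binary` but 8 bytes for `large_utf8`/`large_binary`, so a hard-coded 4-byte stride points every slice with a nonzero start row at the wrong offsets.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: byte position of row `start_row` in an offsets buffer.
// Using sizeof(int32_t) for a large type would land in the middle of an
// 8-byte offset for every start_row > 0 (i.e. every mini-batch after the
// first), which matches the symptom described in the issue.
const uint8_t* SliceOffsets(const uint8_t* offsets_buffer, int64_t start_row,
                            bool is_large_type) {
  const size_t offset_width = is_large_type ? sizeof(int64_t) : sizeof(int32_t);
  return offsets_buffer + start_row * offset_width;
}
```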

* Issue raised: #37729 

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The Hash Join random test (`TEST(HashJoin, Random)`) was modified to allow the creation of Large String key column values.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Acero might not have a large user base, as it is an experimental feature, but I deemed incorrect join results critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
pitrou modified the milestones: 14.0.0, 15.0.0 Oct 16, 2023
raulcd modified the milestones: 15.0.0, 14.0.0 Oct 16, 2023
raulcd pushed a commit that referenced this issue Oct 16, 2023
kou changed the title from [C++] [Acero] Incorrect results in inner join to [C++][Acero] Incorrect results in inner join Oct 17, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
amoeba added the Critical Fix label Nov 17, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024