GH-38074: [C++] Fix Offset Size Calculation for Slicing Large String and Binary Types in Hash Join #38147

Merged
merged 12 commits into from
Oct 16, 2023

Conversation

llama90
Contributor

@llama90 llama90 commented Oct 9, 2023

Rationale for this change

Inner joins executed through the hash join returned wrong results because large string and large binary key columns were handled incorrectly: the Slice function did not calculate their offset sizes correctly.

To fix this, I changed the Slice function to calculate the offset size based on the data type, so that large string and large binary columns are handled correctly.

What changes are included in this PR?

  • The Slice function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
  • Unit tests (TEST(KeyColumnArray, SliceBinaryTest)) for the Slice function have been added.
  • During random tests for Hash Join (TEST(HashJoin, Random)), modifications were made to allow the creation of Large String as key column values.

Are these changes tested?

Yes

Are there any user-facing changes?

Acero is an experimental feature and might not have a large user base, but I considered incorrect join results critical and have fixed the bug.

@github-actions

github-actions bot commented Oct 9, 2023

⚠️ GitHub issue #38074 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 9, 2023
@llama90 llama90 requested a review from pitrou October 9, 2023 15:22
@pitrou
Member

pitrou commented Oct 9, 2023

@llama90 I don't understand why you've changed your fix, while I was asking you to explain the underlying bug.

@llama90
Contributor Author

llama90 commented Oct 9, 2023

@pitrou I thought you were pointing out a part in the code where a bug could occur due to implicit type conversion.

So that's why I made the changes, and I didn't realize that there should be a discussion first when such reviews are given. I apologize for the confusion.

Are you, fundamentally, asking why the code was changed?

The initial issue raised was regarding incorrect return values of the Inner Join. Upon analyzing the code, it was found that during the execution of the BuildBloomFilter_exec_task function, incorrect offset calculations were made when calling the HashBatch function, leading to incorrect hash values being generated.

HashBatch is responsible for copying ColumnArrays within the Key Batch using offset and length, and it calls the Slice function during this process.

In the issue, a large_utf8 type key column was being used, and the original code was set to always calculate the offset for such binary types as uint32_t size, which resulted in incorrect Inner Join outcomes.
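
To make the explanation above concrete, here is a minimal sketch of the stride mismatch. It is not Arrow's `KeyColumnArray::Slice`; `EntryAtStride` and `ExampleOffsets` are hypothetical names introduced only for illustration. It assumes an offsets buffer holding 64-bit entries, as large_utf8 and large_binary use, and shows what happens when the row index is scaled by `sizeof(uint32_t)` instead of `sizeof(int64_t)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not Arrow's KeyColumnArray::Slice: returns the offsets
// entry found at row `row` when entries are assumed to be `width` bytes apart.
// The entry is always decoded as int64_t because the buffer below holds
// large_utf8-style 64-bit offsets.
inline std::int64_t EntryAtStride(const std::vector<std::int64_t>& offsets,
                                  std::size_t row, std::size_t width) {
  // Guard against reading past the end of the buffer.
  assert(row * width + sizeof(std::int64_t) <=
         offsets.size() * sizeof(std::int64_t));
  const auto* base = reinterpret_cast<const std::uint8_t*>(offsets.data());
  std::int64_t value = 0;
  std::memcpy(&value, base + row * width, sizeof(value));
  return value;
}

// Offsets for three large_utf8 values of lengths 3, 4 and 5.
inline std::vector<std::int64_t> ExampleOffsets() { return {0, 3, 7, 12}; }
```

With these offsets, slicing at row 2 with an 8-byte stride finds 7, the start of the third value. With a 4-byte stride the pointer advances only 8 bytes and lands on row 1's entry, 3, so every value read from the slice is taken from the wrong position.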

@pitrou
Member

pitrou commented Oct 9, 2023

> In the issue, a large_utf8 type key column was being used, and the original code was set to always calculate the offset for such binary types as uint32_t size, which resulted in incorrect Inner Join outcomes.

Ahah, ok, thanks for the explanation.

Member

@pitrou pitrou left a comment

Here are some comments.

Also, can you add join tests with large_binary or large_utf8?

cpp/src/arrow/compute/light_array_test.cc (outdated, resolved)
cpp/src/arrow/compute/light_array.cc (resolved)
@llama90
Contributor Author

llama90 commented Oct 9, 2023

@pitrou You are right. I will incorporate your feedback and push a commit as soon as possible.

@llama90
Contributor Author

llama90 commented Oct 10, 2023

> Here are some comments.
>
> Also, can you add join tests with large_binary or large_utf8?

I added a unit test for inner joins with large_binary or large_utf8.

@llama90 llama90 requested a review from pitrou October 10, 2023 16:33
@ianmcook
Member

@llama90 could you please run the linter? Instructions at https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

@llama90
Contributor Author

llama90 commented Oct 11, 2023

> @llama90 could you please run the linter? Instructions at https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

Did I apply the lint correctly as you intended?

@ianmcook
Member

> > @llama90 could you please run the linter? Instructions at https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
>
> Did I apply the lint correctly as you intended?

Yes, the "Dev / Lint C++, Python, R, Docker, RAT" test is passing now

@ianmcook
Member

@llama90 could you please merge/rebase this with the latest changes on the main branch? That should fix the remaining CI failure.

@pitrou
Member

pitrou commented Oct 11, 2023

FTR, I still need to take a look at the fix and see if we can make things more maintainable and more understandable in the future.

@llama90
Contributor Author

llama90 commented Oct 11, 2023

> FTR, I still need to take a look at the fix and see if we can make things more maintainable and more understandable in the future.

If possible, could you provide specific guidelines?

@llama90
Contributor Author

llama90 commented Oct 11, 2023

> @llama90 could you please merge/rebase this with the latest changes on the main branch? That should fix the remaining CI failure.

I rebased my working branch onto the latest main branch and encountered the following error.

✅ For now, I resolved the error by adding the -DARROW_FLIGHT=OFF -DARROW_FLIGHT_SQL=OFF options.

[873/1122] Building CXX object src/arrow/flight/sql/CMakeFiles/acero-flight-sql-server.dir/example/acero_server.cc.o
FAILED: src/arrow/flight/sql/CMakeFiles/acero-flight-sql-server.dir/example/acero_server.cc.o 
/opt/homebrew/bin/ccache /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DARROW_EXTRA_ERROR_CONTEXT -DARROW_HAVE_NEON -DARROW_HDFS -DARROW_MIMALLOC -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DGFLAGS_IS_A_DLL=0 -DGRPC_ENABLE_ASYNC -DGRPC_NAMESPACE_FOR_TLS_CREDENTIALS_OPTIONS=grpc::experimental -DGRPC_USE_CERTIFICATE_VERIFIER -DGRPC_USE_TLS_CHANNEL_CREDENTIALS_OPTIONS -DURI_STATIC_BUILD -DUTF8PROC_STATIC -I/Users/lama/workspace/arrow-2/cpp/build-debug/src -I/Users/lama/workspace/arrow-2/cpp/src -I/Users/lama/workspace/arrow-2/cpp/src/generated -I/Users/lama/workspace/arrow-2/cpp/build-debug/substrait_ep-generated -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/grpc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/absl_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/re2_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/cares_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/zlib_ep/src/zlib_ep-install/include -isystem /opt/homebrew/opt/openssl@3/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/protobuf_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/thirdparty/flatbuffers/include -isystem /Users/lama/workspace/arrow-2/cpp/thirdparty/hadoop/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/google_cloud_cpp_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/nlohmann_json_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/crc32c_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/boost_ep-prefix/src/boost_ep -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/brotli_ep/src/brotli_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/bzip2_ep-install/include 
-isystem /Users/lama/workspace/arrow-2/cpp/build-debug/lz4_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/snappy_ep/src/snappy_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/zstd_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/orc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/awssdk_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/utf8proc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/rapidjson_ep/src/rapidjson_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/xsimd_ep/src/xsimd_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/jemalloc_ep-prefix/src -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/mimalloc_ep/src/mimalloc_ep/include/mimalloc-2.0 -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googletest/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googletest -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googlemock/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googlemock -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/thrift_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/gflags_ep-prefix/src/gflags_ep/include -fno-aligned-new  -Qunused-arguments -fcolor-diagnostics  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wdate-time -Wno-unknown-warning-option -Wno-pass-failed -march=armv8-a  -g -Werror -O0 -ggdb  -std=c++17 -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk -fPIE -fcolor-diagnostics -MD -MT src/arrow/flight/sql/CMakeFiles/acero-flight-sql-server.dir/example/acero_server.cc.o -MF 
src/arrow/flight/sql/CMakeFiles/acero-flight-sql-server.dir/example/acero_server.cc.o.d -o src/arrow/flight/sql/CMakeFiles/acero-flight-sql-server.dir/example/acero_server.cc.o -c /Users/lama/workspace/arrow-2/cpp/src/arrow/flight/sql/example/acero_server.cc
/Users/lama/workspace/arrow-2/cpp/src/arrow/flight/sql/example/acero_server.cc:169:86: error: missing field 'app_metadata' initializer [-Werror,-Wmissing-field-initializers]
        Ticket{std::move(ticket)}, /*locations=*/{}, /*expiration_time=*/std::nullopt}};
                                                                                     ^
1 error generated.
[877/1122] Building CXX object src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/example/acero_server.cc.o
FAILED: src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/example/acero_server.cc.o 
/opt/homebrew/bin/ccache /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DARROW_EXTRA_ERROR_CONTEXT -DARROW_FLIGHT_SQL_STATIC -DARROW_FLIGHT_STATIC -DARROW_HAVE_NEON -DARROW_HDFS -DARROW_MIMALLOC -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DGRPC_ENABLE_ASYNC -DGRPC_NAMESPACE_FOR_TLS_CREDENTIALS_OPTIONS=grpc::experimental -DGRPC_USE_CERTIFICATE_VERIFIER -DGRPC_USE_TLS_CHANNEL_CREDENTIALS_OPTIONS -DURI_STATIC_BUILD -DUTF8PROC_STATIC -I/Users/lama/workspace/arrow-2/cpp/build-debug/src -I/Users/lama/workspace/arrow-2/cpp/src -I/Users/lama/workspace/arrow-2/cpp/src/generated -I/Users/lama/workspace/arrow-2/cpp/build-debug/substrait_ep-generated -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/src/arrow/flight/sql/.. -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/grpc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/absl_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/re2_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/cares_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/zlib_ep/src/zlib_ep-install/include -isystem /opt/homebrew/opt/openssl@3/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/protobuf_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/thirdparty/flatbuffers/include -isystem /Users/lama/workspace/arrow-2/cpp/thirdparty/hadoop/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/google_cloud_cpp_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/nlohmann_json_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/crc32c_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/boost_ep-prefix/src/boost_ep -isystem 
/Users/lama/workspace/arrow-2/cpp/build-debug/brotli_ep/src/brotli_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/bzip2_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/lz4_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/snappy_ep/src/snappy_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/zstd_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/orc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/awssdk_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/utf8proc_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/rapidjson_ep/src/rapidjson_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/xsimd_ep/src/xsimd_ep-install/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/jemalloc_ep-prefix/src -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/mimalloc_ep/src/mimalloc_ep/include/mimalloc-2.0 -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googletest/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googletest -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googlemock/include -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/_deps/googletest-src/googlemock -isystem /Users/lama/workspace/arrow-2/cpp/build-debug/thrift_ep-install/include -fno-aligned-new  -Qunused-arguments -fcolor-diagnostics  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wdate-time -Wno-unknown-warning-option -Wno-pass-failed -march=armv8-a  -g -Werror -O0 -ggdb  -std=c++17 -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk -fPIE -fcolor-diagnostics -MD -MT 
src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/example/acero_server.cc.o -MF src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/example/acero_server.cc.o.d -o src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/example/acero_server.cc.o -c /Users/lama/workspace/arrow-2/cpp/src/arrow/flight/sql/example/acero_server.cc
/Users/lama/workspace/arrow-2/cpp/src/arrow/flight/sql/example/acero_server.cc:169:86: error: missing field 'app_metadata' initializer [-Werror,-Wmissing-field-initializers]
        Ticket{std::move(ticket)}, /*locations=*/{}, /*expiration_time=*/std::nullopt}};
                                                                                     ^
1 error generated.
[880/1122] Building CXX object src/arrow/flight/sql/CMakeFiles/arrow-flight-sql-test.dir/acero_test.cc.o
ninja: build stopped: subcommand failed.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Oct 13, 2023
@llama90
Contributor Author

llama90 commented Oct 13, 2023

@westonpace I've refined the code, removing the dictionary type as it doesn't seem to be added in any test.

Also, I truly appreciate all the reviews.

As a beginner, I feel both overwhelmed and excited to handle an issue that requires a complex understanding. While I am aware of my limitations, I am committed to giving my best.

I humbly ask for your generous advice and guidance. Thank you.

@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Oct 13, 2023
Member

@westonpace westonpace left a comment

Thanks for your contribution! This is an improvement over what was there before. I think, with this PR, that slicing KeyColumnArray with large string works.

I'm not quite convinced yet that large strings work consistently in the hash join. I see you did add some testing of large string / hash join in the hash_join_node_test but these don't cover values greater than 2^32 (which is hard to do in any kind of performant test sadly). So maybe this is a sign that we support "large strings that could be stored as small strings"

If the values in buffers_[1] of the key column array are ever cast to int32_t in the hash join code (which I feel they most likely are) then this type of failure wouldn't show up until actual large strings start showing up.
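
The concern above can be illustrated with a small sketch. It does not reproduce any code from the hash join; `FitsInInt32`, `LowBits`, and `CheckedNarrow` are hypothetical helpers, and whether such a narrowing actually happens in the join code is not established here. The sketch only shows why a 64-bit offset that is narrowed to 32 bits goes unnoticed while values are small and breaks once they grow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a 64-bit offset can be stored in int32_t.
inline bool FitsInInt32(std::int64_t offset) {
  return offset >= std::numeric_limits<std::int32_t>::min() &&
         offset <= std::numeric_limits<std::int32_t>::max();
}

// Keeps only the low 32 bits, which is what an unchecked narrowing to an
// unsigned 32-bit type does.
inline std::uint32_t LowBits(std::int64_t offset) {
  return static_cast<std::uint32_t>(offset);
}

// Narrowing that fails loudly in debug builds instead of wrapping silently.
inline std::int32_t CheckedNarrow(std::int64_t offset) {
  assert(FitsInInt32(offset));
  return static_cast<std::int32_t>(offset);
}
```

An offset of 2^32 + 5 narrows to 5 through `LowBits`, so a test that only uses small offsets would pass either way. A guard such as `CheckedNarrow` is one possible way to surface the problem early in debug builds.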

However, this is an improvement, and we don't need to solve every problem all at once, so I don't see any real concern with proceeding with this PR if @pitrou is satisfied.

@github-actions github-actions bot added the awaiting merge label and removed the awaiting changes label Oct 14, 2023
@westonpace
Member

> As a beginner, I feel both overwhelmed and excited to handle an issue that requires a complex understanding. While I am aware of my limitations, I am committed to giving my best.

The hash join code is complex and quite different from the rest of the arrow code base. It unfortunately reinvents things that we have elsewhere. Don't worry about feeling overwhelmed here, I think many of us are. Do you have any long term goals for this feature?

@pitrou
Member

pitrou commented Oct 14, 2023

Yes, I agree the PR as is is a good improvement now.

What I would suggest is to update the PR title and description to better explain the problem. Specifically, it is about slicing large string and large binary types, with the problem being the offset size not correctly computed, IIUC.

("uint64_t Types" in the title is really confusing as this PR has nothing to do with 64-bit integer columns)

@pitrou
Member

pitrou commented Oct 14, 2023

Also, big +1 to what @westonpace said above. You definitely didn't choose the easiest part of Arrow to contribute to :-)

@llama90 llama90 changed the title from "GH-38074: [C++] Support uint64_t Types in Slice Function to Address Specific Inner Join Bug" to "GH-38074: [C++] Fix Offset Size Calculation for Slicing Large String and Binary Types in Hash Join" Oct 14, 2023
@llama90
Contributor Author

llama90 commented Oct 14, 2023

@pitrou Hello, I have revised and updated the PR title and content.

@westonpace It seems like the issues you mentioned include the following items:

  • Support for Dictionary types in key columns during Hash Join
  • Handling of Dictionary types in Hash Join (Swiss)
  • Join support when some columns contain lists

All are related to joins and seem to be interesting areas. I am also interested in the issues you've highlighted and would like to attempt improvements when I have some spare time.

I feel proud to have made a meaningful contribution.

@pitrou @westonpace @ianmcook Thank you again for your review, and I hope to engage with you more frequently with new contributions.

@llama90 llama90 requested a review from pitrou October 15, 2023 09:19
@pitrou pitrou merged commit fb26178 into apache:main Oct 16, 2023
41 checks passed
@pitrou pitrou removed the awaiting merge label Oct 16, 2023
@pitrou
Member

pitrou commented Oct 16, 2023

Thanks a lot for this fix @llama90 !

@raulcd This should probably be a candidate for 14.0.0.

@github-actions github-actions bot added the awaiting committer review label Oct 16, 2023
raulcd pushed a commit that referenced this pull request Oct 16, 2023
…and Binary Types in Hash Join (#38147)

### Rationale for this change

We found that the wrong results in inner joins during hash join operations were caused by a problem with how large strings and binary types were handled. The `Slice` function was not calculating their sizes correctly.

To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the type of data for large string and binary. 

* Issue raised: #37729 

### What changes are included in this PR?

* The `Slice` function has been updated to correctly calculate the offset for Large String and Large Binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* During random tests for Hash Join (`TEST(HashJoin, Random)`), modifications were made to allow the creation of Large String as key column values.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results as critical and have addressed the bug.

* Closes: #38074

Authored-by: Hyunseok Seo
Signed-off-by: Antoine Pitrou
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit fb26178.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues.

[C++][Acero] Incorrect results in inner join
4 participants