GH-38074: [C++] Fix Offset Size Calculation for Slicing Large String and Binary Types in Hash Join #38147
Conversation
@llama90 I don't understand why you've changed your fix, while I was asking you to explain the underlying bug.
@pitrou I thought you were pointing out a part of the code where a bug could occur due to implicit type conversion, which is why I made the changes; I didn't realize that such review comments called for a discussion first. I apologize for the confusion. Are you asking, fundamentally, why the code was changed? The initial issue raised was about incorrect return values from the inner join. Upon analyzing the code, it was found that during the execution of the …
In the issue, a …
Ahah, ok, thanks for the explanation.
Here are some comments.
Also, can you add join tests with large_binary or large_utf8?
@pitrou You are right. I will incorporate your feedback and push a commit as soon as possible.
I added a unit test for inner join with …
@llama90 could you please run the linter? Instructions at https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
Did I apply the lint correctly as you intended?
Yes, the "Dev / Lint C++, Python, R, Docker, RAT" test is passing now.
@llama90 could you please merge/rebase this with the latest changes on the main branch? That should fix the remaining CI failure.
FTR, I still need to take a look at the fix and see if we can make things more maintainable and more understandable in the future.
(force-pushed from 4df23e9 to dd5bce5)
If possible, could you provide specific guidelines?
I rebased the main branch code onto my working branch and encountered the following error.
…types in slice function
…e conversion in uint32_t * int64_t
…e_utf8 and large_binary
…r the slice function with binary type
(force-pushed from dd5bce5 to 222de88)
@westonpace I've refined the code, removing the dictionary type since it doesn't seem to be exercised by any test. Also, I truly appreciate all the reviews. As a beginner, I feel both overwhelmed and excited to handle an issue that requires such a complex understanding. While I am aware of my limitations, I am committed to giving my best. I humbly ask for your advice and guidance. Thank you.
Thanks for your contribution! This is an improvement over what was there before. I think that, with this PR, slicing a KeyColumnArray with large strings works.

I'm not quite convinced yet that large strings work consistently in the hash join. I see you added some testing of large string / hash join in hash_join_node_test, but these tests don't cover values greater than 2^32 (which is hard to do in any kind of performant test, sadly). So maybe all this shows is that we support "large strings that could be stored as small strings". If the values in buffers_[1] of the key column array are ever cast to int32_t in the hash join code (which I suspect they are), then this type of failure wouldn't show up until actual large strings start appearing.

However, this is an improvement, and we don't need to solve every problem all at once, so I don't see any real concern with proceeding with this PR if @pitrou is satisfied.

The hash join code is complex and quite different from the rest of the Arrow code base. It unfortunately reinvents things that we have elsewhere. Don't worry about feeling overwhelmed here; I think many of us are. Do you have any long-term goals for this feature?
Yes, I agree the PR, as is, is a good improvement now. What I would suggest is to update the PR title and description to better explain the problem. Specifically, it is about slicing large string and large binary types, with the problem being that the offset size was not correctly computed, IIUC. ("uint64_t Types" in the title is really confusing, as this PR has nothing to do with 64-bit integer columns.)
Also, a big +1 to what @westonpace said above. You definitely didn't choose the easiest part of Arrow to contribute to :-)
@pitrou Hello, I have revised and updated the PR title and description. @westonpace It seems the issues you mentioned include the following items:
All are related to joins and seem to be interesting areas. I am also interested in the issues you've highlighted and would like to attempt improvements when I have some spare time. I feel proud to have made a meaningful contribution. @pitrou @westonpace @ianmcook Thank you again for your review, and I hope to engage with you more frequently through new contributions.
…and Binary Types in Hash Join (#38147)

### Rationale for this change
We found that wrong results in inner joins during hash join operations were caused by how large string and binary types were handled: the `Slice` function was not calculating their sizes correctly. To fix this, I changed the `Slice` function to calculate the sizes correctly, based on the data type, for large string and binary.
* Issue raised: #37729

### What changes are included in this PR?
* The `Slice` function has been updated to correctly calculate the offset for large string and large binary types, and assertion statements have been added to improve maintainability.
* Unit tests (`TEST(KeyColumnArray, SliceBinaryTest)`) for the `Slice` function have been added.
* The random tests for hash join (`TEST(HashJoin, Random)`) were modified to allow creating large strings as key column values.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Acero might not have a large user base as it is an experimental feature, but I deemed the issue of incorrect join results critical and have addressed the bug.
* Closes: #38074

Authored-by: Hyunseok Seo <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit fb26178. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.