-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Segmentation fault on arrow-compute-hash-join-node-test on macos nightlies #32570
Comments
Raúl Cumplido / @raulcd: 36/90 Test #35: arrow-compute-hash-join-node-test .........***Failed 3.62 sec
Running arrow-compute-hash-join-node-test, redirecting output into /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/arrow-HEAD.XXXXX.6JD4aZtZ/cpp-build/build/test-logs/arrow-compute-hash-join-node-test.txt (attempt 1/1)
/Users/runner/work/crossbow/crossbow/arrow/cpp/build-support/run-test.sh: line 88: 88684 Segmentation fault: 11 $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
Running main() from /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/arrow-HEAD.XXXXX.6JD4aZtZ/cpp-build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
[==========] Running 29 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from HashJoin
[ RUN ] HashJoin.Suffix
[ OK ] HashJoin.Suffix (5 ms)
[ RUN ] HashJoin.Random
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/arrow-HEAD.XXXXX.6JD4aZtZ/cpp-build/src/arrow/compute/exec |
Vibhatha Lakmal Abeykoon / @vibhatha:
It looks very much the same. But only reproduced this once (for 37 CI runs) |
Weston Pace / @westonpace: Once it merges in, we should see if this failure continues to occur. Either way, we have likely instances of this failure going back as far as I can go. E.g. we started tracking nightly failures in Zulip in May and I still see this test failure sporadically though I cannot confirm the cause because Github no longer has the logs. My suspicion is that, whatever this bug is, we have probably already released several releases with it, and it should not be a blocker for 10.0.0. |
Alessandro Molina / @amol-: |
I've seen this on the last nightly builds but this seems to be a flaky that's been happening for a long time:
|
This is still happening https://github.com/ursacomputing/crossbow/actions/runs/4667721380/jobs/8263874619 |
Captured a stack trace. Leaving it here for future reference for myself:
|
I kinda happen to be looking at this issue. I tried several ways to reproduce it and found that with all allocators (jemalloc and mimalloc) disabled, ASAN reported the issue almost all the time, with exact the same call stack (I guess allocators might prevent ASAN from detecting this issue). Also I confirmed that this issue still exists on main branch. Now I'm able to take a deeper look and hopefully will come up with the root cause and probably a fix soon. cc @westonpace |
I actually found two bugs. Bug 1 is the straight cause of this issue. Bug 2 will cause later failure when bug 1 is fixed. Bug 1In arrow/cpp/src/arrow/compute/light_array.cc Lines 599 to 611 in 4e58f7c
so we need to calculate the number of tail rows to skip (the skipped rows, or bytes, will be copied in bytes arrow/cpp/src/arrow/compute/light_array.cc Lines 612 to 619 in 4e58f7c
). But in the calculation we neglect the case that there might be multiple occurrences of the same row, which results in accessing the tail row that should be skipped. The access is by word and possibly unaligned, so if the tail row doesn't span full words, the unaligned word access will eventually exceed the buffer boundary (segmentation fault). Note that to let this bug happen, the tail row need to be close to the 64-byte aligned memory region (minimal of arrow buffer), otherwise the access will be still valid in the perspective of OS, even if it exceeds the "logically" boundary. I've simplified this bug into a unit test case in https://github.com/apache/arrow/pull/39234/files#diff-f33777ab347090e9e661ef1d9260ed1488902627f69a703097d73c6d186bca69. Another thing I want to mention is that, I only fixed the non-fixed-length case, because I think all fixed-length cases will be fine (disclaimer: I could be wrong) based on the following facts:
Bug 2After fixing bug 1, the PR #39234 filed to fix them both. |
Wow, thanks for picking this up and documenting you results so thoroughly! |
…nsecutive tail rows with the same id may exceed buffer boundary (#39234) ### Rationale for this change Addressed in #32570 (comment) ### What changes are included in this PR? 1. Skip consecutive rows with the same id when calculating rows to skip when appending to `ExecBatchBuilder`. 2. Fix the bug that column offset is neglected when calculating rows to skip. ### Are these changes tested? Yes. New UT included and the change is also protected by the existing case mentioned in the issue. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** Because #32570 is labeled critical, and causes a crash even when the API contract is upheld. * Closes: #32570 Authored-by: zanmato <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Update: these analysis of fixed size types are wrong. See #39583 and the subsequent fix #39585 for details. |
…ecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (#39585) ### Rationale for this change #39583 is a subsequent issue of #32570 (fixed by #39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of #39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: #39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585) ### Rationale for this change apache#39583 is a subsequent issue of apache#32570 (fixed by apache#39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of apache#39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: apache#39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ing consecutive tail rows with the same id may exceed buffer boundary (apache#39234) ### Rationale for this change Addressed in apache#32570 (comment) ### What changes are included in this PR? 1. Skip consecutive rows with the same id when calculating rows to skip when appending to `ExecBatchBuilder`. 2. Fix the bug that column offset is neglected when calculating rows to skip. ### Are these changes tested? Yes. New UT included and the change is also protected by the existing case mentioned in the issue. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** Because apache#32570 is labeled critical, and causes a crash even when the API contract is upheld. * Closes: apache#32570 Authored-by: zanmato <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585) ### Rationale for this change apache#39583 is a subsequent issue of apache#32570 (fixed by apache#39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of apache#39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: apache#39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
@zanmato1984 I'd like to report that while I'm seeing crashes with stacks very similar to the ones in the reference post above, the patches you provided do not fix them. Probably some cases remain that have not been handled. |
Would you mind to provide the stack you are seeing? That will be helpful. Thanks. |
Sure, here it is. The line numbers may be slightly off because I had to manually apply your patch to a then-fresh version of the repository (a couple of weeks before the Arrow 15 release date, when your second PR was still in review).
|
This stack seems quite different from the one in #32570 (comment). I think it is something else that my previous two PRs are not for. Perhaps you can file a new issue with the provided stack and let's move there for further discussion? Thanks. |
…ing consecutive tail rows with the same id may exceed buffer boundary (apache#39234) ### Rationale for this change Addressed in apache#32570 (comment) ### What changes are included in this PR? 1. Skip consecutive rows with the same id when calculating rows to skip when appending to `ExecBatchBuilder`. 2. Fix the bug that column offset is neglected when calculating rows to skip. ### Are these changes tested? Yes. New UT included and the change is also protected by the existing case mentioned in the issue. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** Because apache#32570 is labeled critical, and causes a crash even when the API contract is upheld. * Closes: apache#32570 Authored-by: zanmato <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585) ### Rationale for this change apache#39583 is a subsequent issue of apache#32570 (fixed by apache#39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of apache#39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: apache#39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (#39585) ### Rationale for this change #39583 is a subsequent issue of #32570 (fixed by #39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of #39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: #39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585) ### Rationale for this change apache#39583 is a subsequent issue of apache#32570 (fixed by apache#39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of apache#39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: apache#39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585) ### Rationale for this change apache#39583 is a subsequent issue of apache#32570 (fixed by apache#39234). The last issue and fixed only resolved var length types. It turns out fixed size types have the same issue. ### What changes are included in this PR? Do the same fix of apache#39234 for fixed size types. ### Are these changes tested? UT included. ### Are there any user-facing changes? * Closes: apache#39583 Authored-by: zanmato1984 <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Some of our nightly builds are failing due to a segmentation fault on hash-join tests:
The failures can be seen. It seems to be only related to macos from the failed jobs:
verify-rc-source-cpp-macos-conda-amd64
verify-rc-source-integration-macos-conda-amd64
verify-rc-source-python-macos-amd64
Reporter: Raúl Cumplido / @raulcd
Assignee: Vibhatha Lakmal Abeykoon / @vibhatha
PRs and other links:
Note: This issue was originally created as ARROW-17292. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: