[C++] Segmentation fault in hash-join/swiss-join #39951

Closed
mpimenov opened this issue Feb 5, 2024 · 4 comments

Comments

@mpimenov

mpimenov commented Feb 5, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Continuing from #32570

Here is the crash I see in my code occasionally. Unfortunately, I do not have a small or even a reliable test case to reproduce.

@zanmato1984 fixed several crashes related to hash join before, I suspect this one may be another case from that family.

Visit<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:332:9)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:133 [0x2e57685]
DecodeSelected at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:330 [0x2e57685]
FlushBuildColumn at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1672 [0x2e5f9f6]
Flush at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1716 [0x2e60178]
AppendAndOutput<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:612:9), (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993:9)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:570 [0x2e60dbc]
Append<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993:9)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:610 [0x2e60dbc]
OnNextBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993 [0x2e60dbc]
ProbeSingleBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2144 [0x2e653e7]
OnProbeSideBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:818 [0x2e0766d]
InputReceived at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:891 [0x2e06491]
OutputBatchCallback at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:1004 [0x2e0a3af]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947 [0x2e0a3af]
__invoke_impl<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947:5) &, long, arrow::compute::ExecBatch> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e0a1fb]
__invoke_r<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947:5) &, long, arrow::compute::ExecBatch> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:114 [0x2e0a1fb]
_M_invoke at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:290 [0x2e0a1fb]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:591 [0x2e60eb5]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993 [0x2e60eb5]
AppendAndOutput<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:612:9), (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993:9)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:571 [0x2e60eb5]
Append<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993:9)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:610 [0x2e60eb5]
OnNextBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:1993 [0x2e60eb5]
ProbeSingleBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2144 [0x2e653e7]
OnProbeSideBatch at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:818 [0x2e0766d]
InputReceived at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:891 [0x2e06491]
OutputBatchCallback at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:1004 [0x2e0a3af]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947 [0x2e0a3af]
__invoke_impl<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947:5) &, long, arrow::compute::ExecBatch> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e0a1fb]
__invoke_r<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:947:5) &, long, arrow::compute::ExecBatch> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:114 [0x2e0a1fb]
_M_invoke at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:290 [0x2e0a1fb]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:591 [0x2e618ef]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2039 [0x2e618ef]
Flush<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2039:5)> at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join_internal.h:626 [0x2e618ef]
OnFinished at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2039 [0x2e618ef]
OnScanHashTableFinished at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2425 [0x2e683ff]
StartScanHashTable at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2323 [0x2e68ff0]
ProbingFinished at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/swiss_join.cc:2153 [0x2e655f9]
OnQueuedBatchesProbed at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:876 [0x2e0a61b]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:968 [0x2e0a61b]
__invoke_impl<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:967:9) &, unsigned long> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e0a61b]
__invoke_r<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/hash_join_node.cc:967:9) &, unsigned long> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:114 [0x2e0a61b]
_M_invoke at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:290 [0x2e0a61b]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:591 [0x2e695f5]
OnTaskGroupFinished at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/task_util.cc:252 [0x2e695f5]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/task_util.cc:371 [0x2e6a313]
__invoke_impl<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/task_util.cc:371:5) &, unsigned long> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e6a313]
__invoke_r<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/task_util.cc:371:5) &, unsigned long> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:114 [0x2e6a313]
_M_invoke at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:290 [0x2e6a313]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:591 [0x2e3279f]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/query_context.cc:82 [0x2e3279f]
__invoke_impl<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/query_context.cc:80:40) &> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e3279f]
__invoke_r<arrow::Status, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/acero/query_context.cc:80:40) &> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:114 [0x2e3279f]
_M_invoke at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:290 [0x2e3279f]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_function.h:591 [0x2e3355e]
operator()<std::function<arrow::Status ()> &, arrow::Status, arrow::Future<arrow::internal::Empty> > at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/future.h:150 [0x2e3355e]
__invoke_impl<void, arrow::detail::ContinueFuture &, arrow::Future<arrow::internal::Empty> &, std::function<arrow::Status ()> &> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x2e3355e]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/functional.h:140 [0x3376417]
WorkerLoop at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/thread_pool.cc:457 [0x3376417]
operator() at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/thread_pool.cc:618 [0x3376417]
__invoke_impl<void, (lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/thread_pool.cc:616:23)> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:61 [0x3376417]
__invoke<(lambda at /tmp/source-root/.conan2/p/b/arrow19d6b0dc5db3a/b/src/cpp/src/arrow/util/thread_pool.cc:616:23)> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/invoke.h:96 [0x3376417]
_M_invoke<0UL> at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_thread.h:292 [0x3376417]
operator() at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_thread.h:299 [0x3376417]
_M_run at /usr/bin/../lib/gcc/x86_64-linux-gnu/14.0.0/../../../../include/c++/14.0.0/bits/std_thread.h:244 [0x3376417]

Component(s)

C++

@zanmato1984
Contributor

Thanks for reporting.

It's hard for me to see the cause from the bare stack trace alone. It would help to have a C++ test case, not necessarily a small one, that reproduces the crash, even if only intermittently.

Alternatively, if you are able to build Arrow yourself, it would also help to enable ASAN by adding the CMake options -DARROW_USE_ASAN=ON -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF. This should catch the issue more reliably and print extensive context for the crash.

@mpimenov
Author

Enabling ASAN makes this crash go away.
Enabling TSAN results in some reports which I describe in #40068, #40069.
Building in debug mode results in this assertion firing.

Lowering arrow::dataset::ScanOptions::batch_size to 16 also fixes the crash (and lowering to 1024 does not).

@zanmato1984
Contributor

> Enabling ASAN makes this crash go away. Enabling TSAN results in some reports which I describe in #40068, #40069. Building in debug mode results in this assertion firing.
>
> Lowering arrow::dataset::ScanOptions::batch_size to 16 also fixes the crash (and lowering to 1024 does not).

Thanks for the experiments. Though I can only guess what was happening, I think we are making progress.

First, the errors reported by TSAN don't seem to be related to this crash, but the fired assertion does. It indicates that an Arrow-managed, stack-like temp buffer overflowed, possibly causing subsequent unexpected behavior. It also explains why lowering batch_size makes the crash go away: smaller batches require less temp space. I can't explain why ASAN makes the crash go away, other than that it slows the program down significantly, which reduces the chance of a crash.
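To illustrate the kind of failure described above, here is a minimal sketch of a fixed-capacity, stack-like scratch buffer. It is not Arrow's implementation: the names TempStack and BatchFits, the capacity, and the assumption that scratch usage grows linearly with the number of rows in a batch are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity scratch buffer with stack (LIFO) discipline.
// Illustration only; this is NOT Arrow's actual class.
class TempStack {
 public:
  explicit TempStack(std::size_t capacity_bytes) : buffer_(capacity_bytes) {}

  // Reserve num_bytes on top of the stack. Returns nullptr when the request
  // does not fit, instead of silently handing out memory past the end.
  std::uint8_t* Alloc(std::size_t num_bytes) {
    if (num_bytes > buffer_.size() - top_) return nullptr;
    std::uint8_t* out = buffer_.data() + top_;
    top_ += num_bytes;
    return out;
  }

  // Release the most recent allocation. The caller must release in reverse
  // order of allocation and pass the same size it allocated.
  void Release(std::size_t num_bytes) {
    assert(num_bytes <= top_);
    top_ -= num_bytes;
  }

  std::size_t used() const { return top_; }

 private:
  std::vector<std::uint8_t> buffer_;
  std::size_t top_ = 0;
};

// Assumption for illustration: the scratch space needed for one batch is
// num_rows * bytes_per_row. Returns true if the batch fits in the stack.
bool BatchFits(TempStack& stack, std::size_t num_rows,
               std::size_t bytes_per_row) {
  const std::size_t needed = num_rows * bytes_per_row;
  std::uint8_t* scratch = stack.Alloc(needed);
  if (scratch == nullptr) return false;
  stack.Release(needed);
  return true;
}
```

With a hypothetical 1024-byte stack and 8 bytes of scratch per row, a 16-row batch fits while a 1024-row batch does not. If the capacity check were only a debug assertion, a release build would skip it and write past the end of the buffer, which would be consistent with an assertion firing in debug mode and memory corruption elsewhere.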

To verify whether the fired assertion is the root cause, could you try something similar to #40007 and see if it resolves the issue?

@mpimenov
Author

I agree that TSAN crashes seem unrelated, I've reported them for completeness only. Then again, you never know with races.

The patch in #40007 seems to resolve my case, thank you! It looks like that PR will be merged and a new "investigate why this is even needed" issue will be opened. If that's the case, I suggest linking this issue there; for now I'm closing it.
