(signal: 11, SIGSEGV: invalid memory reference) reported during release verification #5145

Closed
alamb opened this issue Feb 1, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@alamb
Contributor

alamb commented Feb 1, 2023

Describe the bug
Reported by @waynexia on the dev mailing list https://lists.apache.org/thread/j1t3oly1p175p6xh3dlrojgldmx3h6pm


I get an error with the verify script, but it can pass when I run `cargo test --all` manually. The error message is

    error: test failed, to rerun pass `-p datafusion-proto --lib`

    Caused by:
      process didn't exit successfully: `/private/var/folders/xf/hbx7j1g15r134f0pl0wl47j80000gn/T/arrow-17.0.0.XXXXX.Nm9elczf/apache-arrow-datafusion-17.0.0/target/debug/deps/datafusion_proto-c4f7c76569fc195a` (signal: 11, SIGSEGV: invalid memory reference)

Expected behavior
Script should pass


@alamb alamb added the bug Something isn't working label Feb 1, 2023
@waynexia
Member

waynexia commented Feb 1, 2023

So far I have only reproduced it in the temporary environment created by the verify script. I'll look for other ways to reproduce it so I can provide more information.

@ozankabak
Contributor

I see this quite frequently when I run unit tests.

@waynexia
Member

I cannot get the tarball via the verify script now; it reports a "not found" error. I've checked that page and it only contains Arrow files. Maybe I'm missing something?

But the normal test method (cargo test) on my env still passes.

@alamb
Contributor Author

alamb commented Feb 12, 2023

> I cannot get the tarball via the verify script now; it reports a "not found" error. I've checked that page and it only contains Arrow files. Maybe I'm missing something?

Once the RC is approved, it gets moved in SVN to
https://dist.apache.org/repos/dist/release/arrow/arrow-datafusion-17.0.0/

I can dig up old versions from the SVN repo if anyone needs them.

@waynexia
Member

waynexia commented Feb 25, 2023

I reproduced it when verifying release 19.0.0 RC1. I suspect this SIGSEGV is caused by a stack overflow while running the test `roundtrip_deeply_nested`, which triggers very deep recursion:
https://github.com/apache/arrow-datafusion/blob/eda875bfb579cc75698c96e4207d7813cf1a8a37/datafusion/proto/src/bytes/mod.rs#L415-L420

It has 784 stack frames when failing:

    frame #771: 0x00000001002829fc datafusion_proto-89e1dd0bdfe0a085`datafusion_proto::bytes::test::roundtrip_deeply_nested::_$u7b$$u7b$closure$u7d$$u7d$::h1a1fc66429fca31d((null)={closure_env#0} @ 0x0000000172f3acaf) at mod.rs:461:35
    frame #772: 0x00000001000230c4 datafusion_proto-89e1dd0bdfe0a085`std::sys_common::backtrace::__rust_begin_short_backtrace::hd4a7b09c2a677a33(f={closure_env#0} @ 0x0000000172f3acfe) at backtrace.rs:121:18
    frame #773: 0x0000000100270a18 datafusion_proto-89e1dd0bdfe0a085`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h419ecd7fa9846e99 at mod.rs:550:17
    frame #774: 0x00000001002758a0 datafusion_proto-89e1dd0bdfe0a085`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hbbd5ca162943921f(self=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<datafusion_proto::bytes::test::roundtrip_deeply_nested::{closure_env#0}, ()>> @ 0x0000000172f3ad4f, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #775: 0x000000010022cfd0 datafusion_proto-89e1dd0bdfe0a085`std::panicking::try::do_call::h679e56c379f6cd05(data="") at panicking.rs:483:40
    frame #776: 0x000000010022d218 datafusion_proto-89e1dd0bdfe0a085`__rust_try + 32
    frame #777: 0x000000010022cec8 datafusion_proto-89e1dd0bdfe0a085`std::panicking::try::h4a06b00fd06a7006(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<datafusion_proto::bytes::test::roundtrip_deeply_nested::{closure_env#0}, ()>> @ 0x0000000172f3adfd) at panicking.rs:447:19
    frame #778: 0x0000000100027658 datafusion_proto-89e1dd0bdfe0a085`std::panic::catch_unwind::h9b5885fe3f384ccd(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<datafusion_proto::bytes::test::roundtrip_deeply_nested::{closure_env#0}, ()>> @ 0x0000000172f3ae3f) at panic.rs:137:14
    frame #779: 0x0000000100270278 datafusion_proto-89e1dd0bdfe0a085`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h1dba44a24758c3c6 at mod.rs:549:30
    frame #780: 0x00000001001ebed8 datafusion_proto-89e1dd0bdfe0a085`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hc5cb988a34a7e66a((null)=0x000060000020c040, (null)=<unavailable>) at function.rs:507:5
    frame #781: 0x0000000102f06504 datafusion_proto-89e1dd0bdfe0a085`std::sys::unix::thread::Thread::new::thread_start::h92ee0ad602ca1aab [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hf101e7e0479e2c15 at boxed.rs:2000:9 [opt]
    frame #782: 0x0000000102f064f8 datafusion_proto-89e1dd0bdfe0a085`std::sys::unix::thread::Thread::new::thread_start::h92ee0ad602ca1aab [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hcc97a7a0d1eb566a at boxed.rs:2000:9 [opt]
    frame #783: 0x0000000102f064f4 datafusion_proto-89e1dd0bdfe0a085`std::sys::unix::thread::Thread::new::thread_start::h92ee0ad602ca1aab at thread.rs:108:17 [opt]
    frame #784: 0x00000001860b626c libsystem_pthread.dylib`_pthread_start + 148

So this is not a bug in DataFusion? But I wonder why it passes for others.
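
To illustrate why a "deeply nested" expression implies deep recursion, here is a minimal sketch. The `Expr` enum and the `nested` and `depth` helpers are hypothetical stand-ins, not DataFusion's actual types or the code of the failing test. Each extra `AND` adds one nesting level, and a recursive walk needs one stack frame per level:

```rust
// Hypothetical stand-in for a logical expression; this is NOT DataFusion's `Expr`.
#[derive(Debug)]
enum Expr {
    Leaf,
    And(Box<Expr>, Box<Expr>),
}

/// Build `leaf AND leaf AND ...` with `n` AND nodes, nested to the left.
fn nested(n: usize) -> Expr {
    (0..n).fold(Expr::Leaf, |acc, _| {
        Expr::And(Box::new(acc), Box::new(Expr::Leaf))
    })
}

/// Recursive walk: the call stack grows by one frame per nesting level.
fn depth(e: &Expr) -> usize {
    match e {
        Expr::Leaf => 1,
        Expr::And(l, r) => 1 + depth(l).max(depth(r)),
    }
}

fn main() {
    for n in [1, 10, 100] {
        println!("n = {n:>3}, recursion depth = {}", depth(&nested(n)));
    }
}
```

A real serializer keeps far more state per frame than this toy `depth` function, so the stack consumed per level can be much larger in practice.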

@ozankabak
Contributor

Is it possible that 10 MB is not enough for that round trip test?
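
For reference, a minimal sketch of running a recursive workload on a thread with an explicit stack size. It assumes the test body runs on a spawned thread with roughly 10 MB of stack, as the question above and the `std::thread::Builder::spawn_unchecked_` frames in the backtrace suggest. `run_with_stack` and `sum_to` are hypothetical helpers, not code from the repository:

```rust
use std::thread;

/// Deliberately recursive toy workload (hypothetical, for illustration only).
fn sum_to(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        n + sum_to(n - 1)
    }
}

/// Run `f` on a new thread with `stack_bytes` of stack and return its result.
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("spawning thread")
        .join()
        .expect("joining thread")
}

fn main() {
    // 10_000_000 bytes is an assumed value mirroring the "10 MB" mentioned above.
    let total = run_with_stack(10_000_000, || sum_to(10_000));
    println!("{total}");
}
```

If 10 MB really is too small in some environments, raising the value passed to `stack_size` would be one experiment to try.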

@alamb
Contributor Author

alamb commented Feb 26, 2023

> So this is not a bug in DataFusion? But I wonder why it passes for others.

It is also very strange that this results in a SIGSEGV for you; normally I would expect a Rust panic when the stack is exhausted.

I wonder if you have some environment variable set (e.g. `RUSTFLAGS` or similar) that might be changing the behavior 🤔
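
One way to compare two environments is to print the variables that can influence how Rust code is built or how much stack default threads get. This is only a sketch: the list of names is my own selection (`RUSTFLAGS` and `CARGO_BUILD_RUSTFLAGS` are read by Cargo, `RUST_MIN_STACK` by the standard library), and `snapshot` is a hypothetical helper:

```rust
use std::env;

/// Variables worth comparing between environments (an assumed, non-exhaustive list).
const NAMES: [&str; 3] = ["RUSTFLAGS", "CARGO_BUILD_RUSTFLAGS", "RUST_MIN_STACK"];

/// Collect each variable's current value, or `None` if it is unset.
fn snapshot() -> Vec<(&'static str, Option<String>)> {
    NAMES.iter().map(|&name| (name, env::var(name).ok())).collect()
}

fn main() {
    for (name, value) in snapshot() {
        match value {
            Some(v) => println!("{name}={v}"),
            None => println!("{name} is not set"),
        }
    }
}
```

Running this both in the normal development shell and inside the environment the verify script creates would show whether any of these differ. Note that `RUST_MIN_STACK` only affects threads spawned without an explicit stack size.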

@waynexia
Member

> It is also very strange that this results in a SIGSEGV for you; normally I would expect a Rust panic when the stack is exhausted.

I'm unsure about the expected behavior, but the last time I ran into a stack overflow (in GreptimeTeam/greptimedb#734) I also got a SIGSEGV, although the stack was exhausted in a different way.

> I wonder if you have some environment variable set (e.g. `RUSTFLAGS` or similar) that might be changing the behavior 🤔

I don't set anything in `RUSTFLAGS`. Maybe this is controlled by something else? I don't remember adjusting anything special. However, this is (only) reproducible with the verify script. IIRC it creates a temporary, clean Rust environment for verification, right?

> Is it possible that 10 MB is not enough for that round trip test?

Judging from the result, I guess so. But when I run the test in my development environment everything goes well, and lldb doesn't report a SIGSEGV. I'm a bit confused 🧐
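
To get a rough feel for how much stack one recursion level costs in a given build, something like the sketch below could be run in both environments. It is an approximation only: `deepest_local_addr` and `approx_bytes_per_frame` are hypothetical helpers, the numbers depend on compiler version and optimization level, and they say nothing about the frame sizes of the real serialization code:

```rust
use std::hint::black_box;

/// Recurse `levels` more times and return the address of a local
/// variable that lives in the deepest frame.
#[inline(never)]
fn deepest_local_addr(levels: usize) -> usize {
    let marker = 0u8;
    // `black_box` forces `marker` to actually live on the stack.
    let here = black_box(&marker) as *const u8 as usize;
    if levels == 0 {
        here
    } else {
        // Wrapping the call keeps it out of tail position.
        black_box(deepest_local_addr(levels - 1))
    }
}

/// Approximate stack bytes consumed per recursion level of `deepest_local_addr`.
fn approx_bytes_per_frame(levels: usize) -> usize {
    let top = deepest_local_addr(0);
    let bottom = deepest_local_addr(levels);
    top.abs_diff(bottom) / levels
}

fn main() {
    println!("~{} bytes per frame", approx_bytes_per_frame(1_000));
}
```

Multiplying a per-frame estimate by the number of frames in the backtrace gives an order-of-magnitude figure to compare against the thread's stack size.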

@alamb
Contributor Author

alamb commented Feb 28, 2023

> IIRC it creates a temporary, clean Rust environment for verification, right?

Yes I believe it does

@alamb
Contributor Author

alamb commented Jan 11, 2024

Have we seen this issue recently? 🤔

@waynexia
Member

No, I haven't seen it in the past month. I think we can close this issue.
