Intermittent segfault on program exit #3441

Closed
2 of 3 tasks
eloff opened this issue Dec 4, 2022 · 20 comments

@eloff

eloff commented Dec 4, 2022

I'm getting an intermittent segfault on program exit. It appears to involve a Postgres DB connection in a scheduled-thread-pool thread.

Setup

Versions

  • Rust:
    rustc 1.65.0 (897e37553 2022-11-02)
  • Diesel:
    2.0.2
  • Database:
    (PostgreSQL) 15.1 (Ubuntu 15.1-1.pgdg20.04+1)
  • Operating System
    Ubuntu 20.04 5.15.0-56-generic

Feature Flags

  • diesel:
    diesel = { version="2.0.2", features=["postgres","r2d2","uuid","chrono"] }

Problem Description

Segfault in a scheduled-thread-pool thread, in some connection run() method. I'm working on creating a less intermittent reproduction that I can share.

What are you trying to accomplish?

What is the expected output?

What is the actual output?

Are you seeing any additional errors?

Steps to reproduce

git clone https://github.com/eloff/diesel-segfault
cargo run
repeat until segfault

Checklist

  • This issue can be reproduced on Rust's stable channel. (Your issue will be
    closed if this is not the case)
  • This issue can be reproduced without requiring a third party crate
@eloff eloff added the bug label Dec 4, 2022
@weiznich
Member

weiznich commented Dec 4, 2022

Please explain why you believe that this is a diesel issue and not an issue in one of the underlying dependencies, especially as the segfault points to a different crate (scheduled-thread-pool, which is an r2d2 dependency). In addition, you seem to link libpq statically, which is not officially supported by newer postgres versions.

@eloff
Author

eloff commented Dec 4, 2022

It could be an r2d2 issue. It's diesel::r2d2, so I reported it here. I'll open a ticket on the r2d2 repo.

I dropped the static linking, it still happens.

@weiznich
Member

weiznich commented Dec 4, 2022

Can you provide a self contained docker image where the issue happens?

@eloff
Author

eloff commented Dec 4, 2022

I added a Dockerfile and docker-compose, but now I can't reproduce the segfault. That means somehow the issue is related to my local environment. How does diesel choose what libpq to link with? Maybe it's using a different version on my system versus in the Dockerfile.

I tried installing libpq-dev on my local machine the same way as in the docker container, but it didn't change anything.

@eloff
Author

eloff commented Dec 4, 2022

sfackler from r2d2 responded here: sfackler/r2d2#137

There's no unsafe code in r2d2, so this is definitely a diesel issue.

Note that the crash doesn't happen if I comment out either of the min_idle or max_size lines:

Pool::builder()
        .min_idle(Some(10))
        .max_size(96)
        .build(manager)
        .expect("Failed to create pool.")

@weiznich
Copy link
Member

weiznich commented Dec 4, 2022

I've tried to reproduce that locally with the provided code, and I cannot reproduce this issue after a few hundred runs. It seems this really depends on your environment. Again, it would be really helpful to reproduce it in a docker container.
Can you try to provide a gdb stacktrace of such a crash?

Also as you seem to be sure that this is a diesel issue: Please explain why you believe this is the case, otherwise this can also be caused by libpq.

> How does diesel choose what libpq to link with? Maybe it's using a different version on my system versus in the Dockerfile.

If you link libpq dynamically you can just use ldd your/binary to see which libpq is linked.

@eloff
Author

eloff commented Dec 4, 2022

I would wager on it being an issue in one of the C dependencies of diesel. It's just a broader surface area for crashes.

https://raw.githubusercontent.com/eloff/diesel-segfault/main/coredump.png
https://raw.githubusercontent.com/eloff/diesel-segfault/main/thread-1.png (thread that crashed)
https://raw.githubusercontent.com/eloff/diesel-segfault/main/thread-2.png
https://raw.githubusercontent.com/eloff/diesel-segfault/main/thread-3.png

ldd target/debug/acme

        linux-vdso.so.1 (0x00007ffc07588000)
        libpq.so.5 => /lib/x86_64-linux-gnu/libpq.so.5 (0x00007fe678203000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe6781e8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe6781c5000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe678076000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe678070000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe677e7e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe6783aa000)
        libssl.so.1.1 => /usr/local/lib/libssl.so.1.1 (0x00007fe677cf2000)
        libcrypto.so.1.1 => /usr/local/lib/libcrypto.so.1.1 (0x00007fe677964000)
        libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007fe677917000)
        libldap_r-2.4.so.2 => /lib/x86_64-linux-gnu/libldap_r-2.4.so.2 (0x00007fe6778c1000)
        libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007fe6777e4000)
        libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007fe6777b1000)
        libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007fe6777aa000)
        libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007fe67779b000)
        liblber-2.4.so.2 => /lib/x86_64-linux-gnu/liblber-2.4.so.2 (0x00007fe67778a000)
        libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fe67776e000)
        libsasl2.so.2 => /lib/x86_64-linux-gnu/libsasl2.so.2 (0x00007fe677751000)
        libgssapi.so.3 => /lib/x86_64-linux-gnu/libgssapi.so.3 (0x00007fe67770a000)
        libgnutls.so.30 => /lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007fe677534000)
        libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007fe67752d000)
        libheimntlm.so.0 => /lib/x86_64-linux-gnu/libheimntlm.so.0 (0x00007fe677521000)
        libkrb5.so.26 => /lib/x86_64-linux-gnu/libkrb5.so.26 (0x00007fe67748e000)
        libasn1.so.8 => /lib/x86_64-linux-gnu/libasn1.so.8 (0x00007fe6773e5000)
        libhcrypto.so.4 => /lib/x86_64-linux-gnu/libhcrypto.so.4 (0x00007fe6773ad000)
        libroken.so.18 => /lib/x86_64-linux-gnu/libroken.so.18 (0x00007fe677394000)
        libp11-kit.so.0 => /lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007fe67725e000)
        libidn2.so.0 => /lib/x86_64-linux-gnu/libidn2.so.0 (0x00007fe67723d000)
        libunistring.so.2 => /lib/x86_64-linux-gnu/libunistring.so.2 (0x00007fe6770bb000)
        libtasn1.so.6 => /lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007fe6770a3000)
        libnettle.so.7 => /lib/x86_64-linux-gnu/libnettle.so.7 (0x00007fe677069000)
        libhogweed.so.5 => /lib/x86_64-linux-gnu/libhogweed.so.5 (0x00007fe677032000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fe676fae000)
        libwind.so.0 => /lib/x86_64-linux-gnu/libwind.so.0 (0x00007fe676f84000)
        libheimbase.so.1 => /lib/x86_64-linux-gnu/libheimbase.so.1 (0x00007fe676f72000)
        libhx509.so.5 => /lib/x86_64-linux-gnu/libhx509.so.5 (0x00007fe676f22000)
        libsqlite3.so.0 => /lib/x86_64-linux-gnu/libsqlite3.so.0 (0x00007fe676df9000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fe676dbe000)
        libffi.so.7 => /lib/x86_64-linux-gnu/libffi.so.7 (0x00007fe676db2000)

@eloff
Author

eloff commented Dec 4, 2022

It looks like an openssl issue:

* thread #2, name = 'r2d2-worker-0', stop reason = signal SIGSEGV: invalid address (fault address: 0x18)
  * frame #0: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock at pthread_rwlock_common.c:604:7
    frame #1: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock(rwlock=0x0000000000000000) at pthread_rwlock_wrlock.c:27
    frame #2: 0x00007ffff7867889 libcrypto.so.1.1`CRYPTO_THREAD_write_lock(lock=<unavailable>) at threads_pthread.c:78
    frame #3: 0x00007ffff7839532 libcrypto.so.1.1`RAND_get_rand_method at rand_lib.c:849
    frame #4: 0x00007ffff78399f9 libcrypto.so.1.1`RAND_status at rand_lib.c:958
    frame #5: 0x00007ffff7f7f545 libpq.so.5`___lldb_unnamed_symbol294$$libpq.so.5 + 37
    frame #6: 0x00007ffff7f5ef2e libpq.so.5`___lldb_unnamed_symbol8$$libpq.so.5 + 142
    frame #7: 0x00007ffff7f76be1 libpq.so.5`___lldb_unnamed_symbol139$$libpq.so.5 + 1873
    frame #8: 0x00007ffff7f6362a libpq.so.5`PQconnectPoll + 3578
    frame #9: 0x00007ffff7f64797 libpq.so.5`___lldb_unnamed_symbol35$$libpq.so.5 + 279
    frame #10: 0x00007ffff7f679e8 libpq.so.5`PQconnectdb + 56
    frame #11: 0x00005555555c78ff acme`diesel::pg::connection::raw::RawConnection::establish::hf66f33a54d350dd8(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at raw.rs:23:39
    frame #12: 0x00005555555cfbf1 acme`_$LT$diesel..pg..connection..PgConnection$u20$as$u20$diesel..connection..Connection$GT$::establish::h55fc3bc2dc70ac2f(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at mod.rs:170:9
    frame #13: 0x000055555557a8aa acme`_$LT$diesel..r2d2..ConnectionManager$LT$T$GT$$u20$as$u20$r2d2..ManageConnection$GT$::connect::hb76e3f16e2530650(self=0x00005555556a0888) at r2d2.rs:192:9
    frame #14: 0x0000555555578e5c acme`r2d2::add_connection::inner::_$u7b$$u7b$closure$u7d$$u7d$::he22fd9af265b9649 at lib.rs:241:24
    frame #15: 0x000055555557a608 acme`scheduled_thread_pool::thunk::Thunk$LT$$LP$$RP$$C$R$GT$::new::_$u7b$$u7b$closure$u7d$$u7d$::hde78cba3c8d5d267((null)=<unavailable>) at thunk.rs:20:35
    frame #16: 0x000055555557a64d acme`_$LT$F$u20$as$u20$scheduled_thread_pool..thunk..Invoke$LT$A$C$R$GT$$GT$::invoke::h6651609b8afd4a0d(self=0x0000555555685680, arg=<unavailable>) at thunk.rs:50:9
    frame #17: 0x00005555555de540 acme`scheduled_thread_pool::thunk::Thunk$LT$A$C$R$GT$::invoke::h0d3f0d3691c28468(self=Thunk<(), ()> @ 0x00007ffff6af7ee0, arg=<unavailable>) at thunk.rs:35:9
    frame #18: 0x00005555555e1007 acme`scheduled_thread_pool::Worker::run_job::h204e5ba542e425b0(self=0x00007ffff6af8940, job=Job @ 0x00007ffff6af83b8) at lib.rs:364:33
    frame #19: 0x00005555555e0a82 acme`scheduled_thread_pool::Worker::run::_$u7b$$u7b$closure$u7d$$u7d$::h9f5de07f20041366 at lib.rs:326:61
    frame #20: 0x00005555555dd750 acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h922b39ce782e6da9(self=<unavailable>, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #21: 0x00005555555e6efa acme`std::panicking::try::do_call::h353343075a44a7a2(data="\b\x88\xaf���) at panicking.rs:492:40
    frame #22: 0x00005555555e7b5b acme`__rust_try + 27
    frame #23: 0x00005555555e6ce6 acme`std::panicking::try::hbd1af4a4897eae11(f=<unavailable>) at panicking.rs:456:19
    frame #24: 0x00005555555dcba1 acme`std::panic::catch_unwind::h8487ef104e1ed0b7(f=<unavailable>) at panic.rs:137:14
    frame #25: 0x00005555555e0a0d acme`scheduled_thread_pool::Worker::run::h0b0fddce95686635(self=0x00007ffff6af8940) at lib.rs:326:21
    frame #26: 0x00005555555e08b0 acme`scheduled_thread_pool::Worker::start::_$u7b$$u7b$closure$u7d$$u7d$::h685180c0a62bc588 at lib.rs:320:30
    frame #27: 0x00005555555dcb6e acme`std::sys_common::backtrace::__rust_begin_short_backtrace::h148f92b33cf1c7fb(f={closure_env#0} @ 0x00007ffff6af8968) at backtrace.rs:122:18
    frame #28: 0x00005555555eb49b acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h6c24d17a05224da0 at mod.rs:514:17
    frame #29: 0x00005555555dd71f acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h329ec1e2ffc93f6a(self=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af89b0, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #30: 0x00005555555e6f61 acme`std::panicking::try::do_call::h7ad9760e4b97e4ac(data="�,jUUU") at panicking.rs:492:40
    frame #31: 0x00005555555e7b5b acme`__rust_try + 27
    frame #32: 0x00005555555e6d8f acme`std::panicking::try::hf0bb4c1fa863e31d(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8a80) at panicking.rs:456:19
    frame #33: 0x00005555555dcbcf acme`std::panic::catch_unwind::h86a8b4bcde1746d1(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8ad0) at panic.rs:137:14
    frame #34: 0x00005555555eb26d acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h2c1a5cd6dd347e00 at mod.rs:513:30
    frame #35: 0x00005555555e482f acme`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h662c6075d9edf8bc((null)=0x00005555556a0990, (null)=<unavailable>) at function.rs:248:5
    frame #36: 0x0000555555623c63 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h49f797984e2121bf at boxed.rs:1940:9
    frame #37: 0x0000555555623c5d acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hfa4f3d0ee6440e0b at boxed.rs:1940
    frame #38: 0x0000555555623c56 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc at thread.rs:108
    frame #39: 0x00007ffff7f1c609 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:477:8
    frame #40: 0x00007ffff7cec133 libc.so.6`__clone at clone.S:95
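
A backtrace this long is easier to read once it is reduced to the chain of modules involved. The following is a minimal sketch, not part of the original report, that lists the distinct modules of an lldb backtrace, innermost frame first. The function names `frame_module` and `modules_in_order` are hypothetical, and the parsing assumes the `frame #N: 0xADDR module` + backtick + `symbol` layout seen above.

```rust
/// Extracts the module name from one lldb frame line, e.g.
/// "frame #2: 0x00007ffff7867889 libcrypto.so.1.1`CRYPTO_THREAD_write_lock(...)"
/// yields "libcrypto.so.1.1". Lines that are not frame lines yield None.
fn frame_module(line: &str) -> Option<&str> {
    let start = line.find("frame #")?;
    // Tokens: "frame", "#N:", "0xADDR", "module`symbol", ...
    let token = line[start..].split_whitespace().nth(3)?;
    let (module, _symbol) = token.split_once('`')?;
    Some(module)
}

/// Returns each module once, in the order it first appears in the backtrace
/// (innermost frame first).
fn modules_in_order(backtrace: &str) -> Vec<&str> {
    let mut seen: Vec<&str> = Vec::new();
    for module in backtrace.lines().filter_map(frame_module) {
        if !seen.contains(&module) {
            seen.push(module);
        }
    }
    seen
}
```

Applied to the trace above, this should give libpthread.so.0, libcrypto.so.1.1, libpq.so.5, acme, libc.so.6. Read outermost to innermost: the application calls libpq, libpq calls into OpenSSL's libcrypto, and the fault occurs when libcrypto takes a lock with a null rwlock pointer.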

@eloff
Author

eloff commented Dec 4, 2022

How stable is https://github.com/weiznich/diesel_async?

That could be one way to side-step the issue.

@weiznich
Member

weiznich commented Dec 5, 2022

Yes the backtrace indicates that this is openssl related. Might be related to #813

Please double check that your libpq uses the same openssl version as other parts of your application.
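
One rough way to start that check is to look at where the dynamic linker resolves libssl and libcrypto relative to libpq. The sketch below is illustrative only: it parses `ldd` output like the listing posted earlier and reports OpenSSL libraries that resolve from a different directory than libpq. The names `Linked`, `parse_ldd` and `openssl_outside_libpq_dir` are hypothetical. A different directory is only a hint that two OpenSSL installations may be in play, not proof of a version mismatch.

```rust
use std::path::Path;

/// One resolved entry from `ldd` output: the soname and the path it resolved to.
#[derive(Debug)]
struct Linked {
    soname: String,
    path: String,
}

/// Parses lines of the form "libpq.so.5 => /lib/.../libpq.so.5 (0x...)".
/// Lines without " => " (e.g. linux-vdso) are skipped.
fn parse_ldd(output: &str) -> Vec<Linked> {
    output
        .lines()
        .filter_map(|line| {
            let (soname, rest) = line.trim().split_once(" => ")?;
            // Drop the trailing load address, e.g. " (0x00007fe678203000)".
            let path = rest.rsplit_once(" (").map_or(rest, |(p, _)| p);
            Some(Linked {
                soname: soname.to_string(),
                path: path.to_string(),
            })
        })
        .collect()
}

/// Returns the libssl/libcrypto entries whose directory differs from the
/// directory libpq was resolved from. If libpq is not in the list, every
/// OpenSSL entry is returned.
fn openssl_outside_libpq_dir(libs: &[Linked]) -> Vec<&Linked> {
    let libpq_dir = libs
        .iter()
        .find(|l| l.soname.starts_with("libpq."))
        .and_then(|l| Path::new(&l.path).parent());
    libs.iter()
        .filter(|l| l.soname.starts_with("libssl.") || l.soname.starts_with("libcrypto."))
        .filter(|l| Path::new(&l.path).parent() != libpq_dir)
        .collect()
}
```

For the `ldd target/debug/acme` output posted above, this would flag libssl.so.1.1 and libcrypto.so.1.1, which resolve from /usr/local/lib while libpq.so.5 resolves from /lib/x86_64-linux-gnu.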

> How stable is https://github.com/weiznich/diesel_async?

I consider that crate experimental.

@eloff
Author

eloff commented Dec 6, 2022

Bringing in the latest openssl crate (0.10.43) for rust as an explicit dependency, and calling openssl::init() at the top of main() to disable the openssl atexit handler fixes the crash. Maybe that's something diesel could do?

This recent commit to rust-openssl is the one that fixes it: https://github.com/sfackler/rust-openssl/pull/1649/files#r1040260747. I guess I have openssl 111b on my machine. I checked under the debugger that this is the line that gets executed and disables the atexit handler.

@weiznich
Member

weiznich commented Dec 6, 2022

I'm happy to hear that you figured out a solution for this problem 👍 .

I don't think it would be a good idea to add an explicit call to openssl::init to diesel, because of:

  • libpq can be built without ssl support
  • libpq can link to a different openssl installation than the one used by the openssl crate.
  • The postgres documentation states that libpq internally initializes openssl on its own if we do not call one of the listed functions (which diesel does not call).

I would like to close this issue as an environment-specific issue.

@eloff
Author

eloff commented Dec 7, 2022

Fair enough. I think if others have this problem, they'll find this GitHub issue and can follow the same steps to resolve it (add openssl as an explicit dependency, call openssl::init() at the top of main()).
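
The steps above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the openssl crate (0.10.43 or newer, as reported above) is declared as an optional dependency, and both the cargo feature name `openssl-init` and the function name `init_tls_early` are hypothetical. The feature gate reflects the caveat that libpq can be built without SSL support. A project that always links OpenSSL could call openssl::init() unconditionally instead.

```rust
/// Initializes OpenSSL through the `openssl` crate when the (hypothetical)
/// `openssl-init` cargo feature is enabled. Returns whether the call was
/// compiled in.
///
/// Assumed Cargo.toml entries (not taken from the thread):
///   openssl = { version = "0.10.43", optional = true }
///   [features] openssl-init = ["dep:openssl"]
fn init_tls_early() -> bool {
    // Per the report above, on new enough OpenSSL versions this call is what
    // disables OpenSSL's atexit handler.
    #[cfg(feature = "openssl-init")]
    openssl::init();

    cfg!(feature = "openssl-init")
}
```

The call belongs in the first statement of main(), before Pool::builder() runs, so that no r2d2 worker thread reaches libpq before OpenSSL is initialized. As a later comment in this thread notes, this may not help if the openssl crate and libpq end up using different OpenSSL builds.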

@weiznich
Member

weiznich commented Dec 8, 2022

Closed as this is not considered to be an issue in diesel itself. I would consider that to be a libpq issue, so it might be worth reporting this upstream?

That said: I would really like to see a pure rust postgres connection implementation as a third party crate extending diesel. If someone is interested in working on and maintaining such a crate, please reach out. I would estimate that this can be done in 500-1000 lines of code by wrapping the existing rust-postgres crate and reusing some things that already exist in diesel.

@weiznich weiznich closed this as not planned Dec 8, 2022
@thomasmost
Contributor

thomasmost commented Apr 28, 2023

Just wanted to post here and note that we've been experiencing this issue as an intermittent failure in our integration tests for months. Just today, we isolated it to Diesel... then discovered this issue. @eloff I will buy you a drink if you are ever passing through NYC.

As an additional data point, we use MySQL—we've never experienced it in our production Docker container (based on debian:bullseye-slim), but we reproduced in both GitHub's hosted ubuntu image and on local Windows and MacOS machines.

Thank you again 🙏🏼 Cheers!

@weiznich
Member

weiznich commented May 2, 2023

@thomasmost If your case is mysql related that's likely not related to this issue. If you have a minimal reproducible example, please open a new issue containing that example so that we can track down the actual issue.

@thomasmost
Contributor

The openssl::init fix seems to solve it though... Are you sure?

@weiznich
Member

weiznich commented May 3, 2023

@thomasmost Yes it would be good to have. In the worst case it's only for documentation purposes, but even there it might be helpful.

@thomasmost
Contributor

Okay I'll work on an MVR

@wojciech-graj

wojciech-graj commented Nov 2, 2023

I had the exact same problem, but in my case initializing ssl before the failing code didn't fix it. If this is also the case for anyone else, check if you're statically linking openssl by setting the vendored cargo feature, and if you are, disable this. It seems like there can be a version incompatibility between libpq's openssl and the one embedded in your binary.
