Fix event loop thread might exit unexpectedly #217

Merged: 1 commit into apache:main from bewaremypower/resolve-timeout, Mar 15, 2023

Conversation

BewareMyPower (Contributor):

Fixes #209

Motivation

When the event loop thread is started from ExecutorService::restart, there is a chance that io_service_.run(ec) returns immediately. In that case, the ClientConnection::resolver_ that was registered with the IO service could block forever in the async_resolve method.
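
For illustration only (this snippet is not from the repository; it is a minimal sketch of the underlying Boost.Asio behavior): when run() is called on an io_context that has no pending work, it returns at once, and the calling thread, here the event loop thread, exits.

```cpp
#include <boost/asio.hpp>
#include <iostream>

int main() {
    boost::asio::io_context io;  // stands in for the client's io_service_

    // No handlers have been posted yet, so run() returns immediately
    // instead of blocking; the thread that called it would then exit.
    boost::system::error_code ec;
    std::size_t handled = io.run(ec);
    std::cout << "run() executed " << handled << " handlers" << std::endl;

    // Any completion handler registered after this point (for example,
    // a resolver's async_resolve callback) is never invoked, so a
    // caller synchronously waiting on the result blocks forever.
    return 0;
}
```

This matches the report below: Client::createProducer blocks on Future::get because the resolve callback never fires.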

Modifications

In the event loop thread, if io_service::run was not stopped by ExecutorService::close, call restart and run again so that the event loop thread does not exit unexpectedly.
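
A minimal sketch of that retry loop; the function name runEventLoop and the flag closed_ (assumed to be set by ExecutorService::close) are illustrative assumptions, not the actual patch:

```cpp
// Hypothetical event-loop thread body; closed_ is an assumed
// std::atomic<bool> that ExecutorService::close() sets to true.
void runEventLoop() {
    while (!closed_) {
        boost::system::error_code ec;
        io_service_.run(ec);  // blocks while there is work to execute
        if (closed_) {
            break;  // run() ended because close() stopped the service
        }
        // run() returned only because the io_service ran out of work,
        // not because it was stopped; restart it and call run() again
        // so the event loop thread stays alive.
        io_service_.restart();
    }
}
```

With a loop like this, a handler posted after a spurious return of run() is picked up by the next run() call instead of being stranded.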

Run ConnectionFailTest independently with --gtest_repeat=20 to verify that it is no longer flaky.

BewareMyPower added the bug (Something isn't working) and flaky-test labels Mar 15, 2023
BewareMyPower added this to the 3.2.0 milestone Mar 15, 2023
BewareMyPower self-assigned this Mar 15, 2023
BewareMyPower (Contributor, Author):

The thread info before this patch, captured while the test is stuck: the main thread (1) is blocked in Future::get inside Client::createProducer, while the resolver's internal worker thread (2) is blocked inside Boost.Asio's scheduler:

(gdb) info threads
  Id   Target Id                                          Frame
* 1    Thread 0x7fb2859c50c0 (LWP 7394) "ConnectionFailT" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x7fb270038ba0) at ./nptl/futex-internal.c:57
  2    Thread 0x7fb2851b8640 (LWP 7651) "ConnectionFailT" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x55de8e4853c8) at ./nptl/futex-internal.c:57
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fb2859c50c0 (LWP 7394))]
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fb270038ba0) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) bt
<...>
#5  0x00007fb2874fab99 in pulsar::Future<pulsar::Result, pulsar::Producer>::get (this=0x7ffe339bc330, result=...) at /app/lib/Future.h:69
#6  0x00007fb2874f8892 in pulsar::Client::createProducer (this=0x7fb2700372c8, topic="test-connection-fail-51678865999", conf=..., producer=...)
    at /app/lib/Client.cc:55
<...>
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fb2851b8640 (LWP 7651))]
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55de8e4853c8) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55de8e4853c8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x55de8e4853c8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55de8e4853c8, expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fb286c2fac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55de8e485368, cond=0x55de8e4853a0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x55de8e4853a0, mutex=0x55de8e485368) at ./nptl/pthread_cond_wait.c:627
#5  0x00007fb28752be08 in boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock> (this=0x55de8e4853a0, lock=...)
    at /usr/include/boost/asio/detail/posix_event.hpp:119
#6  0x00007fb28751e700 in boost::asio::detail::conditionally_enabled_event::wait (this=0x55de8e485398, lock=...)
    at /usr/include/boost/asio/detail/conditionally_enabled_event.hpp:97
#7  0x00007fb28752035d in boost::asio::detail::scheduler::do_run_one (this=0x55de8e485330, lock=..., this_thread=..., ec=...)
    at /usr/include/boost/asio/detail/impl/scheduler.ipp:490
#8  0x00007fb28751fed6 in boost::asio::detail::scheduler::run (this=0x55de8e485330, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:204
#9  0x00007fb287523903 in boost::asio::detail::resolver_service_base::work_scheduler_runner::operator() (this=0x55de8e499508)
    at /usr/include/boost/asio/detail/impl/resolver_service_base.ipp:38
#10 0x00007fb28758de66 in boost::asio::detail::posix_thread::func<boost::asio::detail::resolver_service_base::work_scheduler_runner>::run (this=0x55de8e499500)
    at /usr/include/boost/asio/detail/posix_thread.hpp:86
#11 0x00007fb28751e908 in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0x55de8e499500)
    at /usr/include/boost/asio/detail/impl/posix_thread.ipp:74
#12 0x00007fb286c30b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00007fb286cc1bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

shibd merged commit e2de5fc into apache:main Mar 15, 2023
BewareMyPower deleted the bewaremypower/resolve-timeout branch March 15, 2023 10:36
Merging this pull request closes: [Flaky Test] ConnectionFailTest (#209)