Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Last-minute" responses must not race with timeouts callbacks. #26

Open
antoche opened this issue Nov 6, 2014 · 12 comments
Open

"Last-minute" responses must not race with timeouts callbacks. #26

antoche opened this issue Nov 6, 2014 · 12 comments
Assignees
Labels

Comments

@antoche
Copy link

antoche commented Nov 6, 2014

Unless we set a very high timeout (e.g., 50 seconds), we're hitting this assertion:

riak/request_with_timeout.cxx:44: void riak::request_with_timeout::on_response(std::error_code, size_t, const std::string&): Assertion `not succeeded_ and (!! terminate_request_)' failed.

Is there something we might be doing wrong, or is that a bug in riak-cpp?

@ajtack
Copy link
Owner

ajtack commented Nov 6, 2014

Hi there!

I'm afraid its likely a bug. I've been struggling to find a good first project that exercises the library seriously. A fifty second timeout is way too high -- something must be wrong.

Do you have a good test case I can look at?

-----Algne sõnum-----
Saatja: "Antoine Bouthors" [email protected]
Saadetud: ‎7.‎11.‎2014 0:02
Adressaat: "ajtack/riak-cpp" [email protected]
Teema: [riak-cpp] Assertion `not succeeded_ and (!! terminate_request_)'(#26)

Unless we set a very high timeout (e.g., 50 seconds), we're hitting this assertion:
riak/request_with_timeout.cxx:44: void riak::request_with_timeout::on_response(std::error_code, size_t, const std::string&): Assertion `not succeeded_ and (!! terminate_request_)' failed.
Is there something we might be doing wrong, or is that a bug in riak-cpp?

Reply to this email directly or view it on GitHub.=

@ajtack
Copy link
Owner

ajtack commented Nov 9, 2014

Thought about this a little more -- an assertion shouldn't fire unless there is a bug (some "insane" situation in the code, like a response handler is invoked twice etc. etc.).

I will start having a look and imagining how that assertion could fire, but it would really help me to know something about your use-case, @antoche.

@ajtack
Copy link
Owner

ajtack commented Nov 9, 2014

Alternatively -- do you have logging enabled? I can confirm by looking at your logs whether I'm fixing the right bug or not.

@ajtack ajtack added the bug label Nov 9, 2014
@ajtack ajtack self-assigned this Nov 9, 2014
@antoche
Copy link
Author

antoche commented Nov 9, 2014

I can't send any code as-is currently, but the use case is very simple. We
hit this assertion in our unit test, which simply does a bunch of put and
get on random data.

I could try to strip out the code to send you a test case, but that will
have to wait as I'm away for the next couple weeks.

We're stuck on boost 1.46 so logging is not enabled, but if there's a way
to hook a logging callback we could have libriak use our logger.
On 10/11/2014 3:40 am, "Andres Jaan Tack" [email protected] wrote:

Alternatively -- do you have logging enabled? I can confirm by looking at
your logs whether I'm fixing the right bug or not.


Reply to this email directly or view it on GitHub
#26 (comment).

ajtack added a commit that referenced this issue Nov 17, 2014
@ajtack
Copy link
Owner

ajtack commented Nov 17, 2014

Running theory is that the code is unprepared for a timeout+response race (both are scheduled essentially in parallel). The work above should help me prove this in a test.

@ajtack
Copy link
Owner

ajtack commented Nov 22, 2014

@antoche, my theory was absolutely valid, and in this branch I've committed a fix with tests and everything. Can you confirm with this branch that the issue is resolved? Let me know if you need a particular compiler/boost version to be worked up.

Thanks a ton for this report, by the way. This is quite a critical issue, and I'm glad to have gotten a push to address it.

@ajtack ajtack changed the title Assertion `not succeeded_ and (!! terminate_request_)' "Last-minute" responses must not race with timeouts callbacks. Nov 22, 2014
@antoche
Copy link
Author

antoche commented Nov 23, 2014

Thanks! We'll give it a try. We're on gcc-4.4+boost-1.46.

@ajtack
Copy link
Owner

ajtack commented Nov 24, 2014

Agh, so that particular variant isn't green yet. Should be an easy fix, but
I'm on vacation, so it may be a bit. I think I used a lamba somewhere,
which would be the problem.

2014-11-23 15:12 GMT-06:00 Antoine Bouthors [email protected]:

Thanks! We'll give it a try. We're on gcc-4.4+boost-1.46.


Reply to this email directly or view it on GitHub
#26 (comment).

@antoche
Copy link
Author

antoche commented Nov 24, 2014

Ok, we'll wait until it's ready

On Mon, Nov 24, 2014 at 3:18 PM, Andres Jaan Tack [email protected]
wrote:

Agh, so that particular variant isn't green yet. Should be an easy fix,
but
I'm on vacation, so it may be a bit. I think I used a lamba somewhere,
which would be the problem.

2014-11-23 15:12 GMT-06:00 Antoine Bouthors [email protected]:

Thanks! We'll give it a try. We're on gcc-4.4+boost-1.46.


Reply to this email directly or view it on GitHub
#26 (comment).


Reply to this email directly or view it on GitHub
#26 (comment).

@ajtack
Copy link
Owner

ajtack commented Nov 25, 2014

Branch is green for gcc4.4/boost1.46. Go for it. If it's fixed, this is a good reason for release of v0.2.1

@antoche
Copy link
Author

antoche commented Dec 2, 2014

Hi Andres,

With this version, we're hitting another assertion: riak/utility/boost_deadline_timer_factory.cxx:57: void riak::utility::<unnamed>::run_with_converted_error_code(const std::function<void(std::error_code)>&, const boost::system::error_code&): Assertionfalse' failed`.

Stacktrace:

#4  0x00007faacd803b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007faacd7f90ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007faacd7f9192 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00000000006fb508 in run_with_converted_error_code ()
#8  0x00000000006fd159 in boost::asio::detail::wait_handler<std::_Bind<void (*(std::function<void (std::error_code const&)>, std::_Placeholder<1>))(std::function<void (std::error_code)> const&, boost::system::error_code const&)> >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code, unsigned long) ()
#9  0x0000000000489147 in complete (owner=..., this=<optimized out>) at (...)/boost-1_46_1-1/include/boost/asio/detail/task_io_service_operation.hpp:36
#10 do_one (this_idle_thread=<optimized out>, lock=..., this=<optimized out>) at (...)/boost-1_46_1-1/include/boost/asio/detail/impl/task_io_service.ipp:279
#11 boost::asio::detail::task_io_service::run (this=<optimized out>, ec=...) at (...)/boost-1_46_1-1/include/boost/asio/detail/impl/task_io_service.ipp:131
#12 0x000000000048943b in boost::asio::io_service::run (this=0x2478be0) at (...)/boost-1_46_1-1/include/boost/asio/impl/io_service.ipp:58
#13 0x000000000060919a in thread_proxy ()
#14 0x00007faace3a2e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007faacd8bdcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()

This is on a very simple test that just runs put/get. This is how we construct the client:

boost::asio::io_service ios;
std::shared_ptr< ::riak::utility::timer_factory > timerFactory( new ::riak::utility::boost_deadline_timer_factory( ios ) );
auto connection = riak::transport::make_single_socket_transport( address, 8087, ios);
auto client = riak::client( connection, &noSiblingResolution, timerFactory );

I couldn't see in the docs how the client should be constructed now with the timer factory interface, so possibly there's something wrong in there?

The put and get are pretty much the same as what's in the examples.

@ajtack
Copy link
Owner

ajtack commented Dec 4, 2014

Ah, yes. :) Tried out all of the timeout cases and forgot to test (in real life) the case where you actually get something over the wire.

I gave it a shot last night, it got tricky. Will get back with a fix soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants