POST throws timeout error, even though it gets response #1215

Closed · mieubrisse opened this issue Mar 11, 2021 · 23 comments

@mieubrisse commented Mar 11, 2021

I'm seeing something that has me very confused. I'm making this request:

            match self
                .client    // A reqwest::blocking::Client
                .post(&self.url)
                .header(CONTENT_TYPE, "application/json")
                .body(request_json.to_string())
                .send()

The server receives the request, and replies as seen in the server logs:

[2021-03-11T01:19:34.370440100Z TRACE jsonrpc_core::io] Request: {"id":1,"jsonrpc":"2.0","method":"getSlot","params":[{"commitment":"confirmed"}]}.
[2021-03-11T01:19:34.370516100Z DEBUG jsonrpc_core::io] Response: {"jsonrpc":"2.0","result":73,"id":1}.

This response gets registered in the client's logs for the request:

2021-03-11T01:19:34.369761900+00:00 [DEBUG] (12) hyper::client::pool: reuse idle connection for ("http", 172.23.0.12:8899)
2021-03-11T01:19:34.370162600+00:00 [DEBUG] (12) hyper::proto::h1::io: flushed 189 bytes
2021-03-11T01:19:34.371146500+00:00 [DEBUG] (12) hyper::proto::h1::io: parsed 3 headers
2021-03-11T01:19:34.371245700+00:00 [DEBUG] (12) hyper::proto::h1::conn: incoming body is content-length (37 bytes)
2021-03-11T01:19:34.371366300+00:00 [DEBUG] (12) hyper::proto::h1::conn: incoming body completed
2021-03-11T01:19:34.371491600+00:00 [DEBUG] (12) hyper::client::pool: pooling idle connection for ("http", 172.23.0.12:8899)
2021-03-11T01:19:34.371842600+00:00 [DEBUG] (12) reqwest::async_impl::client: response '200 OK' for http://172.23.0.12:8899/

and yet, that code returns an Err:

    ....
    8: error sending request for url (http://172.23.0.12:8899/): operation timed out
    9: operation timed out   // Bottom of stacktrace

How can this be?

@seanmonstar (Owner) commented:

Hm, that does seem pretty confusing! Is it with those small bodies, or have they been clipped for this report? Do you set a timeout on the client builder?

@mieubrisse (Author) commented Mar 11, 2021

Thanks for the quick response @seanmonstar ! Yep, it's always the same small body, {"id":1,"jsonrpc":"2.0","method":"getSlot","params":[{"commitment":"confirmed"}]}. The timeout is configured to 30s in the containing class:

const REQUEST_TIMEOUT: Duration = Duration::from_secs(30);

pub struct HttpSender {
    client: Client,
    url: String,
}

impl HttpSender {
    pub fn new(url: String) -> Self {
        Self::new_with_timeout(url, REQUEST_TIMEOUT)
    }

    pub fn new_with_timeout(url: String, timeout: Duration) -> Self {
        let client = Client::builder()
            .timeout(timeout)
            .build()
            .expect("build rpc client");

        Self { client, url }
    }
}

@mieubrisse (Author) commented Mar 11, 2021

The timeout does seem to be getting hit though - the "200 OK" line shows up exactly 30 seconds before the outer ERROR logline that indicates the error bubbling up:

2021-03-11T01:42:32.374972200+00:00 [DEBUG] (1) testsuite::testsuite_impl::network_partition_test: Getting current slot for bootstrapper 4...
2021-03-11T01:42:32.375125600+00:00 [DEBUG] (14) hyper::client::pool: reuse idle connection for ("http", 172.23.0.14:8899)
2021-03-11T01:42:32.375447800+00:00 [DEBUG] (14) hyper::proto::h1::io: flushed 189 bytes
2021-03-11T01:42:32.376161400+00:00 [DEBUG] (14) hyper::proto::h1::io: parsed 3 headers
2021-03-11T01:42:32.376343100+00:00 [DEBUG] (14) hyper::proto::h1::conn: incoming body is content-length (37 bytes)
2021-03-11T01:42:32.376526500+00:00 [DEBUG] (14) hyper::proto::h1::conn: incoming body completed
2021-03-11T01:42:32.376948200+00:00 [DEBUG] (14) hyper::client::pool: pooling idle connection for ("http", 172.23.0.14:8899)
2021-03-11T01:42:32.377083600+00:00 [DEBUG] (14) reqwest::async_impl::client: response '200 OK' for http://172.23.0.14:8899/
2021-03-11T01:43:02.351775400+00:00 [ERROR] An error occurred running the test suite executor:
2021-03-11T01:43:02.352409600+00:00 [DEBUG] (4) h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2021-03-11T01:43:02.355093400+00:00 [DEBUG] (4) h2::proto::connection: Connection::poll; connection error error=NO_ERROR
Error: An error occurred running the test suite executor

Caused by:
    0: An error occurred running the test execution flow
    1: An error occurred setting up & executing test 'networkPartitionTest'
    2: An error occurred executing the test
    3: An error occurred while waiting for the cluster slots to start advancing again
    4: An error occurred getting the current confirmed slots for the bootstrappers
    5: An error occurred getting the current confirmed slot for bootstrapper 4
    6: An error occurred getting the confirmed slot
    7: An error occurred sending the request
    8: error sending request for url (http://172.23.0.14:8899/): operation timed out
    9: operation timed out

One thing worth noting - I'm making a bunch of JSON RPC requests in sequence (synchronously) against the IPs 172.23.0.6, 8, 10, 12, and 14, but it doesn't always fail on the same one. E.g. in this stacktrace that I took just now it was node 172.23.0.14, whereas in the stacktrace above it was 172.23.0.12.

@seanmonstar (Owner) commented:

Huh, so what returns the actual timeout error? I'm guessing in your code you've received the Response, and then after that you try to read the body, and that call is what hangs and then times out?

@mieubrisse (Author) commented Mar 11, 2021

This is the full match statement:

            match self
                .client
                .post(&self.url)
                .header(CONTENT_TYPE, "application/json")
                .body(request_json.to_string())
                .send()
            {
                Ok(response) => {
                    if !response.status().is_success() {
                        if response.status() == StatusCode::TOO_MANY_REQUESTS
                            && too_many_requests_retries > 0
                        {
                            too_many_requests_retries -= 1;
                            debug!(
                                "Server responded with {:?}, {} retries left",
                                response, too_many_requests_retries
                            );

                            // Sleep for 500ms to give the server a break
                            sleep(Duration::from_millis(500));
                            continue;
                        }
                        return Err(response.error_for_status().unwrap_err().into());
                    }

                    let json: Value = serde_json::from_str(&response.text()?)?;
                    if json["error"].is_object() {
                        return Err(anyhow!(
                            "An error occurred making the JSON RPC request: {}",
                            json["error"].clone(),
                        ));
                    }
                    return Ok(json["result"].clone());
                }
                Err(err) => {
                    return Err(err.into());
                }
            }

AFAICT it's the actual send() that goes straight into the outer Err arm, because I don't see any of the behaviour from the Ok arm (though I'll add some extra debug loglines to verify).

@mieubrisse (Author) commented:

One other thing - I hit a very similar issue with tonic; might be related: hyperium/tonic#575

@mieubrisse (Author) commented Mar 11, 2021

Yep, just confirmed with extra debugging @seanmonstar - it's the outermost

                Err(err) => {
                    return Err(err.into());
                }

arm that's getting hit; the error is getting thrown by this piece of code:

                .client
                .post(&self.url)
                .header(CONTENT_TYPE, "application/json")
                .body(request_json.to_string())
                .send()

It's not even making it to the body-reading Ok() arm.

@seanmonstar (Owner) commented:

Hm, you mention tonic... are you using this in an async context? I mean, if you were, I'd expect you'd have hit the debug_assert trying to catch that, but if it didn't trigger, it's really easy for that mixture to cause a deadlock...

@mieubrisse (Author) commented:

I'll have to beg some ignorance here as I'm a bit of a Rust newb - I have #[tokio::main] on my main function; does that make it async?

@mieubrisse (Author) commented:

And actually, this reminds me that using prost-generated client code elsewhere in the codebase (unrelated to the reqwest call above) wouldn't work without that, which I think means that it is an async context(?)

@seanmonstar (Owner) commented:

Yes it does (you had to make it async fn main). Can you switch to using the async reqwest client?
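
For reference, a minimal sketch of what the switch could look like, assuming the calling function can be made async - the hypothetical send_request name is made up, while request_json and CONTENT_TYPE are carried over from the snippets above, and error handling is simplified:

    use reqwest::header::CONTENT_TYPE;

    // Sketch only: the same POST as above, but driven by .await instead of a
    // blocking send(). The client is assumed to be the same one built with
    // Client::builder().timeout(Duration::from_secs(30)).build().
    async fn send_request(
        client: &reqwest::Client,
        url: &str,
        request_json: &str,
    ) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
        let response = client
            .post(url)
            .header(CONTENT_TYPE, "application/json")
            .body(request_json.to_string())
            .send()
            .await?; // no block_on anywhere; the runtime drives the future

        let json: serde_json::Value = serde_json::from_str(&response.text().await?)?;
        Ok(json["result"].clone())
    }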

@mieubrisse (Author) commented:

Will give that a go! Question though - what's actually happening that causes this behaviour? It seems like in both the tonic case and this one, once I make a certain number of requests things freeze up. Before that point, everything works fine.

@mieubrisse (Author) commented Mar 11, 2021

And actually a follow-on question: I control the HttpSender / blocking Client for now due to a hackaround I'm doing, but that code actually comes from a library I don't control, and eventually it will have to come from that library again. This would mean that my tonic gRPC clients are async, but this library is sync. How do I resolve that discrepancy?

@mieubrisse (Author) commented Mar 11, 2021

@seanmonstar I switched to using the async client and it sometimes works, but now I'm seeing exactly the same behaviour as in hyperium/tonic#575 - the request gets received, but then h2 holds on to it forever:

2021-03-11T03:09:22.545098300+00:00 [DEBUG] (1) hyper::client::pool: reuse idle connection for ("http", 172.23.0.14:8899)
2021-03-11T03:09:22.545273800+00:00 [DEBUG] (3) hyper::proto::h1::io: flushed 189 bytes
2021-03-11T03:09:22.545723+00:00 [DEBUG] (4) hyper::proto::h1::io: parsed 3 headers
2021-03-11T03:09:22.546106+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body is content-length (37 bytes)
2021-03-11T03:09:22.546312900+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body completed
2021-03-11T03:10:59.000939100+00:00 [DEBUG] (3) h2::codec::framed_read: received frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(2147483647) }
2021-03-11T03:10:59.001087200+00:00 [DEBUG] (3) h2::codec::framed_read: received frame=Ping { ack: false, payload: [1, 6, 1, 8, 0, 3, 3, 9] }
2021-03-11T03:10:59.001290300+00:00 [DEBUG] (3) h2::codec::framed_write: send frame=Ping { ack: true, payload: [1, 6, 1, 8, 0, 3, 3, 9] }
2021-03-11T03:10:59.001711100+00:00 [DEBUG] (3) h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2021-03-11T03:10:59.001799600+00:00 [DEBUG] (3) h2::proto::connection: Connection::poll; connection error error=NO_ERROR

That minute-and-a-half gap between 09:22 and 10:59 seems to be a perpetual hang inside h2 or somewhere nearby - the GoAway that finally unblocks it comes from a timeout in my code that shuts down the server the client is talking to. If I extend that timeout (below), h2 never returns; the gap between the last message and the GoAway just grows, and my code never gets control back:

2021-03-11T03:41:43.354133+00:00 [DEBUG] (1) hyper::client::pool: reuse idle connection for ("http", 172.23.0.10:8899)
2021-03-11T03:41:43.354525500+00:00 [DEBUG] (4) hyper::proto::h1::io: flushed 189 bytes
2021-03-11T03:41:43.355227800+00:00 [DEBUG] (4) hyper::proto::h1::io: parsed 3 headers
2021-03-11T03:41:43.355398+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body is content-length (37 bytes)
2021-03-11T03:41:43.356124700+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body completed
2021-03-11T03:41:43.356769800+00:00 [DEBUG] (4) hyper::client::pool: pooling idle connection for ("http", 172.23.0.10:8899)
2021-03-11T03:41:43.357463600+00:00 [DEBUG] (1) reqwest::async_impl::client: response '200 OK' for http://172.23.0.10:8899/
2021-03-11T03:41:43.357518800+00:00 [DEBUG] (1) testsuite::services_impl::validator::http_sender: Got OK response from send...
2021-03-11T03:41:43.357538100+00:00 [DEBUG] (1) testsuite::services_impl::validator::http_sender: About to read response body...
2021-03-11T03:41:43.357932700+00:00 [DEBUG] (1) testsuite::testsuite_impl::network_partition_test: Bootstrapper 2's current slot is 74
2021-03-11T03:41:43.357959800+00:00 [DEBUG] (1) testsuite::testsuite_impl::network_partition_test: Getting current slot for bootstrapper 3...
2021-03-11T03:41:43.358104900+00:00 [DEBUG] (1) hyper::client::pool: reuse idle connection for ("http", 172.23.0.12:8899)
2021-03-11T03:41:43.358705600+00:00 [DEBUG] (4) hyper::proto::h1::io: flushed 189 bytes
2021-03-11T03:41:43.359773400+00:00 [DEBUG] (4) hyper::proto::h1::io: parsed 3 headers
2021-03-11T03:41:43.360632600+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body is content-length (37 bytes)
2021-03-11T03:41:43.361271900+00:00 [DEBUG] (4) hyper::proto::h1::conn: incoming body completed
2021-03-11T03:45:18.604356700+00:00 [DEBUG] (5) h2::codec::framed_read: received frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(2147483647) }
2021-03-11T03:45:18.604929900+00:00 [DEBUG] (5) h2::codec::framed_read: received frame=Ping { ack: false, payload: [1, 6, 1, 8, 0, 3, 3, 9] }
2021-03-11T03:45:18.605298400+00:00 [DEBUG] (5) h2::codec::framed_write: send frame=Ping { ack: true, payload: [1, 6, 1, 8, 0, 3, 3, 9] }
2021-03-11T03:45:18.606043100+00:00 [DEBUG] (5) h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2021-03-11T03:45:18.606491800+00:00 [DEBUG] (5) h2::proto::connection: Connection::poll; connection error error=NO_ERROR
2021-03-11T03:45:18.609186200+00:00 [DEBUG] (5) hyper::proto::h2::client: connection error: Socket not connected (os error 107)

@seanmonstar (Owner) commented:

I switched to using the async client and it sometimes works, but now I'm seeing exactly the same behaviour as in hyperium/tonic#575

Hm, when I read through the tonic issue, I notice a bunch more block_on calls. Are those also being called inside an async context? Calling block_on inside some Future (like an async fn, or a function that async fn calls) can cause deadlocks. Can you change those to awaits?

what's actually happening, that causes this behaviour?

Do you mean how a blocking API (like block_on, or reqwest::blocking (which uses block_on internally)) can cause a deadlock? I can wax on a little. So, block_on works by trying to poll the future for readiness. If the future happens to be ready, then nothing bad happens. If it isn't ready, then block_on will park the thread until that Future signals it is ready. The problem is that usually, that same Future is expecting the same thread to eventually check the IO driver (probably backed by epoll), and when the OS through epoll says the socket is ready, it signals the Future to be polled again. If the thread is parked (std::thread::park()), then it can't also check the IO driver for a socket status, and so it can deadlock.
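
For illustration, a stripped-down, hypothetical version of the pattern being described (none of these names come from the real code):

    // Anything that needs the runtime's IO/timer driver to make progress.
    async fn do_async_work() {
        tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    }

    fn sync_wrapper() {
        // block_on parks the current thread until the future signals readiness.
        // If that thread is one the runtime needs in order to drive its IO/timer
        // driver, the wakeup may never come and this call can deadlock
        // (especially on a current-thread runtime, or when every worker thread
        // ends up blocked like this).
        futures::executor::block_on(do_async_work());
    }

    #[tokio::main]
    async fn main() {
        // Sometimes the future is already ready when block_on first polls it
        // and nothing bad happens; sometimes it isn't, and the call hangs,
        // which matches the intermittent behaviour in this thread.
        sync_wrapper();
    }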

@mieubrisse (Author) commented:

Ohh that's super fascinating - block_on can't be used inside an async context??? I'd thought that was the proper way to force synchronicity when inside an async context. I'll try switching everything to await (though I seem to remember having issues with this weeks ago, as that was the first route I tried before discovering block_on - will circle back if so).

Re. what's actually happening - thanks for the explanation; that's actually really helpful for understanding! And presumably, this is implementation-dependent - i.e. this deadlock only happens in an async context, and in a synchronous context parking the thread works fine.

@seanmonstar (Owner) commented Mar 11, 2021

block_on can't be used inside an async context???

Nope, since Rust async usually multiplexes many individual futures on a single thread. An executor usually polls them all until they're not ready, registers them with something that can keep track cheaply (like epoll), and then waits on that to know when any of those futures are ready again. If the thread is parked, it can't do that.

this deadlock only happens in an async context, and in a synchronous context parking the thread works fine.

The reqwest::blocking client actually spawns a thread to run all of its futures in, so that blocking the current thread doesn't block the runtime thread. So, sometimes the stuff will be ready by the time it hits block_on, and there's no problem. But since networking, timers and scheduling aren't deterministic, sometimes it won't be ready yet, and then you'll experience your deadlock. If you're "lucky", the deadlock always happens and is thus easy to find.

reqwest::blocking works just fine in a completely synchronous context. It just doesn't play well inside an already async one.
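
And for contrast, a sketch of the fully synchronous setup where reqwest::blocking is fine - no #[tokio::main] and no async fn anywhere on the call path (the URL and JSON body are just the values from the logs above):

    use reqwest::header::CONTENT_TYPE;
    use std::time::Duration;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // The blocking client spawns its own runtime thread internally, so
        // blocking here never parks a thread its futures depend on.
        let client = reqwest::blocking::Client::builder()
            .timeout(Duration::from_secs(30))
            .build()?;

        let body = client
            .post("http://172.23.0.12:8899/")
            .header(CONTENT_TYPE, "application/json")
            .body(r#"{"id":1,"jsonrpc":"2.0","method":"getSlot","params":[{"commitment":"confirmed"}]}"#.to_string())
            .send()?
            .text()?;

        println!("{}", body);
        Ok(())
    }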

@mieubrisse (Author) commented Mar 11, 2021

Ahhh that makes way more sense now! So I tried switching to await and got told that this is only available inside an async function. From my understanding, using await would mean that I'd need to bubble the async all the way up my stack until basically everything is async (which would match what you already pointed out - that my main function is async). I don't actually use any multithreading and would love everything to be synchronous for simplicity - it seems like the right move here would be to go the other way and figure out a way to make my main function not require tokio, correct?

@mieubrisse (Author) commented Mar 11, 2021

Just tried to remove the tokio instantiation, and got the same error as the first time I tried without it:

thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /home/rust/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.1.1/src/runtime/context.rs:18:26
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at ./rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/panicking.rs:92:14
   2: core::option::expect_failed
             at ./rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/option.rs:1260:5
   3: std::thread::local::LocalKey<T>::with
   4: tokio::io::poll_evented::PollEvented<E>::new
   5: tokio::net::tcp::stream::TcpStream::new
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: <tonic::transport::service::reconnect::Reconnect<M,Target> as tower_service::Service<Request>>::poll_ready
  12: <tower::util::either::Either<A,B> as tower_service::Service<Request>>::poll_ready
  13: <tower::util::boxed::sync::Boxed<S> as tower_service::Service<Request>>::poll_ready
  14: <tonic::transport::service::connection::Connection as tower_service::Service<http::request::Request<tonic::body::BoxBody>>>::poll_ready
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  16: std::thread::local::LocalKey<T>::with
  17: futures_executor::local_pool::block_on
  18: kurtosis_rust_lib::execution::test_suite_executor::TestSuiteExecutor::run
  19: testsuite::main

I did a bunch of reading and think I understand sync/async/block_on/.await a little better. Does this mean that, because tower is involved with tonic, there's no way to get around having the tokio runtime running? (and therefore, no way to use block_on - I must instead make all my functions async and use await?)

@seanmonstar (Owner) commented:

Tower itself doesn't really need Tokio. If you're using Tonic, there isn't a good way to make it all blocking. Especially if you're making a server, it's probably easiest/best to just make it all async. Otherwise, you usually need a new thread per request.

@mieubrisse (Author) commented Mar 11, 2021

Ah sorry, this is actually for a client (which is why multithreading is way less important). That's the bit that has me sorta confused - why I need an async runtime at all for a client that's really just executing a bunch of sequential calls to the server.

(The server is already written/tested, in Go)

@mieubrisse (Author) commented:

Just found the tokio Discord - will move the discussion there, since that seems to be a better spot for it at this point: https://discord.com/channels/500028886025895936/628706823104626710/819665340362653727

@mieubrisse (Author) commented Mar 13, 2021

Closing the loop here, for posterity - the problem was indeed what @seanmonstar said. Having futures::executor::block_on mixed in with #[tokio::main] on my main function was causing intermittent deadlocks. I didn't want the async contagion to spread all the way up my code, so I fixed this by removing all the futures::executor::block_on calls, creating a Tokio runtime via Runtime::new() as close to the async calls as possible, and using runtime.block_on() to force synchronicity on the async functions.
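
A rough sketch of that final shape, in case it helps anyone landing here later (the function name and JSON body are illustrative, not the actual library code; note that Runtime::new() / block_on will panic if called from inside an existing Tokio runtime, so this only works once #[tokio::main] is gone from the call path):

    use tokio::runtime::Runtime;

    fn get_slot(url: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
        // A dedicated runtime created right next to the async call; it owns its
        // own worker and IO-driver threads, so block_on here doesn't park a
        // thread that the future depends on.
        let runtime = Runtime::new()?;

        let result = runtime.block_on(async {
            let response = reqwest::Client::new()
                .post(url)
                .header(reqwest::header::CONTENT_TYPE, "application/json")
                .body(r#"{"id":1,"jsonrpc":"2.0","method":"getSlot"}"#.to_string())
                .send()
                .await?;

            let json: serde_json::Value = serde_json::from_str(&response.text().await?)?;
            // Pin the async block's Ok/Err types explicitly so the error
            // conversions from `?` have a concrete target.
            Ok::<serde_json::Value, Box<dyn std::error::Error>>(json["result"].clone())
        });

        result
    }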
