100% CPU usage on client disconnection when serving a streamed response #1764

Closed
operutka opened this issue Feb 15, 2019 · 10 comments · Fixed by #1981
Labels
A-http1 Area: HTTP/1 specific. C-bug Category: bug. Something is wrong. This is bad!

Comments

@operutka

I ran into an issue while load testing my server application based on Hyper 0.12.23. The server provides very long (video) streams to clients. The problem is that sometimes, when a client disconnects, the server ends up in an infinite loop trying to flush the response.

I have a load testing script written in Python that simulates the behaviour of these clients. When I run the script directly from the command line and stop it with SIGINT, the server load decreases and everything works as expected. However, when I run the script in a Docker container and then stop the container, the server application suddenly starts consuming a lot of CPU time.

I managed to get the following stack trace:

#0  0x00007fd1148b0ffd in writev () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000560974aad6b7 in mio::net::tcp::TcpStream::write_bufs::hc89b0d02dfa8ddfa ()
#2  0x00005609745ba2f5 in _$LT$tokio_tcp..stream..TcpStream$u20$as$u20$tokio_io..async_write..AsyncWrite$GT$::write_buf::h6c9ba9b717fe0dfb ()
#3  0x0000560974620419 in _$LT$hyper..proto..h1..io..Buffered$LT$T$C$$u20$B$GT$$GT$::flush::hf6e9c87ae11953e9 ()
#4  0x00005609746e6605 in _$LT$hyper..proto..h1..dispatch..Dispatcher$LT$D$C$$u20$Bs$C$$u20$I$C$$u20$T$GT$$GT$::poll_flush::h788fa5e1ea237e85 ()
#5  0x00005609746e4769 in _$LT$hyper..proto..h1..dispatch..Dispatcher$LT$D$C$$u20$Bs$C$$u20$I$C$$u20$T$GT$$GT$::poll_catch::hf5cbb1e929b70897 ()
#6  0x00005609746ac614 in _$LT$hyper..server..conn..upgrades..UpgradeableConnection$LT$I$C$$u20$S$C$$u20$E$GT$$u20$as$u20$futures..future..Future$GT$::poll::h6c066c4ab3ebbcc7 ()
#7  0x000056097460f214 in _$LT$hyper..server..conn..spawn_all..NewSvcTask$LT$I$C$$u20$N$C$$u20$S$C$$u20$E$C$$u20$W$GT$$u20$as$u20$futures..future..Future$GT$::poll::hb0caae453b560ddd ()
#8  0x0000560974abe582 in futures::task_impl::std::set::h752dbb2b3f4dcdbd ()
#9  0x0000560974ac29b6 in _$LT$futures..task_impl..Spawn$LT$T$GT$$GT$::poll_future_notify::h0240129271f3aef3 ()

It's the same for all busy worker threads.

In my opinion, running the load testing script in a Docker container and then stopping it leaves some connections half-open, and Hyper is probably not able to recognize this. I'm not sure whether this is related to issue #1716.

@operutka
Author

I can see the same behaviour with the latest Hyper v0.12.24 (tokio v0.1.15, tokio-io v0.1.11). The application works correctly when I put it behind an nginx proxy.

@seanmonstar
Member

Hm, sounds like it's looping here?

let n = try_ready!(self.io.write_buf(&mut self.write_buf.auto()));

Does the log message after it trigger over and over?

@seanmonstar seanmonstar added C-bug Category: bug. Something is wrong. This is bad! A-http1 Area: HTTP/1 specific. labels Feb 15, 2019
@operutka
Author

Yes, I also think it's that line. I'm not sure which log message you mean. If you mean the stack trace, then yes, it is consistent: every time I interrupt the process in GDB, I get this stack trace (or a stack trace that matches it but only reaches one of the callers).

@seanmonstar
Member

I meant the debug!("flushed {} bytes", n); that happens just after the write_buf line. It'd be useful to know if something is misbehaving, since the loop will only exit under one of these three conditions (see the sketch after this list):

  • The transport returned WouldBlock (or any other error).
  • The transport said it wrote Ok(0) bytes, meaning it is closed.
  • The buffer of chunks reports there are no more bytes to write.
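
For orientation, here is a self-contained reduction of such a loop and its exit conditions, assuming futures 0.1. This is not hyper's actual code; the names and numbers are illustrative. Note that NotReady leaves poll_flush via an early return from try_ready!, so the loop itself cannot spin; if the CPU is pegged, the task must be getting re-notified and polled again from outside.

// Not hyper's actual code: a minimal sketch of a flush loop with the three
// exit conditions above (assumed crate: futures = "0.1").
#[macro_use]
extern crate futures;

use futures::{Async, Poll};

// Stand-in for self.io.write_buf(...): how many bytes the transport accepted,
// or NotReady if it is not currently writable.
fn write_buf(transport_writable: bool, buffered: usize) -> Poll<usize, std::io::Error> {
    if !transport_writable {
        return Ok(Async::NotReady); // condition 1: would block (or any error)
    }
    Ok(Async::Ready(buffered.min(340_000)))
}

fn poll_flush(transport_writable: bool, mut buffered: usize) -> Poll<(), std::io::Error> {
    loop {
        // try_ready! early-returns Ok(Async::NotReady) out of poll_flush.
        let n = try_ready!(write_buf(transport_writable, buffered));
        println!("flushed {} bytes", n);
        if n == 0 {
            return Ok(Async::Ready(())); // condition 2: Ok(0), transport is closed
        }
        buffered -= n;
        if buffered == 0 {
            return Ok(Async::Ready(())); // condition 3: nothing left to write
        }
    }
}

fn main() {
    // A transport that is never writable: every poll ends in NotReady,
    // so nothing is ever logged as flushed.
    let _ = poll_flush(false, 1_000_000);
}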

@operutka
Author

Sorry, I did not have much time to analyse it. I got back to it today.

I've discovered that the flushed message appears regularly until I stop the Docker container with my client. Once the container is stopped, the message stops appearing entirely. This suggests that the self.io.write_buf(...) method returns NotReady and the try_ready!(...) macro performs an early return over and over again.

@operutka
Author

I've managed to create a minimal working example (see the attachment below). It's a simple Hyper server that responds to every request with an infinite streamed response. The project also contains a directory called "test", which holds a simple Python script that polls data from the server and a Dockerfile for creating a Docker image with the Python script.

You can re-create the problem using the following steps:

# unzip the file and enter the project directory and then:
cargo build
# the server will by default listen on 0.0.0.0:12345:
target/debug/hyper-loop

in a separate terminal:

# enter the project directory and then:
cd test
docker build -t hyper-loop-test .
# note: replace 172.17.42.1 with the address of your docker network interface
docker run --name=kill-me hyper-loop-test python3 -u test.py http://172.17.42.1:12345/

and finally kill the "kill-me" Docker container from a separate terminal:

docker rm -f kill-me

The server will never print the "client has left" message defined on line 29 of main.rs, and it will get stuck in an infinite loop.

I've also noticed that the problem is sensitive to the amount of data the server sends at once: the bigger the chunks I send, the higher the chance of getting stuck in an infinite loop.

Attachment: hyper-loop.zip
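
For readers without the attachment, the server in the example is roughly along these lines (assumed crates: hyper 0.12, futures 0.1; the address and chunk size are illustrative, the authoritative code is in the zip):

extern crate futures;
extern crate hyper;

use futures::stream;
use hyper::rt::Future;
use hyper::service::service_fn_ok;
use hyper::{Body, Response, Server};

fn main() {
    // Listen address matching the reproduction steps above.
    let addr = ([0, 0, 0, 0], 12345).into();

    let new_svc = || {
        service_fn_ok(|_req| {
            // Endless chunked body; the chunk size here is illustrative.
            let chunks = stream::repeat::<_, hyper::Error>(vec![0u8; 340_000]);
            Response::new(Body::wrap_stream(chunks))
        })
    };

    let server = Server::bind(&addr)
        .serve(new_svc)
        .map_err(|e| eprintln!("server error: {}", e));

    hyper::rt::run(server);
}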

@seanmonstar
Member

Thanks for the extra details! By chance, do you happen to know whether it was logging the debug line "flushed n bytes" in the loop?

@operutka
Author

As I mentioned, the line "flushed n bytes" stops appearing when I stop the Docker container with my client. Here is a trace log that I captured from the server example:

TRACE - loop process - 0 events, 0.000s
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 340009 bytes
TRACE - flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Worker::sleep; worker=WorkerId(3)
TRACE -   sleeping -- push to stack; idx=3
TRACE -     -> starting to sleep; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - event Readable | Writable | Hup Token(4194305)
TRACE - loop process - 1 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 340009 bytes
TRACE - flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Worker::sleep; worker=WorkerId(3)
TRACE -   sleeping -- push to stack; idx=3
TRACE -     -> starting to sleep; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 271 bytes
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
...

The last three lines repeat over and over again. Clearly, there is a HUP event processed by the event loop (the line TRACE - event Readable | Writable | Hup Token(4194305)), but the application keeps calling the flush method even though it probably should not.
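
For illustration only (this is not hyper's code): the Task::run / Notifier::notify / not ready pattern above is what you get from a future that arranges to be woken again on every poll while still returning NotReady, e.g. with futures 0.1 and tokio 0.1:

extern crate futures;
extern crate tokio;

use futures::{task, Async, Future, Poll};

// A future that re-notifies its own task on every poll and returns NotReady:
// the executor polls it again immediately, burning CPU without progress.
struct BusyPoll {
    polls: u32,
}

impl Future for BusyPoll {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        self.polls += 1;
        if self.polls > 5 {
            // Stop after a few rounds so this example terminates.
            return Ok(Async::Ready(()));
        }
        println!("Task::run; poll #{} -> not ready", self.polls);
        task::current().notify(); // re-arm immediately: this keeps the task spinning
        Ok(Async::NotReady)
    }
}

fn main() {
    tokio::run(BusyPoll { polls: 0 });
}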

@seanmonstar
Member

Ah yes, this is #1716. hyper by default allows for half-closed transports (calling shutdown(Write) sends a HUP). Oftentimes, sockets will also error on write with EPIPE; I'm not sure why that isn't happening in your case.
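
A minimal illustration, not taken from the issue: a client that produces exactly this kind of half-close using only the standard library. The address and request are placeholders.

use std::io::Write;
use std::net::{Shutdown, TcpStream};

fn main() -> std::io::Result<()> {
    // Point this at the example server; the address is illustrative.
    let mut stream = TcpStream::connect("127.0.0.1:12345")?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")?;

    // Half-close: no more data will be sent, but the read side stays open,
    // so a server that allows half-closed connections keeps streaming.
    stream.shutdown(Shutdown::Write)?;

    // Keep the process alive so the connection is not fully torn down.
    std::thread::sleep(std::time::Duration::from_secs(60));
    Ok(())
}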

@operutka
Author

Thank you for the info. Setting http1_half_close to false in the server builder helped.
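
For anyone hitting the same thing, the workaround is a single builder call; this fragment reuses addr and new_svc from the server sketch earlier in this thread and assumes hyper 0.12:

let server = Server::bind(&addr)
    // Give up on the connection as soon as the client closes its write side,
    // instead of continuing to stream the response into a half-closed socket.
    .http1_half_close(false)
    .serve(new_svc)
    .map_err(|e| eprintln!("server error: {}", e));

hyper::rt::run(server);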

I'm just wondering whether this should really be the default behaviour. I realize changing it would be a breaking change for some applications. However, if half-closed connections are enabled by default, this problem will surface sooner or later. I guess we don't want production applications to end up at 100% CPU usage for no apparent reason; it could even be considered a DoS vulnerability.

Regarding EPIPE: in my opinion, writing into a (half-)closed socket does not necessarily end with EPIPE, as the remote peer might be behind a firewall/NAT. In that case, the remote end would simply drop all incoming packets instead of sending a TCP RST.
