100% CPU usage on client disconnection when serving a streamed response #1764

Closed
operutka opened this issue Feb 15, 2019 · 10 comments · Fixed by #1981
Labels
A-http1 Area: HTTP/1 specific. C-bug Category: bug. Something is wrong. This is bad!

Comments

@operutka

I ran into an issue while load testing my server application based on Hyper 0.12.23. The server provides very long (video) streams to clients. The problem is that sometimes, when a client disconnects, the server ends up in an infinite loop trying to flush the response.

I have a load testing script written in Python that simulates the behaviour of these clients. When I run the script directly from the command line and stop it with SIGINT, the server load decreases and everything works as expected. However, when I run the script in a Docker container and then stop the container, the server application suddenly starts consuming a lot of CPU time.

I managed to get the following stack trace:

#0  0x00007fd1148b0ffd in writev () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000560974aad6b7 in mio::net::tcp::TcpStream::write_bufs::hc89b0d02dfa8ddfa ()
#2  0x00005609745ba2f5 in _$LT$tokio_tcp..stream..TcpStream$u20$as$u20$tokio_io..async_write..AsyncWrite$GT$::write_buf::h6c9ba9b717fe0dfb ()
#3  0x0000560974620419 in _$LT$hyper..proto..h1..io..Buffered$LT$T$C$$u20$B$GT$$GT$::flush::hf6e9c87ae11953e9 ()
#4  0x00005609746e6605 in _$LT$hyper..proto..h1..dispatch..Dispatcher$LT$D$C$$u20$Bs$C$$u20$I$C$$u20$T$GT$$GT$::poll_flush::h788fa5e1ea237e85 ()
#5  0x00005609746e4769 in _$LT$hyper..proto..h1..dispatch..Dispatcher$LT$D$C$$u20$Bs$C$$u20$I$C$$u20$T$GT$$GT$::poll_catch::hf5cbb1e929b70897 ()
#6  0x00005609746ac614 in _$LT$hyper..server..conn..upgrades..UpgradeableConnection$LT$I$C$$u20$S$C$$u20$E$GT$$u20$as$u20$futures..future..Future$GT$::poll::h6c066c4ab3ebbcc7 ()
#7  0x000056097460f214 in _$LT$hyper..server..conn..spawn_all..NewSvcTask$LT$I$C$$u20$N$C$$u20$S$C$$u20$E$C$$u20$W$GT$$u20$as$u20$futures..future..Future$GT$::poll::hb0caae453b560ddd ()
#8  0x0000560974abe582 in futures::task_impl::std::set::h752dbb2b3f4dcdbd ()
#9  0x0000560974ac29b6 in _$LT$futures..task_impl..Spawn$LT$T$GT$$GT$::poll_future_notify::h0240129271f3aef3 ()

It's the same for all busy worker threads.

In my opinion, running the load testing script in a Docker container and then stopping it leaves some connections half-open, and Hyper is probably not able to recognize this. I'm not sure whether this is related to issue #1716.

@operutka
Author

I can see the same behaviour with the latest Hyper v0.12.24 (tokio v0.1.15, tokio-io v0.1.11). The application works correctly when I put it behind an nginx proxy.

@seanmonstar
Member

Hm, sounds like it's looping here?

let n = try_ready!(self.io.write_buf(&mut self.write_buf.auto()));

Does the log message after it trigger over and over?

@seanmonstar seanmonstar added C-bug Category: bug. Something is wrong. This is bad! A-http1 Area: HTTP/1 specific. labels Feb 15, 2019
@operutka
Author

Yes, I also think it's that line. I'm not sure which log message you mean. If you mean the stack trace, then yes, it is consistent: every time I interrupt the process in GDB, I get this stack trace (or a stack trace that matches it but only reaches one of the callers).

@seanmonstar
Member

I meant the debug!("flushed {} bytes", n); that happens just after the write_buf line. It'd be useful to know if something is misbehaving, since the loop will only exit under one of these three conditions (see the sketch after this list):

  • The transport returned WouldBlock (or any other error).
  • The transport said it wrote Ok(0) bytes, meaning it is closed.
  • The buffer of chunks reports there are no more bytes to write.
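
For orientation, here is a self-contained reduction of such a loop and its exit conditions, assuming futures 0.1. This is not hyper's actual code; the names and numbers are illustrative. Note that NotReady leaves poll_flush via an early return from try_ready!, so the loop itself cannot spin; if the CPU is pegged, the task must be getting re-notified and polled again from outside.

// Not hyper's actual code: a minimal sketch of a flush loop with the three
// exit conditions above (assumed crate: futures = "0.1").
#[macro_use]
extern crate futures;

use futures::{Async, Poll};

// Stand-in for self.io.write_buf(...): how many bytes the transport accepted,
// or NotReady if it is not currently writable.
fn write_buf(transport_writable: bool, buffered: usize) -> Poll<usize, std::io::Error> {
    if !transport_writable {
        return Ok(Async::NotReady); // condition 1: would block (or any error)
    }
    Ok(Async::Ready(buffered.min(340_000)))
}

fn poll_flush(transport_writable: bool, mut buffered: usize) -> Poll<(), std::io::Error> {
    loop {
        // try_ready! early-returns Ok(Async::NotReady) out of poll_flush.
        let n = try_ready!(write_buf(transport_writable, buffered));
        println!("flushed {} bytes", n);
        if n == 0 {
            return Ok(Async::Ready(())); // condition 2: Ok(0), transport is closed
        }
        buffered -= n;
        if buffered == 0 {
            return Ok(Async::Ready(())); // condition 3: nothing left to write
        }
    }
}

fn main() {
    // A transport that is never writable: every poll ends in NotReady,
    // so nothing is ever logged as flushed.
    let _ = poll_flush(false, 1_000_000);
}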

@operutka
Author

Sorry, I did not have much time to analyse it. I got back to it today.

I've discovered that the flushed message appears regularly until I stop the Docker container with my client. Once the container is stopped, the message stops appearing entirely. This suggests that the self.io.write_buf(...) method returns NotReady and the try_ready!(...) macro performs an early return over and over again.

@operutka
Author

I've managed to create a minimal working example (see the attachment below). It's a simple Hyper server that responds to every request with an infinite streamed response. The project also contains a directory called "test", which holds a simple Python script that polls data from the server and a Dockerfile for creating a Docker image with the Python script.

You can re-create the problem using the following steps:

# unzip the file and enter the project directory and then:
cargo build
# the server will by default listen on 0.0.0.0:12345:
target/debug/hyper-loop

in a separate terminal:

# enter the project directory and then:
cd test
docker build -t hyper-loop-test .
# note: replace 172.17.42.1 with the address of your docker network interface
docker run --name=kill-me hyper-loop-test python3 -u test.py http://172.17.42.1:12345/

and finally kill the "kill-me" Docker container from a separate terminal:

docker rm -f kill-me

The server will never print the "client has left" message defined on line 29 of main.rs, and it will get stuck in an infinite loop.

I've also noticed that the problem is sensitive to the amount of data the server sends at once: the bigger the chunks I send, the higher the chance of getting stuck in an infinite loop.

Attachment: hyper-loop.zip
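
For readers without the attachment, the server in the example is roughly along these lines (assumed crates: hyper 0.12, futures 0.1; the address and chunk size are illustrative, the authoritative code is in the zip):

extern crate futures;
extern crate hyper;

use futures::stream;
use hyper::rt::Future;
use hyper::service::service_fn_ok;
use hyper::{Body, Response, Server};

fn main() {
    // Listen address matching the reproduction steps above.
    let addr = ([0, 0, 0, 0], 12345).into();

    let new_svc = || {
        service_fn_ok(|_req| {
            // Endless chunked body; the chunk size here is illustrative.
            let chunks = stream::repeat::<_, hyper::Error>(vec![0u8; 340_000]);
            Response::new(Body::wrap_stream(chunks))
        })
    };

    let server = Server::bind(&addr)
        .serve(new_svc)
        .map_err(|e| eprintln!("server error: {}", e));

    hyper::rt::run(server);
}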

@seanmonstar
Member

Thanks for the extra details! By chance, do you happen to know whether it was logging the debug line "flushed n bytes" in the loop?

@operutka
Author

As I mentioned, the line "flushed n bytes" stops appearing when I stop the Docker container with my client. Here is a trace log that I captured from the server example:

TRACE - loop process - 0 events, 0.000s
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 340009 bytes
TRACE - flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Worker::sleep; worker=WorkerId(3)
TRACE -   sleeping -- push to stack; idx=3
TRACE -     -> starting to sleep; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - event Readable | Writable | Hup Token(4194305)
TRACE - loop process - 1 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 340009 bytes
TRACE - flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Worker::sleep; worker=WorkerId(3)
TRACE -   sleeping -- push to stack; idx=3
TRACE -     -> starting to sleep; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE -     -> wakeup; idx=3
TRACE - loop process - 0 events, 0.000s
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> submit internal; idx=3
TRACE - signal_work -- notify; idx=3
TRACE -     -> wakeup; idx=3
TRACE - Task::run; state=Running
TRACE - encoding chunked 340000B
DEBUG - flushed 271 bytes
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - event Readable Token(4194303)
TRACE - loop process - 1 events, 0.000s
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
TRACE - Task::run; state=Running
TRACE - Notifier::notify; id=0x7fa820002e70
TRACE -     -> not ready
...

The last three lines repeat over and over again. Clearly, there is a HUP event processed by the event loop (the line TRACE - event Readable | Writable | Hup Token(4194305)), but the application keeps calling the flush method even though it probably should not.
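
For illustration only (this is not hyper's code): the Task::run / Notifier::notify / not ready pattern above is what you get from a future that arranges to be woken again on every poll while still returning NotReady, e.g. with futures 0.1 and tokio 0.1:

extern crate futures;
extern crate tokio;

use futures::{task, Async, Future, Poll};

// A future that re-notifies its own task on every poll and returns NotReady:
// the executor polls it again immediately, burning CPU without progress.
struct BusyPoll {
    polls: u32,
}

impl Future for BusyPoll {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        self.polls += 1;
        if self.polls > 5 {
            // Stop after a few rounds so this example terminates.
            return Ok(Async::Ready(()));
        }
        println!("Task::run; poll #{} -> not ready", self.polls);
        task::current().notify(); // re-arm immediately: this keeps the task spinning
        Ok(Async::NotReady)
    }
}

fn main() {
    tokio::run(BusyPoll { polls: 0 });
}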

@seanmonstar
Member

Ah yes, this is #1716. hyper by default allows for half-closed transports (calling shutdown(Write) sends a HUP). Oftentimes, sockets will also error on write with EPIPE; I'm not sure why that isn't happening in your case.
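
A minimal illustration, not taken from the issue: a client that produces exactly this kind of half-close using only the standard library. The address and request are placeholders.

use std::io::Write;
use std::net::{Shutdown, TcpStream};

fn main() -> std::io::Result<()> {
    // Point this at the example server; the address is illustrative.
    let mut stream = TcpStream::connect("127.0.0.1:12345")?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")?;

    // Half-close: no more data will be sent, but the read side stays open,
    // so a server that allows half-closed connections keeps streaming.
    stream.shutdown(Shutdown::Write)?;

    // Keep the process alive so the connection is not fully torn down.
    std::thread::sleep(std::time::Duration::from_secs(60));
    Ok(())
}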

@operutka
Author

Thank you for the info. Setting http1_half_close to false in the server builder helped.
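
For anyone hitting the same thing, the workaround is a single builder call; this fragment reuses addr and new_svc from the server sketch earlier in this thread and assumes hyper 0.12:

let server = Server::bind(&addr)
    // Give up on the connection as soon as the client closes its write side,
    // instead of continuing to stream the response into a half-closed socket.
    .http1_half_close(false)
    .serve(new_svc)
    .map_err(|e| eprintln!("server error: {}", e));

hyper::rt::run(server);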

I'm just wondering whether this should really be the default behaviour. I realize changing it would be a breaking change for some applications. However, if half-closed connections are enabled by default, this problem will surface sooner or later. I guess we don't want production applications to end up at 100% CPU usage for no apparent reason; it could even be considered a DoS vulnerability.

Regarding EPIPE: in my opinion, writing into a (half-)closed socket does not necessarily end with EPIPE, as the remote peer might be behind a firewall/NAT. In that case, the remote end would simply drop all incoming packets instead of sending a TCP RST.
