This repository has been archived by the owner on Sep 13, 2018. It is now read-only.

Performance overhead of tokio-proto #149

Open
jonhoo opened this issue Mar 1, 2017 · 7 comments

Comments

@jonhoo

jonhoo commented Mar 1, 2017

I've recently been using tarpc, which builds on tokio and tokio-proto, in a high-performance project in which we benchmark against memcached. We were seeing lower throughput than expected, which led me to start digging into where the performance overhead was coming from (with a lot of help from @tikue). This brought me down a deep rabbit hole, but uncovered some interesting data that I figured I'd share here.

To help with profiling, I built several micro-benchmarks that use incrementally higher-level libraries to do a very simple continuous ping-pong RPC call. The first uses tokio-core directly, the second tokio-proto::pipeline, the third tokio-proto::multiplex, and the last builds on top of tarpc. For comparison, I have also included a relatively unoptimized memcached client.

Except for the memcached benchmark, all the others use a single reactor core shared by both the server and client, driven by a single thread, to minimize any cross-thread and cross-process overheads. The benchmark is over TCP using IPv4, and consists of doing a fixed number of trivial RPC calls, one at a time, and then reporting the average latency per call. For memcached, the server is obviously in a separate process (started with -t 1), but the client otherwise behaves the same.
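
To make the methodology concrete, the measurement loop is conceptually just the following. This is a minimal sketch, not the actual benchmark code: the real benchmarks drive the client through the shared tokio-core reactor, and `ping` here is only a stand-in for one request/response round trip.

```rust
use std::time::Instant;

// Minimal sketch of the measurement: issue `n` trivial RPCs one at a time
// and report the mean latency per call. `ping` stands in for a single
// request/response round trip against whichever stack is under test.
fn bench<F: FnMut()>(n: u64, mut ping: F) {
    let start = Instant::now();
    for _ in 0..n {
        ping();
    }
    let elapsed = start.elapsed();
    let micros = elapsed.as_secs() * 1_000_000 + u64::from(elapsed.subsec_nanos() / 1_000);
    println!("{}µs/call", micros / n);
}

fn main() {
    // Dummy no-op workload so the sketch runs standalone; a real benchmark
    // would perform one RPC round trip here.
    bench(100_000, || {});
}
```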

The numbers I'm seeing on my laptop (a Lenovo X1 Carbon, Intel i7-5600U CPU @ 2.60GHz, Linux 4.9.11) are as follows:

tokio 6µs/call
tokio-proto-pipeline 12µs/call
tokio-proto-multiplex 17µs/call
tarpc 19µs/call
memcached 15µs/call

I was surprised to see that tokio-proto introduces such significant overhead, and that multiplex does so much worse (relatively speaking) than pipeline. To put these numbers into perspective, the latency of pure tokio translates into ~170k RPCs/second for one core (which I believe is quite close to what it could theoretically do given system call overheads and such), whereas multiplexed tokio-proto is closer to 55k RPCs/s (again, per core). That is about 3x. I don't know if performance has been a goal thus far (I'm guessing probably not, as the fundamentals are still being worked out), but 3x sounds like something it should be possible to shave a bit off.
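
The throughput figures are simply the reciprocal of the per-call latency; a quick worked example:

```rust
fn main() {
    // One core doing one blocking call at a time: calls/s = 1s / per-call latency.
    let latency_us = 6.0_f64;                     // measured per-call latency in µs
    println!("{:.0} calls/s", 1e6 / latency_us);  // ≈ 166,667, i.e. ~170k RPCs/s
}
```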

In the spirit of starting that process, I've also added a profiling script for the aforementioned benchmarks that produces both perf reports and flamegraphs (given below). Each higher layer adds some overhead (which, to be fair, is to be expected), and hopefully this data and the benchmarks can help identify which of those overheads it may be possible to trim down. It would be awesome if tokio-proto were not just a library that is nice to work with, but also one that is blazingly fast. We're quite close to that, but shaving off those last µs would make tokio-proto seriously impressive!

[Flamegraphs attached in the original issue: tokio, tokio-proto::pipeline, tokio-proto::multiplex, tarpc]

@alexcrichton
Contributor

Just wanted to say thanks for the detailed report! Note that it's definitely the intention for tokio-proto to be as fast as possible, so the report is much appreciated :)

@alexcrichton
Contributor

Ok so taking a look into this, the first thing I noticed is that the tokio and tokio-pipeline benchmarks are doing pretty different operations. The tokio server was just a plain echo server while the tokio-pipeline server was doing integer parsing and such. I made a few changes which basically:

  • Remove a box from tokio-proto
  • Update the tokio server to implement the same protocol as tokio-proto (hand written though, may be wrong! see the sketch at the end of this comment)

With those changes I'm locally seeing the tokio server take 7µs/call and the proto server take 8µs/call. I started out at 6µs/call and 9µs/call, so not quite the double discrepancy you saw yourself! I'm locally running Linux 4.4.0 w/ an i7-4770.

I'm curious, how do those changes affect the benchmarks on your own machine?
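
For reference, here is a rough sketch of the wire protocol both servers now speak, written with blocking std I/O purely for illustration. The actual benchmarks implement it asynchronously on top of tokio-core / tokio-proto, and the exact framing (newline-delimited decimal integers that the server doubles) is an assumption based on the example code:

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;

// Illustrative only: a blocking server speaking the same kind of protocol the
// benchmarks use: newline-delimited decimal integers answered with the doubled
// value. The real benchmark servers do this asynchronously via tokio / tokio-proto.
fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:7878")?; // port chosen arbitrarily
    for stream in listener.incoming() {
        let stream = stream?;
        let mut writer = stream.try_clone()?;
        for line in BufReader::new(stream).lines() {
            let n: i64 = line?.trim().parse().unwrap_or(0);
            writeln!(writer, "{}", n * 2)?;
        }
    }
    Ok(())
}
```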

@jonhoo
Author

jonhoo commented Mar 1, 2017

Ah, yes, the benchmarks have evolved a little over time, and I haven't been good enough about backporting the changes. In particular, the tokio-proto benchmark was built on top of the example code, which does number doubling. I wanted to just have it be an echo server like tokio, but couldn't get that working as quickly.

$ (mine) target/release/tokio; target/release/tokio-proto-pipeline; target/release/tokio-proto-multiplex; target/release/tarpc; target/release/memcached
tokio 7µs/call
tokio-proto-pipeline 12µs/call
tokio-proto-multiplex 17µs/call
tarpc 20µs/call
memcached 14µs/call
$ (alexcrichton) target/release/tokio; target/release/tokio-proto-pipeline; target/release/tokio-proto-multiplex; target/release/tarpc; target/release/memcached
tokio 8µs/call
tokio-proto-pipeline 12µs/call
tokio-proto-multiplex 16µs/call
tarpc 21µs/call
memcached 14µs/call

So it doesn't seem to make too much of a difference. You're right that parsing a number and boxing a future make a slight difference, but there's definitely something more fundamental going on here.

@alexcrichton
Contributor

Ok, thanks for confirming! Taking a look at some runs of perf, I don't see anything obvious per se that's popping out at me.

In general these sorts of overheads are constant, so if you have a 100µs request/response cycle time then there's unlikely to be any real meaningful difference between tokio/tokio-proto/w/e. I believe @carllerche mentioned that not much work has gone into optimizing tokio-proto-multiplex, so that may explain that discrepancy.

One other final point: right now the proto servers are using EasyBuf, whereas the tokio version is using much rawer buffers. A super high-performance app would be unlikely to use EasyBuf for much beyond getting things running initially, so swapping that out for something more custom may also be useful to test here.
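
To make that concrete, here is a rough sketch of the kind of "more custom" buffering meant here: accumulate incoming bytes in a single reusable Vec<u8> and carve complete frames out of it. Blocking I/O and newline framing are used only to keep the sketch short; a real replacement would live inside the codec's decode path, and nothing below is tokio-proto API:

```rust
use std::io::{self, Read};

// Sketch of hand-rolled framing on top of one reusable Vec<u8>: keep appending
// bytes read from the socket into `buf`, and return a copy of each complete
// newline-delimited frame as it becomes available. A real codec would do the
// equivalent inside decode() instead of using blocking reads.
fn read_frame<R: Read>(src: &mut R, buf: &mut Vec<u8>) -> io::Result<Option<Vec<u8>>> {
    loop {
        // A full frame is already buffered: split it off and keep the rest.
        if let Some(pos) = buf.iter().position(|&b| b == b'\n') {
            let frame = buf[..pos].to_vec();
            buf.drain(..=pos); // drop the frame plus its '\n' delimiter
            return Ok(Some(frame));
        }
        // Otherwise pull more bytes from the source into the same buffer.
        let mut chunk = [0u8; 4096];
        let n = src.read(&mut chunk)?;
        if n == 0 {
            return Ok(None); // EOF before a complete frame arrived
        }
        buf.extend_from_slice(&chunk[..n]);
    }
}

fn main() -> io::Result<()> {
    // Stand-in for a socket so the sketch runs standalone.
    let mut src = io::Cursor::new(b"12\n34\n".to_vec());
    let mut buf = Vec::new();
    while let Some(frame) = read_frame(&mut src, &mut buf)? {
        println!("frame: {:?}", String::from_utf8_lossy(&frame));
    }
    Ok(())
}
```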

@rozaliev

rozaliev commented Mar 5, 2017

I've also had lower throughput than expected in my app, but found out that it was because of this constant:

const MAX_IN_FLIGHT_REQUESTS: usize = 32;

You might wanna set it higher than 32 and test again. In my case, the difference was AFAIR 20k vs 200k.
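
For context on why that constant caps throughput (though not the latency measured in this issue): with at most MAX_IN_FLIGHT_REQUESTS outstanding at once, throughput is bounded by roughly max_in_flight divided by the round-trip time. A back-of-the-envelope sketch, using an illustrative round trip that is not taken from these benchmarks:

```rust
fn main() {
    // Upper bound on throughput with a fixed in-flight cap:
    // at most `max_in_flight` requests can complete per round trip.
    let max_in_flight = 32.0_f64;
    let round_trip_secs = 100e-6; // assume a 100µs round trip (illustrative)
    println!("≈ {:.0} requests/s", max_in_flight / round_trip_secs); // ≈ 320,000
}
```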

@tikue

tikue commented Mar 5, 2017

@rozaliev this test was of roundtrip latency, not throughput.

@carllerche
Member

@rozaliev configuring the max in-flight requests is related to: #112
