# Performance overhead of tokio-proto #149
Comments
Just wanted to say thanks for the detailed report! Note that it's definitely the intention for tokio-proto to be as fast as possible, so the report is much appreciated :)
OK, so taking a look into this, the first thing I noticed is that the tokio and tokio-pipeline benchmarks are doing pretty different operations. The tokio server was just a plain echo server, while the tokio-pipeline server was doing integer parsing and such. I made a few changes to bring the two in line.

With those changes I'm locally seeing the tokio server take 7µs/call and the proto server take 8µs/call. I started out at 6µs/call and 9µs/call, so not quite the double discrepancy you saw yourself! I'm running Linux 4.4.0 with an i7-4770. I'm curious, how do those changes affect the benchmarks on your own machine?
Ah, yes, the benchmarks have evolved a little over time, and I haven't been good enough about backporting the changes. In particular, here are the numbers before (mine) and after (alexcrichton) applying your changes:

```
$ (mine) target/release/tokio; target/release/tokio-proto-pipeline; target/release/tokio-proto-multiplex; target/release/tarpc; target/release/memcached
tokio                   7µs/call
tokio-proto-pipeline   12µs/call
tokio-proto-multiplex  17µs/call
tarpc                  20µs/call
memcached              14µs/call

$ (alexcrichton) target/release/tokio; target/release/tokio-proto-pipeline; target/release/tokio-proto-multiplex; target/release/tarpc; target/release/memcached
tokio                   8µs/call
tokio-proto-pipeline   12µs/call
tokio-proto-multiplex  16µs/call
tarpc                  21µs/call
memcached              14µs/call
```

So it doesn't seem to make too much of a difference. You're right that parsing a number and boxing a future makes a slight difference, but there's definitely something more fundamental going on here.
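To get a feel for how small that "slight difference" is, here is a self-contained sketch (plain std Rust, not the actual benchmark code from this issue) that times copying an echo payload versus parsing it as an integer; both costs are nanosecond-scale, far below the µs-scale syscall roundtrip:

```rust
use std::hint::black_box;
use std::time::Instant;

// The extra per-request work the tokio-pipeline server was doing.
fn parse_one(msg: &[u8]) -> u64 {
    std::str::from_utf8(msg).unwrap().parse().unwrap()
}

fn main() {
    const N: u32 = 1_000_000;
    let msg = b"12345";

    // Cost of what a plain echo server does per request: copy the bytes back out.
    let mut out = [0u8; 5];
    let start = Instant::now();
    for _ in 0..N {
        out.copy_from_slice(msg);
        black_box(&out);
    }
    let copy_ns = start.elapsed().as_nanos() / N as u128;

    // Cost of parsing the payload as an integer instead.
    let start = Instant::now();
    for _ in 0..N {
        black_box(parse_one(msg));
    }
    let parse_ns = start.elapsed().as_nanos() / N as u128;

    // Both are a handful of ns/op, dwarfed by a µs-scale request/response cycle.
    println!("copy: {copy_ns}ns/op, parse: {parse_ns}ns/op");
}
```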
OK, thanks for confirming! Taking a look at some runs of perf, I don't see anything obvious per se popping out at me. In general these sorts of overheads are constant, so if you have a 100µs request/response cycle time then there's unlikely to be any real meaningful difference between tokio, tokio-proto, or whatever. I believe @carllerche mentioned that not much work has gone into optimizing tokio-proto-multiplex, so that may explain that discrepancy. One other final point may be that right now the proto servers are using …
I've also had lower throughput than expected in my app, but found out that's because of this constant:

`tokio-proto/src/streaming/multiplex/server.rs`, line 113 (at 596005a)

You might want to set it higher than 32 and test again. In my case the difference was, AFAIR, 20k vs 200k.
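As a back-of-the-envelope check of why a fixed in-flight window caps throughput (the 160µs roundtrip below is a hypothetical figure, not a measurement from this issue):

```rust
// Little's law: sustained throughput ≤ in-flight requests / roundtrip time.
fn max_throughput(in_flight: f64, roundtrip_secs: f64) -> f64 {
    in_flight / roundtrip_secs
}

fn main() {
    let rtt = 160e-6; // hypothetical 160µs request/response cycle

    // One outstanding request at a time:
    println!("window=1:  {:.0} req/s", max_throughput(1.0, rtt)); // 6250

    // A dispatch buffer of 32 in-flight requests, as discussed above:
    println!("window=32: {:.0} req/s", max_throughput(32.0, rtt)); // 200000
}
```

Under these assumptions, raising the window by 10x raises the throughput ceiling by 10x, which matches the order-of-magnitude 20k vs 200k jump reported above.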
@rozaliev this test was of roundtrip latency, not throughput.
I've been using `tarpc`, which builds on `tokio` and `tokio-proto`, in a high-performance project recently, in which we benchmark against `memcached`. We were seeing lower throughput than expected, which caused me to start digging into where the performance overhead was coming from (with a lot of help from @tikue). This brought me down a deep rabbit hole, but uncovered some interesting data that I figured I'd share here.

To help with profiling, I built several micro-benchmarks that use incrementally higher-level libraries for doing a very simple continuous ping-pong RPC call. The first uses `tokio-core` directly, the second `tokio-proto::pipeline`, the third `tokio-proto::multiplex`, and the last builds on top of `tarpc`. For comparison, I have also included a relatively unoptimized `memcached` client.

Except for the memcached benchmark, all the others use a single reactor core shared by both the server and the client, driven by a single thread, to minimize any cross-thread and cross-process overheads. The benchmark runs over TCP using IPv4, and consists of doing a fixed number of trivial RPC calls, one at a time, and then reporting the average latency per call. For memcached, the server is obviously in a separate process (started with `-t 1`), but the client otherwise behaves the same.

The numbers I'm seeing on my laptop (a Lenovo X1 Carbon, Intel i7-5600U CPU @ 2.60GHz, Linux 4.9.11) are as follows:
I was surprised to see that `tokio-proto` introduces such significant overhead, and that `multiplex` does so much worse (relatively speaking) than `pipeline`. To put these numbers into perspective, the latency of pure `tokio` translates into ~170k RPCs/second for one core (which I believe is quite close to what it could theoretically do given system call overheads and such), whereas multiplexed `tokio-proto` is closer to 55k RPCs/s (again, per core). That is about 3x. I don't know if performance has been a goal thus far (I'm guessing probably not, as the fundamentals are still being worked out), but 3x sounds like something it should be possible to shave a bit off.

In the spirit of starting that process, I've also added a profiling script for the aforementioned benchmarks that produces both `perf` reports and flamegraphs (given below). Each higher layer adds some overhead (which, to be fair, is to be expected), and hopefully this data and the benchmarks can help identify which of those overheads it may be possible to trim down. It would be awesome if `tokio-proto` were not just a library that is nice to work with, but also one that is blazingly fast. We're quite close to that, but shaving off those last µs would make `tokio-proto` seriously impressive!

Flamegraphs:

- `tokio`
- `tokio-proto::pipeline`
- `tokio-proto::multiplex`
- `tarpc`
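The profiling script itself isn't reproduced here, but the usual perf + FlameGraph pipeline it would resemble looks roughly like this (binary and file names are illustrative, and the FlameGraph scripts are Brendan Gregg's, assumed to be on `PATH`):

```shell
# Record call stacks while the benchmark runs (DWARF unwinding for Rust frames).
perf record --call-graph dwarf -o tokio.data -- target/release/tokio

# A plain-text perf report...
perf report -i tokio.data --stdio > tokio-perf.txt

# ...and a flamegraph SVG from the same recording.
perf script -i tokio.data | stackcollapse-perf.pl | flamegraph.pl > tokio.svg
```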