Transport Performance in the wild #2586
This was also tested on a 50 Mbps connection and performed similarly. It seems like QUIC is getting at most half the throughput of TCP, regardless of the bandwidth of the underlying connection. What congestion control algorithm are we implementing?
@Stebalien quic-go is currently using New Reno. This is definitely not a congestion control issue; every congestion controller should be able to saturate a pipe. I set up Linode servers in NJ and Frankfurt, and I'm getting results similar to @vyzo's (I'm using a 100MB file). TCP:
QUIC:
It looks like TCP is about 2.5x faster than QUIC in these tests. As far as I can see, this seems to be the issue I described in https://docs.google.com/document/d/1JWOpigjvM79OqmNn5Ja_RpuQZGQfIm8QYpeR-5So9Lo/. Setting the kernel buffer sizes as suggested in the first section of that document on both nodes leads to the following result:
Now QUIC is (roughly) as fast as TCP. The problem here (see #2255) is that an application can't modify the maximum receive buffer size (that requires root privileges), and the default size is too small for high-bandwidth links like the one tested here. I'm not sure how to solve this problem.
The streams test is interesting. I'm not sure what the issue is there; it doesn't seem to be related to packetization or congestion control. Maybe it's a flow control issue. I'll investigate.
Thanks @marten-seemann. The UDP receive buffer size is a reasonable explanation, and it's very unfortunate that it can't be set by applications.
We're not completely powerless here. We can set the buffer size up to the maximum the kernel allows.
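For illustration, a minimal sketch (not quic-go's actual code) of how an application can request a larger UDP receive buffer and still be subject to the kernel cap; the 4 MiB figure is an arbitrary example:

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen on an arbitrary UDP port; in a real application this would be
	// the socket handed to the QUIC implementation.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ask for a large receive buffer. On Linux the kernel silently caps the
	// value at net.core.rmem_max, which only root can raise via sysctl, so
	// the effective size may be much smaller than requested.
	if err := conn.SetReadBuffer(4 << 20); err != nil { // request 4 MiB
		log.Printf("SetReadBuffer failed: %v", err)
	}
}
```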
Do we know how Chrome deals with this?
Good question. I don't really know. Maybe they're using a different API that allows them to read packets from the buffer more frequently than Go does? Or maybe they don't care that much about bandwidths > 100 Mbit/s?
Eh? There's a reason research into better congestion control algorithms is ongoing. But in this case, you're probably right. (Posting publicly so we have a record of the discussion.) What about using REUSEPORT to open multiple sockets and listen on all of them? According to https://blog.cloudflare.com/how-to-receive-a-million-packets/, this should improve performance, as each socket gets a separate receive buffer.
I'm not sure I understand how REUSEPORT would work with UDP. What we could do though is to listen on multiple ports, and use Server Preferred Address to ask clients to migrate to those.
This article is interesting, thanks for pointing me there. I also noticed that pinning the sending goroutine to a CPU improves multi-core performance. The problem here is that, as a library, I feel uncomfortable making this pinning decision; the application seems like the more appropriate place to decide it. However, the syscall has to be made from the goroutine that wishes to be pinned.
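For reference, a rough sketch of what application-level pinning could look like on Linux, using runtime.LockOSThread together with sched_setaffinity from golang.org/x/sys/unix; the helper name pinToCPU and the choice of CPU 0 are just for illustration:

```go
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU pins the calling goroutine's OS thread to the given CPU.
// It must be called from the goroutine that wants to be pinned.
func pinToCPU(cpu int) error {
	// Keep this goroutine on its current OS thread so the affinity sticks.
	runtime.LockOSThread()

	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	go func() {
		if err := pinToCPU(0); err != nil {
			log.Printf("pinning failed: %v", err)
		}
		// ... the send loop would run here ...
	}()
	select {}
}
```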
After reading a bit more, I think I understand the concept now. You'd have multiple UDP listeners on the same port. I'll play around with that a bit. My fear is that packets would be distributed randomly over the different listeners, leading to a high degree of (perceived) reordering, which in turn would trigger loss recovery. But I'll have to confirm that with an experiment.
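A minimal sketch of the SO_REUSEPORT idea, assuming Linux and golang.org/x/sys/unix; the port 4242 and the count of four listeners are arbitrary placeholders:

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort opens a UDP socket with SO_REUSEPORT set, so several
// sockets can bind the same address and the kernel spreads incoming
// packets (and their receive buffers) across them.
func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	// Open four listeners on the same (illustrative) port.
	for i := 0; i < 4; i++ {
		pc, err := listenReusePort(":4242")
		if err != nil {
			log.Fatal(err)
		}
		defer pc.Close()
		// In a real program, each listener would feed its own receive goroutine.
	}
}
```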
We may have to tune loss recovery as well. I assume it is based on time, not just reordering.
@Stebalien Loss recovery uses reordering thresholds in both packet number space and time space.
I think the load balancing is system-dependent.
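For context, RFC 9002 suggests both a packet threshold and a time threshold for declaring a packet lost; the following is only a sketch of those default values (quic-go's actual tuning may differ):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative defaults from RFC 9002: a packet is declared lost once it is
// at least packetThreshold packets, or timeThreshold * max(smoothedRTT,
// latestRTT), older than the newest acknowledged packet.
const (
	packetThreshold  = 3
	timeThresholdNum = 9 // time threshold = 9/8 of the RTT estimate
	timeThresholdDen = 8
)

func lossTimeThreshold(smoothedRTT, latestRTT time.Duration) time.Duration {
	rtt := smoothedRTT
	if latestRTT > rtt {
		rtt = latestRTT
	}
	return rtt * timeThresholdNum / timeThresholdDen
}

func main() {
	// Example: with a ~30-35 ms RTT, the time threshold is just under 40 ms.
	fmt.Println(lossTimeThreshold(30*time.Millisecond, 35*time.Millisecond))
}
```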
I've run these tests on localhost with a 1 MiB rmem and it doesn't appear to make a difference, so there are probably multiple bottlenecks. Unless the following wasn't the correct approach:
Other optimizations to try:
ReadBatch (connection needs to be a UDPConn for this to work):
This didn't help in my testing, but then again, nothing I did seemed to make a difference.
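Since the snippet above isn't reproduced here, the following is only a generic sketch of a batched receive loop using ReadBatch from golang.org/x/net/ipv4 on a plain *net.UDPConn; the batch size and buffer size are arbitrary:

```go
package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	udpConn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
	if err != nil {
		log.Fatal(err)
	}
	defer udpConn.Close()

	// Wrap the UDP socket so we can use the batch syscalls (recvmmsg on Linux).
	pc := ipv4.NewPacketConn(udpConn)

	// Pre-allocate a batch of messages, each with its own buffer.
	const batchSize = 16
	msgs := make([]ipv4.Message, batchSize)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 1500)}
	}

	for {
		// ReadBatch fills up to len(msgs) messages in one syscall and
		// returns how many were actually received.
		n, err := pc.ReadBatch(msgs, 0)
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < n; i++ {
			packet := msgs[i].Buffers[0][:msgs[i].N]
			_ = packet // hand the packet off to the packet handler here
		}
	}
}
```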
I've been trying to use this library recently and have also encountered this issue. Tests: I wrote and ran some very dirty tests locally to determine whether the bottleneck is in the networking stack, the Go syscall interface, or within this library itself. I compared several setups; the sending process wrote the data in 4K chunks in all instances (to make things a little easier to compare). I didn't test UDP via localhost because of its lossy nature. The ring buffer implementation is based on this one, which is in turn based on this one. Results:
I'd expect the difference between E and F to be a lot smaller. Obviously the lack of the TCP logic gives E an advantage, but not by that much of a margin. Additionally, it's my understanding that QUIC is about as good as TCP on non-lossy, uncongested links. As an aside, I also played with the kcp-go library and found it to be on par with the current performance of this library. Root cause? My conclusion is that although the networking layer under QUIC has an impact, it's not the main bottleneck for achieving near TLS/TCP performance. Hence reading packets in batches, as discussed above, will help, but probably not by a lot. I did some profiling to try and find the root cause, but ran out of time to look at this. Without a good way to copy and paste the data, here's a screenshot from the profiler. It seems to indicate to me that the bottleneck is within the library itself. Hope that's helpful.
I've cleaned up my test code and pushed it to GitHub. After refactoring, the numbers from the test results are slightly different, but ultimately the gap and the conclusions remain the same.
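As a rough illustration only (the author's actual test code lives in the linked repo), a sender that writes a payload in 4K chunks over an arbitrary net.Conn and reports throughput might look like this; the address 127.0.0.1:9000 and the 1 GiB payload are placeholders:

```go
package main

import (
	"log"
	"net"
	"time"
)

// sendInChunks writes total bytes to conn in 4 KiB chunks and returns
// the achieved throughput in MB/s.
func sendInChunks(conn net.Conn, total int) (float64, error) {
	chunk := make([]byte, 4096)
	start := time.Now()
	for sent := 0; sent < total; sent += len(chunk) {
		if _, err := conn.Write(chunk); err != nil {
			return 0, err
		}
	}
	elapsed := time.Since(start).Seconds()
	return float64(total) / elapsed / 1e6, nil
}

func main() {
	// Illustrative only: connect to a local sink listening on TCP port 9000.
	conn, err := net.Dial("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	mbps, err := sendInChunks(conn, 1<<30) // 1 GiB
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("throughput: %.1f MB/s", mbps)
}
```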
Of course this is a totally unfair comparison because quic-go won't send packets larger than ~1400 bytes and TCP will max out the MTU.
Really, if you're running on localhost, there's no way QUIC will beat TCP. How could it? You're comparing an application running in user space with an in-kernel TCP implementation. You can't expect any performance benefits unless you're running over real network infrastructure.
Closing this issue, since the performance problems that @vyzo reported have been fixed in the meantime.
Thanks for the insights. I would have expected it to perform similarly to in-kernel TCP plus user-space TLS, i.e. within single-digit percentage points. I'll continue my work in another direction.
I wrote a couple of simple programs to test libp2p transport performance, and specifically compare QUIC to TCP (see https://github.com/vyzo/libp2p-perf-test).
Running a server on a Linode in NJ and a client on a Linode in Frankfurt, I observed the following:
Specifically, I used a 1GiB file with random data, with the timings below.
Transferring the file 3 times over TCP vs. QUIC:
Transferring the file twice using 2 parallel streams:
cc @Stebalien