
Transport Performance in the wild #2586

Closed · vyzo opened this issue Jun 4, 2020 · 20 comments

@vyzo commented Jun 4, 2020

I wrote a couple of simple programs to test libp2p transport performance, and specifically compare QUIC to TCP (see https://github.com/vyzo/libp2p-perf-test).

Running a server on a Linode in NJ and a client on a Linode in Frankfurt, I observed the following:

  • QUIC is almost 3x slower than TCP
  • Adding more streams slows down QUIC transfers even further.

Specifically, I used a 1GiB file with random data, with the timings below.

Transferring the file 3 times over TCP vs. QUIC:

root@li1494-172:~# for x in {1..3}; do ./go/bin/test-client /ip4/50.116.48.114/tcp/4001/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR; done
2020/06/04 18:23:39 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:23:39 Connected; requesting data...
2020/06/04 18:23:39 Transfering data...
2020/06/04 18:24:09 Received 1073741824 bytes in 30.149238941s
2020/06/04 18:24:09 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:24:10 Connected; requesting data...
2020/06/04 18:24:10 Transfering data...
2020/06/04 18:24:47 Received 1073741824 bytes in 37.456968339s
2020/06/04 18:24:48 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:24:48 Connected; requesting data...
2020/06/04 18:24:48 Transfering data...
2020/06/04 18:25:17 Received 1073741824 bytes in 29.308343925s
root@li1494-172:~# for x in {1..3}; do ./go/bin/test-client /ip4/50.116.48.114/udp/4001/quic/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR; done
2020/06/04 18:25:32 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:25:32 Connected; requesting data...
2020/06/04 18:25:32 Transfering data...
2020/06/04 18:27:17 Received 1073741824 bytes in 1m44.911661928s
2020/06/04 18:27:18 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:27:18 Connected; requesting data...
2020/06/04 18:27:18 Transfering data...
2020/06/04 18:28:52 Received 1073741824 bytes in 1m34.259246794s
2020/06/04 18:28:52 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:28:52 Connected; requesting data...
2020/06/04 18:28:52 Transfering data...
2020/06/04 18:30:35 Received 1073741824 bytes in 1m42.629025709s

Transferring the file twice using 2 parallel streams:

root@li1494-172:~# ./go/bin/test-client -streams 2 /ip4/50.116.48.114/tcp/4001/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:21:54 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:21:55 Connected; requesting data...
2020/06/04 19:21:55 Transferring data in 2 parallel streams
2020/06/04 19:22:52 Received 2147483648 bytes in 57.743506072s
root@li1494-172:~# ./go/bin/test-client -streams 2 /ip4/50.116.48.114/udp/4001/quic/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:23:04 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:23:05 Connected; requesting data...
2020/06/04 19:23:05 Transferring data in 2 parallel streams
2020/06/04 19:26:48 Received 2147483648 bytes in 3m43.026014572s

cc @Stebalien

@Stebalien

This was also tested on a 50 Mbps connection and performed similarly. It seems like QUIC is getting at most half the throughput of TCP, regardless of the bandwidth of the underlying connection.

What congestion control algorithm are we implementing?

@marten-seemann (Member)

@Stebalien quic-go is currently using New Reno. This is definitely not a congestion control issue; every congestion controller should be able to saturate a pipe.

I set up Linode servers in NJ and Frankfurt, and I'm getting similar results to @vyzo's (I'm using a 100 MB file).

TCP:

for x in {1..3}; do test-client /ip4/172.104.238.12/tcp/4001/p2p/QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5; done
2020/06/05 03:59:23 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:24 Connected; requesting data...
2020/06/05 03:59:24 Transfering data...
2020/06/05 03:59:27 Received 107373568 bytes in 3.454417986s
2020/06/05 03:59:27 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:28 Connected; requesting data...
2020/06/05 03:59:28 Transfering data...
2020/06/05 03:59:31 Received 107373568 bytes in 3.487911302s
2020/06/05 03:59:31 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:32 Connected; requesting data...
2020/06/05 03:59:32 Transfering data...
2020/06/05 03:59:35 Received 107373568 bytes in 3.453340903s

QUIC:

for x in {1..3}; do test-client /ip4/172.104.238.12/udp/4001/quic/p2p/QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8; done
2020/06/05 03:55:11 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:12 Connected; requesting data...
2020/06/05 03:55:12 Transfering data...
2020/06/05 03:55:20 Received 107373568 bytes in 8.067990447s
2020/06/05 03:55:20 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:20 Connected; requesting data...
2020/06/05 03:55:20 Transfering data...
2020/06/05 03:55:28 Received 107373568 bytes in 8.268315672s
2020/06/05 03:55:30 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:30 Connected; requesting data...
2020/06/05 03:55:30 Transfering data...
2020/06/05 03:55:39 Received 107373568 bytes in 8.797650973s

It looks like TCP is about 2.5x faster than QUIC in these tests.

As far as I can see, this seems to be the issue I described in https://docs.google.com/document/d/1JWOpigjvM79OqmNn5Ja_RpuQZGQfIm8QYpeR-5So9Lo/. Setting the kernel buffer sizes on both nodes, as suggested in the first section of that document, leads to the following result:

for x in {1..3}; do test-client /ip4/172.104.238.12/udp/4001/quic/p2p/QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5; done
2020/06/05 03:58:30 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:30 Connected; requesting data...
2020/06/05 03:58:30 Transfering data...
2020/06/05 03:58:34 Received 107373568 bytes in 4.104785229s
2020/06/05 03:58:35 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:35 Connected; requesting data...
2020/06/05 03:58:35 Transfering data...
2020/06/05 03:58:38 Received 107373568 bytes in 3.695985142s
2020/06/05 03:58:39 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:39 Connected; requesting data...
2020/06/05 03:58:39 Transfering data...
2020/06/05 03:58:43 Received 107373568 bytes in 3.95789164s

Now QUIC is (roughly) as fast as TCP.

The problem (see #2255) here is that an application can't modify the maximum receive buffer size (this requires root privileges), and the default size is too small for high-bandwidth links like the one tested here. I'm not sure how to solve this problem.

@marten-seemann (Member)

The streams test is interesting. I'm not sure what the issue is there; it doesn't seem to be related to packetization or congestion control. Maybe it's a flow control issue. I'll investigate.

@vyzo (Author) commented Jun 5, 2020

Thanks @marten-seemann. The UDP receive buffer size is a reasonable explanation, and it's very unfortunate that it can't be raised by applications.

@marten-seemann (Member)

We're not completely powerless here. We can set the buffer size up to the maximum (net.core.rmem_max), and we can query how large it currently is. So what we could do is output a warning message if the buffer size is too small, along the lines of the sketch below. Not ideal in any way, but at least it's something.
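
To illustrate the idea (a minimal, Linux-specific sketch with a hypothetical helper name, not quic-go's actual code): the application requests a larger buffer with `SetReadBuffer`, reads back the effective size via `SO_RCVBUF`, and warns when the kernel capped the request.

```go
package main

import (
	"log"
	"net"
	"syscall"
)

// requestReadBuffer asks the kernel for a larger UDP receive buffer and warns
// if the request was capped (on Linux, SetReadBuffer is silently limited to
// net.core.rmem_max for unprivileged processes).
func requestReadBuffer(conn *net.UDPConn, desired int) {
	if err := conn.SetReadBuffer(desired); err != nil {
		log.Printf("setting receive buffer failed: %v", err)
		return
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		return
	}
	var reported int
	raw.Control(func(fd uintptr) {
		reported, _ = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	})
	// Linux reports twice the configured value to account for bookkeeping overhead.
	if reported/2 < desired {
		log.Printf("UDP receive buffer is only %d bytes (wanted %d); consider increasing net.core.rmem_max",
			reported/2, desired)
	}
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 4001})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	requestReadBuffer(conn, 2<<20) // ask for 2 MiB
}
```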

@lucas-clemente (Member)

Do we know how Chrome deals with this?

@marten-seemann (Member)

Good question. I don't really know. Maybe they're using a different API that allows them to read packets from the buffer more frequently than Go does? Or maybe they don't care that much about bandwidths > 100 Mbit/s?

@Stebalien commented Jun 5, 2020

This is definitely not a congestion control issue; every congestion controller should be able to saturate a pipe.

Eh? There's a reason research into better congestion control algorithms is ongoing.

But in this case, you're probably right.


(posting publicly so we have a record of the discussion)

What about using SO_REUSEPORT to open and listen on multiple sockets? According to https://blog.cloudflare.com/how-to-receive-a-million-packets/, this should improve performance, as each socket will get a separate receive buffer.

@marten-seemann (Member)

What about using SO_REUSEPORT to open and listen on multiple sockets?

I'm not sure I understand how REUSEPORT would work with UDP. What we could do though is to listen on multiple ports, and use Server Preferred Address to ask clients to migrate to those.

According to https://blog.cloudflare.com/how-to-receive-a-million-packets/, this should improve performance as each socket will get a separate receive buffer.

This article is interesting, thanks for pointing me there. I also noticed that pinning the sending goroutine to a CPU improves multi-core performance. The problem is that, as a library, I feel uncomfortable making this pinning decision; the application would be the more appropriate place to decide this. However, the affinity syscall has to be made from the goroutine that wishes to be pinned (see the sketch below).
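
For reference, pinning in Go looks roughly like this (a hypothetical, Linux-only sketch using golang.org/x/sys/unix, not something quic-go does today): the affinity syscall only affects the calling OS thread, so the goroutine has to lock itself to its thread first, which is why the call has to happen in the sending goroutine itself.

```go
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU. It must be called from the goroutine to be pinned.
func pinToCPU(cpu int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := pinToCPU(0); err != nil {
			log.Printf("pinning failed: %v", err)
			return
		}
		// ... the send loop would run here, pinned to CPU 0 ...
	}()
	<-done
}
```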

@marten-seemann (Member)

I'm not sure I understand how REUSEPORT would work with UDP. What we could do though is to listen on multiple ports, and use Server Preferred Address to ask clients to migrate to those.

After reading a bit more, I think I understand the concept now: you'd have multiple UDP listeners on the same port. I'll play around with that a bit. My fear is that packets would be distributed randomly over the different listeners, leading to a high degree of (perceived) reordering, which in turn would trigger loss recovery. But I'll have to confirm that with an experiment.
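
Setting up such listeners is straightforward (a minimal, Linux-only sketch with a hypothetical helper, not the code from this repo): each socket bound with SO_REUSEPORT gets its own receive buffer, and on Linux the kernel picks the socket for an incoming UDP packet by hashing the address/port 4-tuple, so packets from one remote address should consistently land on the same socket.

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort binds a UDP socket with SO_REUSEPORT set, so that several
// sockets (each with its own receive buffer) can share the same port.
func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var soErr error
			if err := c.Control(func(fd uintptr) {
				soErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return soErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	conns := make([]net.PacketConn, 0, 2)
	for i := 0; i < 2; i++ {
		conn, err := listenReusePort(":4001")
		if err != nil {
			log.Fatal(err)
		}
		conns = append(conns, conn)
	}
	// ... each conn would be served by its own read loop ...
	_ = conns
}
```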

@Stebalien

We may have to tune loss recovery as well. I assume it's based on time, not just reordering.

@marten-seemann (Member)

@Stebalien Loss recovery uses reordering thresholds in both packet number space and time.
I played around with SO_REUSEPORT, and it looks like it might be able to improve multi-connection performance (see #2597). As packets are deterministically routed by their remote address, this won't have any effect on single-connection performance.
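
For context, the two loss conditions from RFC 9002, reduced to a sketch (this is not quic-go's actual implementation): an unacknowledged packet that was sent before an already-acknowledged one is declared lost if it trails the largest acknowledged packet number by at least 3 (kPacketThreshold), or if it was sent more than 9/8 × max(smoothed RTT, latest RTT) ago.

```go
// Simplified sketch of RFC 9002 loss detection (not quic-go's implementation).
package loss

import "time"

const packetThreshold = 3 // kPacketThreshold

// timeThreshold is kTimeThreshold (9/8) times the larger of the smoothed RTT
// and the most recent RTT sample.
func timeThreshold(smoothedRTT, latestRTT time.Duration) time.Duration {
	rtt := smoothedRTT
	if latestRTT > rtt {
		rtt = latestRTT
	}
	return rtt * 9 / 8
}

// isLost reports whether a still-unacknowledged packet should be declared
// lost, given that largestAcked has already been acknowledged.
func isLost(pn, largestAcked uint64, sentTime, now time.Time, smoothedRTT, latestRTT time.Duration) bool {
	if pn > largestAcked {
		return false // only packets older than an acked packet can be declared lost
	}
	reordered := largestAcked-pn >= packetThreshold
	timedOut := now.Sub(sentTime) >= timeThreshold(smoothedRTT, latestRTT)
	return reordered || timedOut
}
```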

@Stebalien

I think the load balancing is system dependent.

@Stebalien

I've run these tests on localhost with a 1 MiB rmem and it doesn't appear to make a difference, so there are probably multiple bottlenecks. Unless the following wasn't the correct approach:

sudo sysctl -w net.core.rmem_max=$(( 1024 * 1024 ))
sudo sysctl -w net.core.rmem_default=$(( 1024 * 1024 ))

@Stebalien

Other optimizations to try:

ReadBatch (connection needs to be a UDPConn for this to work):

diff --git a/packet_handler_map.go b/packet_handler_map.go
index acce56c0..5b5abfe6 100644
--- a/packet_handler_map.go
+++ b/packet_handler_map.go
@@ -13,6 +13,7 @@ import (
 	"github.com/lucas-clemente/quic-go/internal/protocol"
 	"github.com/lucas-clemente/quic-go/internal/utils"
 	"github.com/lucas-clemente/quic-go/internal/wire"
+	"golang.org/x/net/ipv4"
 )
 
 type statelessResetErr struct {
@@ -241,17 +242,30 @@ func (h *packetHandlerMap) close(e error) error {
 
 func (h *packetHandlerMap) listen() {
 	defer close(h.listening)
+
+	c := ipv4.NewPacketConn(h.conn.(*net.UDPConn))
+	msgs := make([]ipv4.Message, 100)
+	bufs := make([]*packetBuffer, 100)
+	for i := range bufs {
+		bufs[i] = getPacketBuffer()
+	}
+	for i, buf := range bufs {
+		msgs[i].Buffers = [][]byte{buf.Data[:protocol.MaxReceivePacketSize]}
+	}
 	for {
-		buffer := getPacketBuffer()
-		data := buffer.Data[:protocol.MaxReceivePacketSize]
 		// The packet size should not exceed protocol.MaxReceivePacketSize bytes
 		// If it does, we only read a truncated packet, which will then end up undecryptable
-		n, addr, err := h.conn.ReadFrom(data)
+		count, err := c.ReadBatch(msgs, 0)
+		for i := 0; i < count; i++ {
+			h.handlePacket(msgs[i].Addr, bufs[i], msgs[i].Buffers[0][:msgs[i].N])
+			newBuf := getPacketBuffer()
+			bufs[i] = newBuf
+			msgs[i].Buffers[0] = newBuf.Data[:protocol.MaxReceivePacketSize]
+		}
 		if err != nil {
 			h.close(err)
 			return
 		}
-		h.handlePacket(addr, buffer, data[:n])
 	}
 }

This didn't help in my testing, but then again, nothing I did seemed to make a difference.

@astrolox commented Jul 6, 2021

I've been trying to use this library recently and have also encountered this issue.

Tests

I wrote and ran some quick-and-dirty tests locally to determine whether the bottleneck is in the networking stack, the Go syscall interface, or within this library itself.

I compared
A. sending 2GB via TCP via localhost with 65k MTU
B. sending 2GB via QUIC via localhost with 65k MTU
C. sending 2GB via TLS via TCP via localhost with 65k MTU
D. sending 2GB via memory ring buffer pipe with 1500 MTU
E. sending 2GB via TLS via an in memory ring buffer pipe with 1500 MTU
F. sending 2GB via QUIC via an in memory ring buffer pipe with 1500 MTU

The sending process wrote the data in 4K chunks in all instances (to make things a little easier to compare).
The receiving process counted and discarded the data.

I didn't test UDP via localhost because of its lossy nature.

The ring buffer implementation is based on this one, which is in turn based on this one.
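
The measurement in each case had roughly the following shape (a hypothetical sketch for illustration only; the actual harness is published later in this thread): write the payload in 4 KiB chunks on one side, count and discard it on the other, and derive throughput from the elapsed time.

```go
// Hypothetical sketch of the measurement loop: send ~2 GB in 4 KiB chunks,
// count and discard on the receiving side, and report throughput.
package speedtest

import (
	"io"
	"log"
	"time"
)

const (
	totalBytes = 2 << 30 // 2 GiB payload
	chunkSize  = 4 << 10 // 4 KiB writes
)

// send writes the payload in fixed-size chunks to w (a TCP, TLS, QUIC-stream
// or ring-buffer pipe in the tests above).
func send(w io.Writer) error {
	chunk := make([]byte, chunkSize)
	for sent := int64(0); sent < totalBytes; sent += chunkSize {
		if _, err := w.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}

// receive counts and discards everything read from r until EOF and logs the
// resulting throughput.
func receive(r io.Reader) error {
	start := time.Now()
	n, err := io.Copy(io.Discard, r)
	if err != nil {
		return err
	}
	elapsed := time.Since(start)
	log.Printf("received %d bytes in %s (%.1f MiB/s)", n, elapsed,
		float64(n)/elapsed.Seconds()/(1<<20))
	return nil
}
```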

Results

| Test | Time | ~ Speed | Thoughts |
| --- | --- | --- | --- |
| A. TCP localhost | 2.581s | 811 MiB/s | OK baseline (expected better) |
| B. QUIC localhost | 7.432s | 282 MiB/s | ~60% slower than TLS+TCP |
| C. TLS+TCP localhost | 3.083s | 679 MiB/s | ~16% slower than just TCP |
| D. Ring buffer | 479.575ms | 4.3 GiB/s | OK baseline (expected better) |
| E. TLS ring buffer (no TCP) | 2.601s | 805 MiB/s | Approx. the same as using the network |
| F. QUIC ring buffer | 6.124s | 342 MiB/s | ~60% slower than just TLS |

I'd expect the difference between E and F to be a lot smaller. Obviously the lack of TCP logic gives E an advantage, but not by that much of a margin. Additionally, it's my understanding that QUIC should be on par with TCP on non-lossy, uncongested links.

As an aside, I also played with the kcp-go library and found it to be on par with the current performance of this library.

Root Cause?

My conclusion is that although the networking layer under QUIC has an impact, it's not the main bottleneck for achieving near TLS/TCP performance. Hence reading packets in batches, as discussed above, will help, but probably not by much.

I did some profiling to try to find the root cause, but ran out of time to dig further. Without a good way to copy and paste this data, here's a screenshot from the profiler. It seems to indicate that github.com/lucas-clemente/quic-go.(*packetPacker).PackPacket needs optimization.

[CPU profiler screenshot]

Hope that's helpful.
I'll try to tidy up the testing code and share it if you think that could be useful.

@astrolox commented Jul 8, 2021

I've cleaned up my test code and pushed it to github.
https://github.com/k42-software/go-stream-speed

After refactoring, the numbers from the test results are slightly different, but ultimately the gap and the conclusions remain the same.

@marten-seemann (Member)

A. sending 2GB via TCP via localhost with 65k MTU
B. sending 2GB via QUIC via localhost with 65k MTU

Of course this is a totally unfair comparison because quic-go won't send packets larger than ~1400 bytes and TCP will max out the MTU.

C. sending 2GB via TLS via TCP via localhost with 65k MTU
D. sending 2GB via memory ring buffer pipe with 1500 MTU
E. sending 2GB via TLS via an in memory ring buffer pipe with 1500 MTU
F. sending 2GB via QUIC via an in memory ring buffer pipe with 1500 MTU

Really, if you're running on localhost, there's no way QUIC will beat TCP. How could it? You're comparing an application running in user space with an in-kernel TCP implementation. You can't expect any performance benefits unless you're running over real network infrastructure.

@marten-seemann (Member)

Closing this issue, since the performance problems that @vyzo reported have been fixed in the meantime.

@astrolox commented Jul 8, 2021

Thanks for the insights.

I would have expected it to perform similarly to in-kernel TCP + user-space TLS, i.e. within single-digit percentage points.

I'll continue my work in another direction.
