
Transport Performance in the wild #2586

Closed · vyzo opened this issue Jun 4, 2020 · 20 comments

@vyzo commented Jun 4, 2020

I wrote a couple of simple programs to test libp2p transport performance, and specifically compare QUIC to TCP (see https://github.com/vyzo/libp2p-perf-test).

Running a server on a Linode in NJ and a client on a Linode in Frankfurt, I observed the following:

  • QUIC is almost 3x slower than TCP
  • Adding more streams slows down QUIC transfers even further.

Specifically, I used a 1GiB file with random data, with the timings below.

Transferring the file 3 times over TCP vs. QUIC:

root@li1494-172:~# for x in {1..3}; do ./go/bin/test-client /ip4/50.116.48.114/tcp/4001/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR; done
2020/06/04 18:23:39 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:23:39 Connected; requesting data...
2020/06/04 18:23:39 Transfering data...
2020/06/04 18:24:09 Received 1073741824 bytes in 30.149238941s
2020/06/04 18:24:09 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:24:10 Connected; requesting data...
2020/06/04 18:24:10 Transfering data...
2020/06/04 18:24:47 Received 1073741824 bytes in 37.456968339s
2020/06/04 18:24:48 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:24:48 Connected; requesting data...
2020/06/04 18:24:48 Transfering data...
2020/06/04 18:25:17 Received 1073741824 bytes in 29.308343925s
root@li1494-172:~# for x in {1..3}; do ./go/bin/test-client /ip4/50.116.48.114/udp/4001/quic/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR; done
2020/06/04 18:25:32 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:25:32 Connected; requesting data...
2020/06/04 18:25:32 Transfering data...
2020/06/04 18:27:17 Received 1073741824 bytes in 1m44.911661928s
2020/06/04 18:27:18 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:27:18 Connected; requesting data...
2020/06/04 18:27:18 Transfering data...
2020/06/04 18:28:52 Received 1073741824 bytes in 1m34.259246794s
2020/06/04 18:28:52 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 18:28:52 Connected; requesting data...
2020/06/04 18:28:52 Transfering data...
2020/06/04 18:30:35 Received 1073741824 bytes in 1m42.629025709s

Transferring the file twice using 2 parallel streams:

root@li1494-172:~# ./go/bin/test-client -streams 2 /ip4/50.116.48.114/tcp/4001/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:21:54 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:21:55 Connected; requesting data...
2020/06/04 19:21:55 Transferring data in 2 parallel streams
2020/06/04 19:22:52 Received 2147483648 bytes in 57.743506072s
root@li1494-172:~# ./go/bin/test-client -streams 2 /ip4/50.116.48.114/udp/4001/quic/p2p/QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:23:04 Connecting to QmUgxn8vaVgVDnSxM6qhQzyeXTYXesjJM7iG9TomZjpPcR
2020/06/04 19:23:05 Connected; requesting data...
2020/06/04 19:23:05 Transferring data in 2 parallel streams
2020/06/04 19:26:48 Received 2147483648 bytes in 3m43.026014572s

cc @Stebalien

@Stebalien

This was also tested on a 50 Mbps connection and performed similarly. It seems like QUIC is getting at most half the throughput of TCP, regardless of the bandwidth of the underlying connection.

What congestion control algorithm are we implementing?

@marten-seemann (Member)

@Stebalien quic-go is currently using New Reno. This is definitely not a congestion control issue; every congestion controller should be able to saturate a pipe.

I set up Linode servers in NJ and Frankfurt, and I'm getting similar results to @vyzo's (I'm using a 100 MB file).

TCP:

for x in {1..3}; do test-client /ip4/172.104.238.12/tcp/4001/p2p/QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5; done
2020/06/05 03:59:23 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:24 Connected; requesting data...
2020/06/05 03:59:24 Transfering data...
2020/06/05 03:59:27 Received 107373568 bytes in 3.454417986s
2020/06/05 03:59:27 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:28 Connected; requesting data...
2020/06/05 03:59:28 Transfering data...
2020/06/05 03:59:31 Received 107373568 bytes in 3.487911302s
2020/06/05 03:59:31 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:59:32 Connected; requesting data...
2020/06/05 03:59:32 Transfering data...
2020/06/05 03:59:35 Received 107373568 bytes in 3.453340903s

QUIC:

for x in {1..3}; do test-client /ip4/172.104.238.12/udp/4001/quic/p2p/QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8; done
2020/06/05 03:55:11 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:12 Connected; requesting data...
2020/06/05 03:55:12 Transfering data...
2020/06/05 03:55:20 Received 107373568 bytes in 8.067990447s
2020/06/05 03:55:20 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:20 Connected; requesting data...
2020/06/05 03:55:20 Transfering data...
2020/06/05 03:55:28 Received 107373568 bytes in 8.268315672s
2020/06/05 03:55:30 Connecting to QmerTaap6DtY1my87HBAxHTY2AnppZD7gVp81tWVzus5Q8
2020/06/05 03:55:30 Connected; requesting data...
2020/06/05 03:55:30 Transfering data...
2020/06/05 03:55:39 Received 107373568 bytes in 8.797650973s

It looks like TCP is about 2.5x faster than QUIC in these tests.

As far as I can see, this seems to be the issue I described in https://docs.google.com/document/d/1JWOpigjvM79OqmNn5Ja_RpuQZGQfIm8QYpeR-5So9Lo/. Setting the kernel buffer sizes on both nodes, as suggested in the first section of that document, leads to the following result:

for x in {1..3}; do test-client /ip4/172.104.238.12/udp/4001/quic/p2p/QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5; done
2020/06/05 03:58:30 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:30 Connected; requesting data...
2020/06/05 03:58:30 Transfering data...
2020/06/05 03:58:34 Received 107373568 bytes in 4.104785229s
2020/06/05 03:58:35 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:35 Connected; requesting data...
2020/06/05 03:58:35 Transfering data...
2020/06/05 03:58:38 Received 107373568 bytes in 3.695985142s
2020/06/05 03:58:39 Connecting to QmRTAnTAw52LvCMoiLNoqursFdtBB667T7bg1MwhiAAyM5
2020/06/05 03:58:39 Connected; requesting data...
2020/06/05 03:58:39 Transfering data...
2020/06/05 03:58:43 Received 107373568 bytes in 3.95789164s

Now QUIC is (roughly) as fast as TCP.

The problem (see #2255) here is that an application can't modify the maximum receive buffer size (this requires root privileges), and the default size is too small for high-bandwidth links like the one tested here. I'm not sure how to solve this problem.

@marten-seemann (Member)

The streams test is interesting. I'm not sure what the issue is there; it doesn't seem to be related to packetization or congestion control. Maybe it's a flow control issue. I'll investigate.

@vyzo (Author) commented Jun 5, 2020

Thanks @marten-seemann. The UDP receive buffer size is a reasonable explanation, and it's very unfortunate that it can't be raised by applications.

@marten-seemann (Member)

We're not completely powerless here. We can set the buffer size up to the maximum (net.core.rmem_max), and we can query how large it currently is. So what we could do is output a warning message if the buffer size is too small, along the lines of the sketch below. Not ideal in any way, but at least it's something.
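
To illustrate the idea (a minimal, Linux-specific sketch with a hypothetical helper name, not quic-go's actual code): the application requests a larger buffer with `SetReadBuffer`, reads back the effective size via `SO_RCVBUF`, and warns when the kernel capped the request.

```go
package main

import (
	"log"
	"net"
	"syscall"
)

// requestReadBuffer asks the kernel for a larger UDP receive buffer and warns
// if the request was capped (on Linux, SetReadBuffer is silently limited to
// net.core.rmem_max for unprivileged processes).
func requestReadBuffer(conn *net.UDPConn, desired int) {
	if err := conn.SetReadBuffer(desired); err != nil {
		log.Printf("setting receive buffer failed: %v", err)
		return
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		return
	}
	var reported int
	raw.Control(func(fd uintptr) {
		reported, _ = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	})
	// Linux reports twice the configured value to account for bookkeeping overhead.
	if reported/2 < desired {
		log.Printf("UDP receive buffer is only %d bytes (wanted %d); consider increasing net.core.rmem_max",
			reported/2, desired)
	}
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 4001})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	requestReadBuffer(conn, 2<<20) // ask for 2 MiB
}
```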

@lucas-clemente (Member)

Do we know how Chrome deals with this?

@marten-seemann (Member)

Good question. I don't really know. Maybe they're using a different API that allows them to read packets from the buffer more frequently than Go does? Or maybe they don't care that much about bandwidths > 100 Mbit/s?

@Stebalien commented Jun 5, 2020

This is definitely not a congestion control issue; every congestion controller should be able to saturate a pipe.

Eh? There's a reason research into better congestion control algorithms is ongoing.

But in this case, you're probably right.


(posting publicly so we have a record of the discussion)

What about using SO_REUSEPORT to open and listen on multiple sockets? According to https://blog.cloudflare.com/how-to-receive-a-million-packets/, this should improve performance, as each socket will get a separate receive buffer.

@marten-seemann (Member)

What about using SO_REUSEPORT to open and listen on multiple sockets?

I'm not sure I understand how REUSEPORT would work with UDP. What we could do though is to listen on multiple ports, and use Server Preferred Address to ask clients to migrate to those.

According to https://blog.cloudflare.com/how-to-receive-a-million-packets/, this should improve performance as each socket will get a separate receive buffer.

This article is interesting, thanks for pointing me there. I also noticed that pinning the sending goroutine to a CPU improves multi-core performance. The problem is that, as a library, I feel uncomfortable making this pinning decision; the application would be the more appropriate place to decide this. However, the affinity syscall has to be made from the goroutine that wishes to be pinned (see the sketch below).
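
For reference, pinning in Go looks roughly like this (a hypothetical, Linux-only sketch using golang.org/x/sys/unix, not something quic-go does today): the affinity syscall only affects the calling OS thread, so the goroutine has to lock itself to its thread first, which is why the call has to happen in the sending goroutine itself.

```go
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU. It must be called from the goroutine to be pinned.
func pinToCPU(cpu int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := pinToCPU(0); err != nil {
			log.Printf("pinning failed: %v", err)
			return
		}
		// ... the send loop would run here, pinned to CPU 0 ...
	}()
	<-done
}
```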

@marten-seemann (Member)

I'm not sure I understand how REUSEPORT would work with UDP. What we could do though is to listen on multiple ports, and use Server Preferred Address to ask clients to migrate to those.

After reading a bit more, I think I understand the concept now: you'd have multiple UDP listeners on the same port. I'll play around with that a bit. My fear is that packets would be distributed randomly over the different listeners, leading to a high degree of (perceived) reordering, which in turn would trigger loss recovery. But I'll have to confirm that with an experiment.
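
Setting up such listeners is straightforward (a minimal, Linux-only sketch with a hypothetical helper, not the code from this repo): each socket bound with SO_REUSEPORT gets its own receive buffer, and on Linux the kernel picks the socket for an incoming UDP packet by hashing the address/port 4-tuple, so packets from one remote address should consistently land on the same socket.

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort binds a UDP socket with SO_REUSEPORT set, so that several
// sockets (each with its own receive buffer) can share the same port.
func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var soErr error
			if err := c.Control(func(fd uintptr) {
				soErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return soErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	conns := make([]net.PacketConn, 0, 2)
	for i := 0; i < 2; i++ {
		conn, err := listenReusePort(":4001")
		if err != nil {
			log.Fatal(err)
		}
		conns = append(conns, conn)
	}
	// ... each conn would be served by its own read loop ...
	_ = conns
}
```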

@Stebalien

We may have to tune loss recovery as well. I assume it's based on time, not just reordering.

@marten-seemann (Member)

@Stebalien Loss recovery uses reordering thresholds in both packet number space and time.
I played around with SO_REUSEPORT, and it looks like it might be able to improve multi-connection performance (see #2597). As packets are deterministically routed by their remote address, this won't have any effect on single-connection performance.
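
For context, the two loss conditions from RFC 9002, reduced to a sketch (this is not quic-go's actual implementation): an unacknowledged packet that was sent before an already-acknowledged one is declared lost if it trails the largest acknowledged packet number by at least 3 (kPacketThreshold), or if it was sent more than 9/8 × max(smoothed RTT, latest RTT) ago.

```go
// Simplified sketch of RFC 9002 loss detection (not quic-go's implementation).
package loss

import "time"

const packetThreshold = 3 // kPacketThreshold

// timeThreshold is kTimeThreshold (9/8) times the larger of the smoothed RTT
// and the most recent RTT sample.
func timeThreshold(smoothedRTT, latestRTT time.Duration) time.Duration {
	rtt := smoothedRTT
	if latestRTT > rtt {
		rtt = latestRTT
	}
	return rtt * 9 / 8
}

// isLost reports whether a still-unacknowledged packet should be declared
// lost, given that largestAcked has already been acknowledged.
func isLost(pn, largestAcked uint64, sentTime, now time.Time, smoothedRTT, latestRTT time.Duration) bool {
	if pn > largestAcked {
		return false // only packets older than an acked packet can be declared lost
	}
	reordered := largestAcked-pn >= packetThreshold
	timedOut := now.Sub(sentTime) >= timeThreshold(smoothedRTT, latestRTT)
	return reordered || timedOut
}
```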

@Stebalien

I think the load balancing is system dependent.

@Stebalien

I've run these tests on localhost with a 1 MiB rmem and it doesn't appear to make a difference, so there are probably multiple bottlenecks. Unless the following wasn't the correct approach:

sudo sysctl -w net.core.rmem_max=$(( 1024 * 1024 ))
sudo sysctl -w net.core.rmem_default=$(( 1024 * 1024 ))

@Stebalien

Other optimizations to try:

ReadBatch (connection needs to be a UDPConn for this to work):

diff --git a/packet_handler_map.go b/packet_handler_map.go
index acce56c0..5b5abfe6 100644
--- a/packet_handler_map.go
+++ b/packet_handler_map.go
@@ -13,6 +13,7 @@ import (
 	"github.com/lucas-clemente/quic-go/internal/protocol"
 	"github.com/lucas-clemente/quic-go/internal/utils"
 	"github.com/lucas-clemente/quic-go/internal/wire"
+	"golang.org/x/net/ipv4"
 )
 
 type statelessResetErr struct {
@@ -241,17 +242,30 @@ func (h *packetHandlerMap) close(e error) error {
 
 func (h *packetHandlerMap) listen() {
 	defer close(h.listening)
+
+	c := ipv4.NewPacketConn(h.conn.(*net.UDPConn))
+	msgs := make([]ipv4.Message, 100)
+	bufs := make([]*packetBuffer, 100)
+	for i := range bufs {
+		bufs[i] = getPacketBuffer()
+	}
+	for i, buf := range bufs {
+		msgs[i].Buffers = [][]byte{buf.Data[:protocol.MaxReceivePacketSize]}
+	}
 	for {
-		buffer := getPacketBuffer()
-		data := buffer.Data[:protocol.MaxReceivePacketSize]
 		// The packet size should not exceed protocol.MaxReceivePacketSize bytes
 		// If it does, we only read a truncated packet, which will then end up undecryptable
-		n, addr, err := h.conn.ReadFrom(data)
+		count, err := c.ReadBatch(msgs, 0)
+		for i := 0; i < count; i++ {
+			h.handlePacket(msgs[i].Addr, bufs[i], msgs[i].Buffers[0][:msgs[i].N])
+			newBuf := getPacketBuffer()
+			bufs[i] = newBuf
+			msgs[i].Buffers[0] = newBuf.Data[:protocol.MaxReceivePacketSize]
+		}
 		if err != nil {
 			h.close(err)
 			return
 		}
-		h.handlePacket(addr, buffer, data[:n])
 	}
 }

This didn't help in my testing, but then again, nothing I did seemed to make a difference.

@astrolox commented Jul 6, 2021

I've been trying to use this library recently and have also encountered this issue.

Tests

I wrote and ran some quick-and-dirty tests locally to determine whether the bottleneck is in the networking stack, the Go syscall interface, or within this library itself.

I compared
A. sending 2GB via TCP via localhost with 65k MTU
B. sending 2GB via QUIC via localhost with 65k MTU
C. sending 2GB via TLS via TCP via localhost with 65k MTU
D. sending 2GB via memory ring buffer pipe with 1500 MTU
E. sending 2GB via TLS via an in memory ring buffer pipe with 1500 MTU
F. sending 2GB via QUIC via an in memory ring buffer pipe with 1500 MTU

The sending process wrote the data in 4K chunks in all instances (to make things a little easier to compare).
The receiving process counted and discarded the data.

I didn't test UDP via localhost because of its lossy nature.

The ring buffer implementation is based on this one, which is in turn based on this one.
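
The measurement in each case had roughly the following shape (a hypothetical sketch for illustration only; the actual harness is published later in this thread): write the payload in 4 KiB chunks on one side, count and discard it on the other, and derive throughput from the elapsed time.

```go
// Hypothetical sketch of the measurement loop: send ~2 GB in 4 KiB chunks,
// count and discard on the receiving side, and report throughput.
package speedtest

import (
	"io"
	"log"
	"time"
)

const (
	totalBytes = 2 << 30 // 2 GiB payload
	chunkSize  = 4 << 10 // 4 KiB writes
)

// send writes the payload in fixed-size chunks to w (a TCP, TLS, QUIC-stream
// or ring-buffer pipe in the tests above).
func send(w io.Writer) error {
	chunk := make([]byte, chunkSize)
	for sent := int64(0); sent < totalBytes; sent += chunkSize {
		if _, err := w.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}

// receive counts and discards everything read from r until EOF and logs the
// resulting throughput.
func receive(r io.Reader) error {
	start := time.Now()
	n, err := io.Copy(io.Discard, r)
	if err != nil {
		return err
	}
	elapsed := time.Since(start)
	log.Printf("received %d bytes in %s (%.1f MiB/s)", n, elapsed,
		float64(n)/elapsed.Seconds()/(1<<20))
	return nil
}
```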

Results

| Test | Time | ~ Speed | Thoughts |
| --- | --- | --- | --- |
| A. TCP localhost | 2.581s | 811 MiB/s | OK baseline (expected better) |
| B. QUIC localhost | 7.432s | 282 MiB/s | ~60% slower than TLS+TCP |
| C. TLS+TCP localhost | 3.083s | 679 MiB/s | ~16% slower than just TCP |
| D. Ring buffer | 479.575ms | 4.3 GiB/s | OK baseline (expected better) |
| E. TLS ring buffer (no TCP) | 2.601s | 805 MiB/s | Approx. the same as using the network |
| F. QUIC ring buffer | 6.124s | 342 MiB/s | ~60% slower than just TLS |

I'd expect the difference between E and F to be a lot smaller. Obviously the lack of TCP logic gives E an advantage, but not by that much of a margin. Additionally, it's my understanding that QUIC should be on par with TCP on non-lossy, uncongested links.

As an aside, I also played with the kcp-go library and found it to be on par with the current performance of this library.

Root Cause?

My conclusion is that although the networking layer under QUIC has an impact, it's not the main bottleneck for achieving near TLS/TCP performance. Hence reading packets in batches, as discussed above, will help, but probably not by much.

I did some profiling to try to find the root cause, but ran out of time to dig further. Without a good way to copy and paste this data, here's a screenshot from the profiler. It seems to indicate that github.com/lucas-clemente/quic-go.(*packetPacker).PackPacket needs optimization.

[CPU profiler screenshot]

Hope that's helpful.
I'll try to tidy up the testing code and share it if you think that could be useful.

@astrolox commented Jul 8, 2021

I've cleaned up my test code and pushed it to github.
https://github.com/k42-software/go-stream-speed

After refactoring, the numbers from the test results are slightly different, but ultimately the gap and the conclusions remain the same.

@marten-seemann (Member)

A. sending 2GB via TCP via localhost with 65k MTU
B. sending 2GB via QUIC via localhost with 65k MTU

Of course this is a totally unfair comparison because quic-go won't send packets larger than ~1400 bytes and TCP will max out the MTU.

C. sending 2GB via TLS via TCP via localhost with 65k MTU
D. sending 2GB via memory ring buffer pipe with 1500 MTU
E. sending 2GB via TLS via an in memory ring buffer pipe with 1500 MTU
F. sending 2GB via QUIC via an in memory ring buffer pipe with 1500 MTU

Really, if you're running on localhost, there's no way QUIC will beat TCP. How could it? You're comparing an application running in user space with an in-kernel TCP implementation. You can't expect any performance benefits unless you're running over real network infrastructure.

@marten-seemann (Member)

Closing this issue, since the performance problems that @vyzo reported have been fixed in the meantime.

@astrolox commented Jul 8, 2021

Thanks for the insights.

I would have expected it to perform similarly to in-kernel TCP + user-space TLS, i.e. within single-digit percentage points.

I'll continue my work in another direction.
