end-to-end anonymous seeding and download performance test #2548
This 2016 issue overlaps with the newer issue #3325. |
I've created and run a local performance test based on LXC containers.
My machine is a Xeon W-2133 @ 3.6 GHz. Direct download (without a router) is 150 MBytes/sec. Our anonymous download speed is completely obliterated by Python performance on the router. EDIT: |
To estimate the limits of Twisted networking performance, I did a small packet forwarding test with LXC containers and
It is interesting to note that |
Great to hear Python+Twisted does not need replacing. |
Another simple experiment: CPU usage when bombing Tribler
CPU performance is given as the sum of the CPU usage of all relevant subprocesses. Tribler was tested with 3 different types of packets, including a "random" packet. Aside from the potential pitfall of exception handling, it is highly improbable for a network application to perform better when it both receives and sends packets than when it only receives them. Therefore, the performance bottleneck in the path of a packet lies somewhere between socket input and the packet handler. |
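For reference, a "bombing" sender of the kind described can be as simple as the sketch below; the target address, port, and payload size here are illustrative assumptions, not the experiment's actual parameters.

```python
# Minimal UDP "bombing" sender sketch; address, port, and payload size
# are illustrative assumptions, not the experiment's actual parameters.
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = os.urandom(1024)  # a "random" packet

while True:
    sock.sendto(payload, ("127.0.0.1", 8090))
```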
One bottleneck identified: each active community adds about 50% of CPU load in a 100 Mbit/s UDP bombing test. The problem is in `notify_listeners`:

```python
def notify_listeners(self, packet):
    for listener in self._listeners:
        if listener.use_main_thread:
            reactor.callFromThread(self._deliver_later, listener, packet)
        elif reactor.running:
            reactor.callInThread(self._deliver_later, listener, packet)
```

This is very inefficient because each community (listener) does its own pattern matching. It is improbable that a single packet would be addressed to several communities at once, and even in that case it should be processed only by the targeted communities. We need to implement a central tree-like packet demultiplexer, or something like that (see the sketch below). |
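For illustration, a central demultiplexer could route each packet to exactly one listener with a single lookup instead of notifying every community. A minimal sketch; the 22-byte prefix and the `on_packet` method are assumptions modeled on IPv8-style community prefixes, not the actual Dispersy API:

```python
# Sketch of a central packet demultiplexer: one dict lookup per packet
# instead of notifying every community. The 22-byte prefix and the
# on_packet() listener method are assumptions, not the real Dispersy API.
class PacketDemultiplexer:
    def __init__(self):
        self._routes = {}  # packet prefix -> listener

    def register(self, prefix, listener):
        self._routes[prefix] = listener

    def notify_listeners(self, packet):
        source_address, data = packet
        listener = self._routes.get(data[:22])
        if listener is not None:
            listener.on_packet(packet)  # direct call, no thread hop
```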
A quick hack that stops notifying all communities except the hidden tunnel community after 15 seconds of running Tribler increases torrent download speed from 0.9 MByte/s to 3.9 MByte/s. (The hack must be applied to both the leecher and the exit node; see the sketch below.) The exit node's CPU usage drops to 98%. Of all the Twisted threads, only one still generates 20% CPU load; the other threads stay idle. The rest of the load is generated by the main thread, I presume. Apparently, we have now hit another bottleneck. |
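The hack amounts to something like the following sketch; the names `endpoint._listeners` and `tunnel_listener` are assumptions, not the actual Tribler code.

```python
# Sketch of the described hack (names are assumptions): 15 seconds after
# startup, stop notifying every listener except the hidden tunnel community.
from twisted.internet import reactor

def apply_hack(endpoint, tunnel_listener):
    def keep_only_tunnel():
        endpoint._listeners = [tunnel_listener]
    reactor.callLater(15, keep_only_tunnel)
```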
Another experiment: measure the pure bandwidth of the exit node without a leecher. Judging from some screenshots from KCachegrinding a yappi profile of the exit node in this mode, it's now mostly SSL stuff. |
Impressive progress! Smart twist in this experiment design! The seeder is replaced by a UDP sender that blasts a fixed test pattern at the exit node at a fixed rate, without any congestion control; the exit node then uses full onion encryption to deliver the test pattern to the leecher. The leecher attempts to decipher the incoming Tor-like encrypted data into BitTorrent traffic. What would 4 cores running 4 seeders plus 4 cores running 4 leechers do? The single exit node and real congestion control would hint at the bottleneck. |
Experiment: multiple leechers vs multiple seeders:
These results show that we have bottlenecks in both the leecher and the exit node. Most probably, there are 2 bottlenecks: one in the upload direction (ACK packets from leecher to seeder), and another in the download direction (DATA packets from seeder to leecher). |
A superficial analysis of the profiler charts revealed another potential problem: we recreate the crypto object every time we need to encrypt or decrypt something. This includes a very costly cipher initialization in OpenSSL that, according to the profiler, is responsible for 17% of the exit node's total CPU usage. UPDATE: we're using GMAC mode with AES encryption, so if we want each packet to be independent of the others, we must use a different IV for every packet/encryption. No low-hanging fruit there. Still, the constant re-initialization calls are worth revisiting (see the sketch below). |
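A sketch of the cheaper pattern, assuming the `cryptography` package: the expensive key setup is done once and only the nonce (IV) varies per packet. This mirrors the idea, not Tribler's actual tunnel crypto.

```python
# Sketch: initialize the AES-GCM key schedule once, vary only the nonce
# (IV) per packet. Assumes the `cryptography` package; not Tribler's code.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)  # the costly initialization, done once per circuit

def seal_packet(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)  # fresh IV keeps packets independent
    return nonce + aead.encrypt(nonce, plaintext, None)

def open_packet(blob: bytes) -> bytes:
    return aead.decrypt(blob[:12], blob[12:], None)
```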
Profiler screenshots show a mysterious "cycle 10" encompassing 80% of CPU ticks (probably Twisted-related). EDIT: according to @devos50, "cycle 10" is Twisted itself. Mystery solved. |
Mental note: an earlier discovery was that Twisted "thread pools" cost performance. By moving to the main thread, numerous context switches are avoided and performance could also double here. @qstokkink @devos50 |
Currently, we poll sockets ourselves in an endless loop (roughly the pattern sketched below). |
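An illustrative sketch of that pattern; this is not the actual Tribler loop.

```python
# Illustrative sketch of the "poll sockets ourselves in an endless loop"
# pattern; not the actual Tribler code.
import select

def poll_loop(sockets, handle_packet):
    while True:
        readable, _, _ = select.select(sockets, [], [], 0.05)
        for sock in readable:
            data, addr = sock.recvfrom(65535)
            handle_packet(addr, data)
```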
@synctext, tried that. Disabling the "thread pools" does not affect performance in my tests, but it removes the "unknown" objects in the profiler and makes profiling Twisted manageable. |
Your comments touch upon a great many Python and Twisted nuances and intricacies; let me try to organize this into a shortlist:
Final points:
|
@qstokkink, thanks for the detailed explanation! |
@ichorid we did not have the time, nor the indication that we should do this before. If you want, you can try implementing a full Twisted-based endpoint. |
UDP bombing with crypto disabled: 70 MBytes/s on the exit node. |
While UDP bombing the exit node, its memory usage remains constant. However, the leecher's memory usage grows very fast, as if it internally buffers all the packets it can't process. Even when the stream of packets ends, the leecher's CPU usage remains 100% for a long time, until it finishes processing the buffered data. Its memory usage drops somewhat along the way. This happens even with crypto disabled. |
@qstokkink, when I add this line, the download starts and dies almost immediately. |
@synctext, it's really a shame that uvloop does not work as a drop-in replacement for the reactor loop on Windows ((( |
We have seen good enough benchmark performance for encrypted tunnel throughput (14 Mbytes/s). However, torrent download speed is 4 times worse. Why? |
Just got this when doing direct calls:
|
The dictionary size change is due to multiple "threads" accessing the same dictionary. We could use something like in IPv8:

```python
from threading import RLock

lock = RLock()

def synchronized(f):
    def wrapper(self, *args, **kwargs):
        with lock:
            return f(self, *args, **kwargs)
    return wrapper
```

And then annotate the class methods with `@synchronized`. |
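Usage would look something like this (the class and method names are hypothetical):

```python
# Hypothetical usage of the synchronized decorator above: every method
# touching the shared dictionary takes the lock, so no more
# "dictionary changed size during iteration".
class CircuitTable:
    def __init__(self):
        self.circuits = {}

    @synchronized
    def add(self, circuit_id, circuit):
        self.circuits[circuit_id] = circuit

    @synchronized
    def remove(self, circuit_id):
        self.circuits.pop(circuit_id, None)
```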
That's one option. Another one is to investigate what lines access the dictionary and try to remove that access completely. Or shift all the usages into a single "thread". BTW, this would help with further isolating IPv8 and preparing it to move to a separate process. |
OK, the communication scheme between the old Dispersy listener and the new IPv8 listener is completely broken and brain-dead. It messed up the results of my old experiments. |
The effect of different call methods:

So, it's safe to use direct calls. Indeed, Twisted is magic. |
@ichorid Cool: you just fixed 10-year-old crappy code; when are we getting the PR? |
8 MBytes/sec as leecher! Impressive. A PR and a Jenkins job would be most welcome. We could institute a rule that future code contributions must not reduce anon download speed, with performance regression tests on all supported platforms. |
I have just reimplemented the IPv8 endpoint with Twisted's own networking primitives. P.S. |
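For context, a fully Twisted-based UDP endpoint follows the shape below. A minimal sketch, not the actual IPv8 implementation; the port number and handler are illustrative.

```python
# Minimal sketch of a Twisted-native UDP endpoint; not the actual
# IPv8 implementation.
from twisted.internet import reactor
from twisted.internet.protocol import DatagramProtocol

class TwistedEndpoint(DatagramProtocol):
    def __init__(self, on_packet):
        self.on_packet = on_packet

    def datagramReceived(self, data, addr):
        # Delivered on the reactor thread: no callInThread context switch.
        self.on_packet((addr, data))

    def send(self, addr, packet):
        self.transport.write(packet, addr)

if __name__ == "__main__":
    reactor.listenUDP(8090, TwistedEndpoint(print))  # port is illustrative
    reactor.run()
```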
The latest related scientific work in this area, from OSDI 2018 (still server-based; not suitable for YouTube-like systems): Karaoke: Distributed Private Messaging Immune to Passive Traffic Analysis.

> Like Stadium, Karaoke is distributed over many machines, and must ensure that malicious servers do not |
From https://magic.io/blog/uvloop-blazing-fast-python-networking/
From https://www.nexedi.com/NXD-Document.Blog.UVLoop.Python.Benchmark: This tells us the bottleneck is typically not the reactor loop, but Python itself. |
How would these graphs look for UDP usage? uvloop at 80,000 req/sec with 10 KiB payloads = 800 MByte/sec, correct? |
@synctext, yes, that is 800 MBytes/sec. It is known that in simple networking benchmarks Go is typically 2-3 times slower than bare C/C++. That correlates well with other results obtained with libuv, e.g. CJDNS (~1 GByte/s without auth/encryption). |
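A quick back-of-the-envelope check of that figure:

```python
# Back-of-the-envelope check of the throughput figure above.
print(80_000 * 10 * 1024 / 1e6)  # -> 819.2, i.e. roughly 800 MBytes/sec
```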
We lack a structured performance evaluation suite. The content of #3325 is not available in an easy-to-use dashboard (#4999). For related work, see the beautiful writeup at the top NSDI conference; zero deployment of these ideas, just simulation code: https://github.com/tschorsch/nstor. Hard to beat that with our hard work. |
Since @egbertbouman is the person responsible for the new tunnels upgrade using Rust, and there are already good signs of progress there, I'm unassigning myself and assigning him. |
Privacy with usability is the core goal of our ongoing 11-year effort.
This end-to-end test is designed to identify problems in our whole Tribler chain, from session key exchange to rendezvous peer discovery.
Scenario: