end-to-end anonymous seeding and download performance test #2548
This 2016 issue overlaps with the newer issue #3325. |
I've created and run a local performance test based on LXC containers.
My machine is a Xeon W-2133 @ 3.6 GHz. Direct download (without a router) is 150 MBytes/sec. Our anonymous download speed is completely obliterated by Python performance on the router. EDIT: |
To estimate the limits of Twisted networking performance, I did a small packet forwarding test with LXC containers and
It is interesting to note that |
Great to hear Python+Twisted does not need replacing. |
Another simple experiment: CPU usage when bombing Tribler
CPU performance is given as the sum of the CPU usage of all relevant subprocesses. Tribler was tested with 3 different types of packets, including a "random" packet. Aside from the potential pitfall of exception handling, it is highly improbable for a network application to perform better when it both receives and sends packets than when it only receives them. Therefore, the performance bottleneck in the path of a packet lies somewhere between socket input and the packet handler. |
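For reference, a "bombing" sender of the kind described can be as simple as the sketch below; the target address, port, and payload size here are illustrative assumptions, not the experiment's actual parameters.

```python
# Minimal UDP "bombing" sender sketch; address, port, and payload size
# are illustrative assumptions, not the experiment's actual parameters.
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = os.urandom(1024)  # a "random" packet

while True:
    sock.sendto(payload, ("127.0.0.1", 8090))
```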
One bottleneck identified: each active community adds about 50% of CPU load in a 100 Mbit/s UDP bombing test. The problem is in `notify_listeners`:

```python
def notify_listeners(self, packet):
    for listener in self._listeners:
        if listener.use_main_thread:
            reactor.callFromThread(self._deliver_later, listener, packet)
        elif reactor.running:
            reactor.callInThread(self._deliver_later, listener, packet)
```

This is very inefficient because each community (listener) does its own pattern matching. It is improbable that a single packet would be addressed to several communities at once, and even in that case it should be processed only by the targeted communities. We need to implement a central tree-like packet demultiplexer, or something like that (see the sketch below). |
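For illustration, a central demultiplexer could route each packet to exactly one listener with a single lookup instead of notifying every community. A minimal sketch; the 22-byte prefix and the `on_packet` method are assumptions modeled on IPv8-style community prefixes, not the actual Dispersy API:

```python
# Sketch of a central packet demultiplexer: one dict lookup per packet
# instead of notifying every community. The 22-byte prefix and the
# on_packet() listener method are assumptions, not the real Dispersy API.
class PacketDemultiplexer:
    def __init__(self):
        self._routes = {}  # packet prefix -> listener

    def register(self, prefix, listener):
        self._routes[prefix] = listener

    def notify_listeners(self, packet):
        source_address, data = packet
        listener = self._routes.get(data[:22])
        if listener is not None:
            listener.on_packet(packet)  # direct call, no thread hop
```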
A quick hack that stops notifying all communities except the hidden tunnel community after 15 seconds of running Tribler increases torrent download speed from 0.9 MByte/s to 3.9 MByte/s. (The hack must be applied to both the leecher and the exit node; see the sketch below.) The exit node's CPU usage drops to 98%. Of all the Twisted threads, only one still generates 20% CPU load; the other threads stay idle. The rest of the load is generated by the main thread, I presume. Apparently, we have now hit another bottleneck. |
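The hack amounts to something like the following sketch; the names `endpoint._listeners` and `tunnel_listener` are assumptions, not the actual Tribler code.

```python
# Sketch of the described hack (names are assumptions): 15 seconds after
# startup, stop notifying every listener except the hidden tunnel community.
from twisted.internet import reactor

def apply_hack(endpoint, tunnel_listener):
    def keep_only_tunnel():
        endpoint._listeners = [tunnel_listener]
    reactor.callLater(15, keep_only_tunnel)
```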
Another experiment: measure the pure bandwidth of the exit node without a leecher. Judging from some screenshots from KCachegrinding a yappi profile of the exit node in this mode, it's now mostly SSL stuff. |
Impressive progress! Smart twist in this experiment design! The seeder is replaced by a UDP sender that blasts a fixed test pattern at the exit node at a fixed rate, without any congestion control; the exit node then uses full onion encryption to deliver the test pattern to the leecher. The leecher attempts to decipher the incoming Tor-like encrypted data into BitTorrent traffic. What would 4 cores running 4 seeders plus 4 cores running 4 leechers do? The single exit node and real congestion control would hint at the bottleneck. |
Experiment: multiple leechers vs multiple seeders:
These results show that we have bottlenecks in both the leecher and the exit node. Most probably, there are 2 bottlenecks: one in the upload direction (ACK packets from leecher to seeder), and another in the download direction (DATA packets from seeder to leecher). |
A superficial analysis of the profiler charts revealed another potential problem: we recreate the crypto object every time we need to encrypt or decrypt something. This includes a very costly cipher initialization in OpenSSL that, according to the profiler, is responsible for 17% of the exit node's total CPU usage. UPDATE: we're using GMAC mode with AES encryption, so if we want each packet to be independent of the others, we must use a different IV for every packet/encryption. No low-hanging fruit there. Still, the constant re-initialization calls are worth revisiting (see the sketch below). |
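A sketch of the cheaper pattern, assuming the `cryptography` package: the expensive key setup is done once and only the nonce (IV) varies per packet. This mirrors the idea, not Tribler's actual tunnel crypto.

```python
# Sketch: initialize the AES-GCM key schedule once, vary only the nonce
# (IV) per packet. Assumes the `cryptography` package; not Tribler's code.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)  # the costly initialization, done once per circuit

def seal_packet(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)  # fresh IV keeps packets independent
    return nonce + aead.encrypt(nonce, plaintext, None)

def open_packet(blob: bytes) -> bytes:
    return aead.decrypt(blob[:12], blob[12:], None)
```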
Profiler screenshots show a mysterious "cycle 10" encompassing 80% of CPU ticks (probably Twisted-related). EDIT: according to @devos50, "cycle 10" is Twisted itself. Mystery solved. |
Mental note: an earlier discovery was that Twisted "thread pools" cost performance. By moving to the main thread, numerous context switches are avoided and performance could also double here. @qstokkink @devos50 |
Currently, we poll sockets ourselves in an endless loop (roughly the pattern sketched below). |
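An illustrative sketch of that pattern; this is not the actual Tribler loop.

```python
# Illustrative sketch of the "poll sockets ourselves in an endless loop"
# pattern; not the actual Tribler code.
import select

def poll_loop(sockets, handle_packet):
    while True:
        readable, _, _ = select.select(sockets, [], [], 0.05)
        for sock in readable:
            data, addr = sock.recvfrom(65535)
            handle_packet(addr, data)
```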
@synctext, tried that. Disabling the "thread pools" does not affect performance in my tests, but it removes the "unknown" objects in the profiler and makes profiling Twisted manageable. |
Your comments touch upon a great many Python and Twisted nuances and intricacies; let me try to organize this into a shortlist:
Final points:
|
@qstokkink, thanks for the detailed explanation! |
@ichorid we did not have the time, nor the indication that we should do this before. If you want, you can try implementing a full Twisted-based endpoint. |
UDP bombing with crypto disabled: 70 MBytes/s on the exit node. |
While UDP bombing the exit node, its memory usage remains constant. However, the leecher's memory usage grows very fast, as if it internally buffers all the packets it can't process. Even when the stream of packets ends, the leecher's CPU usage remains 100% for a long time, until it finishes processing the buffered data. Its memory usage drops somewhat along the way. This happens even with crypto disabled. |
@qstokkink, when I add this line, the download starts and dies almost immediately. |
@synctext, it's really a shame that uvloop does not work as a drop-in replacement for the reactor loop on Windows ((( |
We have seen good enough benchmark performance for encrypted tunnel throughput (14 Mbytes/s). However, torrent download speed is 4 times worse. Why? |
Just got this when doing direct calls:
|
The dictionary size change is due to multiple "threads" accessing the same dictionary. We could use something like in IPv8:

```python
from threading import RLock

lock = RLock()

def synchronized(f):
    def wrapper(self, *args, **kwargs):
        with lock:
            return f(self, *args, **kwargs)
    return wrapper
```

And then annotate the class methods with `@synchronized`. |
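Usage would look something like this (the class and method names are hypothetical):

```python
# Hypothetical usage of the synchronized decorator above: every method
# touching the shared dictionary takes the lock, so no more
# "dictionary changed size during iteration".
class CircuitTable:
    def __init__(self):
        self.circuits = {}

    @synchronized
    def add(self, circuit_id, circuit):
        self.circuits[circuit_id] = circuit

    @synchronized
    def remove(self, circuit_id):
        self.circuits.pop(circuit_id, None)
```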
That's one option. Another one is to investigate what lines access the dictionary and try to remove that access completely. Or shift all the usages into a single "thread". BTW, this would help with further isolating IPv8 and preparing it to move to a separate process. |
OK, the communication scheme between the old Dispersy listener and the new IPv8 listener is completely broken and brain-dead. It messed up the results of my old experiments. |
The effect of different call methods:

So, it's safe to use direct calls. Indeed, Twisted is magic. |
@ichorid Cool: you just fixed 10-year-old crappy code; when are we getting the PR? |
8 MBytes/sec as leecher! Impressive. A PR and a Jenkins job would be most welcome. We could institute a rule that future code contributions must not reduce anon download speed, with performance regression tests on all supported platforms. |
I have just reimplemented the IPv8 endpoint with Twisted's own networking primitives. P.S. |
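For context, a fully Twisted-based UDP endpoint follows the shape below. A minimal sketch, not the actual IPv8 implementation; the port number and handler are illustrative.

```python
# Minimal sketch of a Twisted-native UDP endpoint; not the actual
# IPv8 implementation.
from twisted.internet import reactor
from twisted.internet.protocol import DatagramProtocol

class TwistedEndpoint(DatagramProtocol):
    def __init__(self, on_packet):
        self.on_packet = on_packet

    def datagramReceived(self, data, addr):
        # Delivered on the reactor thread: no callInThread context switch.
        self.on_packet((addr, data))

    def send(self, addr, packet):
        self.transport.write(packet, addr)

if __name__ == "__main__":
    reactor.listenUDP(8090, TwistedEndpoint(print))  # port is illustrative
    reactor.run()
```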
The latest related scientific work in this area, from OSDI 2018 (still server-based; not suitable for YouTube-like systems): Karaoke: Distributed Private Messaging Immune to Passive Traffic Analysis.

> Like Stadium, Karaoke is distributed over many machines, and must ensure that malicious servers do not |
From https://magic.io/blog/uvloop-blazing-fast-python-networking/
From https://www.nexedi.com/NXD-Document.Blog.UVLoop.Python.Benchmark: This tells us the bottleneck is typically not the reactor loop, but Python itself. |
How would these graphs look for UDP usage? uvloop at 80,000 req/sec with 10 KiB payloads = 800 MByte/sec, correct? |
@synctext, yes, that is 800 MBytes/sec. It is known that in simple networking benchmarks Go is typically 2-3 times slower than bare C/C++. That correlates well with other results obtained with libuv, e.g. CJDNS (~1 GByte/s without auth/encryption). |
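A quick back-of-the-envelope check of that figure:

```python
# Back-of-the-envelope check of the throughput figure above.
print(80_000 * 10 * 1024 / 1e6)  # -> 819.2, i.e. roughly 800 MBytes/sec
```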
We lack a structured performance evaluation suite. The content of #3325 is not available in an easy-to-use dashboard (#4999). For related work, see the beautiful writeup at the top NSDI conference; zero deployment of these ideas, just simulation code: https://github.com/tschorsch/nstor. Hard to beat that with our hard work. |
Since @egbertbouman is the person responsible for the new tunnels upgrade using Rust, and there are already good signs of progress there, I'm unassigning myself and assigning him. |
Privacy with usability is the core goal of our ongoing 11-year effort.
This end-to-end test is designed to identify problems in our whole Tribler chain, from session key exchange to rendezvous peer discovery.
Scenario: