slow anonymous downloads: Crypto CPU bottleneck #1882
Comments
@synctext I don't think this is related to 6.6 |
Approach: our exit node helpers are maxed out in CPU. It will be extremely valuable to profile them, identify hotspots, test improvements, and benchmark various changes in the field. |
I would love to profile them, yes. I already have a hunch where the bottlenecks are, but it would be great to have profiler output that can be visualized with e.g. Valgrind to verify and confirm them. |
Anonymous downloads are quite slow. As of the V6.6 pre-release version, a solid i7 CPU core at 125% gives us only 330 KByte/sec. This confirms @whirm's view: both the tunnel helpers and the anon downloader are CPU constrained. |
I have made some measurements for the endpoint.py throughput per logging level: Long story short: choosing the wrong logging level can cost you around 400 MB/s throughput. |
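For anyone who wants to repeat this kind of measurement outside Tribler, here is a minimal sketch in Python 3 with only the standard library. It does not use the real endpoint.py: the logger name, `send_eager`, `send_lazy`, and `throughput_mb_per_s` are hypothetical names invented for illustration. It compares a send loop that formats its debug message on every packet with one that checks the logging level first.

```python
import logging
import time

# Hypothetical logger, not the one used by endpoint.py.
logger = logging.getLogger("endpoint_sketch")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)  # DEBUG records are filtered out


def send_eager(packets):
    """Build the debug string for every packet, even though DEBUG is off."""
    sent = 0
    for packet in packets:
        logger.debug("sending %d bytes: %s" % (len(packet), packet.hex()))
        sent += len(packet)  # stand-in for the actual socket send
    return sent


def send_lazy(packets):
    """Check the level once and skip all formatting work when DEBUG is off."""
    sent = 0
    debug_on = logger.isEnabledFor(logging.DEBUG)
    for packet in packets:
        if debug_on:
            logger.debug("sending %d bytes: %s", len(packet), packet.hex())
        sent += len(packet)  # stand-in for the actual socket send
    return sent


def throughput_mb_per_s(func, packets):
    """Return the payload throughput of func over packets in MB/s."""
    start = time.perf_counter()
    total = func(packets)
    elapsed = time.perf_counter() - start
    return total / max(elapsed, 1e-9) / 1e6


if __name__ == "__main__":
    packets = [bytes(1400)] * 20000
    for func in (send_eager, send_lazy):
        print(func.__name__, round(throughput_mb_per_s(func, packets)), "MB/s")
```

The absolute numbers depend on the machine and say nothing about Tribler itself; the point is only that the cost of a log call can be measured in isolation before deciding which calls to remove.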
@qstokkink hah. Just yesterday I was talking about removing unnecessary log calls when the big cleanup happens with @devos50. |
Ok, I finally have a design I am happy with and which will provide a significant speed increase. The main issues I faced were the following:
So the shocking design I came up with (working with Dispersy, instead of against it, for a change), is as follows:
This means that the main process controls the circuits (their number, their statistics, and so on) but no longer needs to be the man in the middle for data passing through. In turn, this means that the main process' Dispersy is freed up and therefore becomes a lot faster. An added benefit of this approach is that it is a lot easier to implement than the options described in (2) and (3), leaving the TunnelCommunity almost completely the same. These are the reasons why I believe this to be the superior design, rather than the one of (2) as discussed during the last group session/meeting. Feel free to provide comments or critiques. |
wow, wait, what? @qstokkink So one Dispersy instance per community. So we fork a new Python process for each community, each with its own IPv4 listen socket, own database, and walker. Is the process doing the setup on a different listen port than the Tunnel community itself? How does this work with UDP puncturing? Are we still puncturing the correct listen socket? Sounds quite interesting. I would suggest doing quick prototyping and getting the Tunnel community in its own Dispersy process. |
@synctext Correct: one TunnelCommunity for one Dispersy with one unique listen port for one subprocess/forked process. Note that this will also be the only community loaded for a subprocess. This should work perfectly fine with the UDP puncturing. Because all of the TunnelCommunity messages are sent directly through the endpoint with NoAuthentication() the database of each subprocess is hardly used (only for Dispersy internals). In spite of this, there is definitely a case to be made for having a shared set of walker candidates and a shared database between subprocesses, as you suggested. |
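The "one subprocess, one unique listen port" layout can be sketched with the Python 3 standard library alone. This is not the PooledTunnelCommunity code and involves no Dispersy: `tunnel_worker`, `start_pool`, and `stop_pool` are hypothetical names, and echoing a datagram stands in for relaying tunnel data. Each child binds its own UDP socket and reports the port to the main process, which keeps control but never touches the data path.

```python
import multiprocessing
import socket


def tunnel_worker(conn):
    """Child process: bind a private UDP socket and echo datagrams back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))       # port 0: the OS picks a unique port
    conn.send(sock.getsockname()[1])  # report the listen port to the parent
    while True:
        data, addr = sock.recvfrom(2048)
        if data == b"stop":           # hypothetical control message
            break
        sock.sendto(data, addr)       # stand-in for relaying tunnel data
    sock.close()


def start_pool(count, ctx=multiprocessing):
    """Start count workers and return a list of (process, listen_port)."""
    workers = []
    for _ in range(count):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=tunnel_worker, args=(child_conn,), daemon=True)
        proc.start()
        if not parent_conn.poll(5):   # do not hang if a worker fails to start
            raise RuntimeError("worker did not report a listen port")
        workers.append((proc, parent_conn.recv()))
    return workers


def stop_pool(workers):
    """Ask every worker to stop and wait for it to exit."""
    ctrl = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for proc, port in workers:
        ctrl.sendto(b"stop", ("127.0.0.1", port))
        proc.join(5)
    ctrl.close()
```

Calling `start_pool(4)` returns four processes with four distinct ports. On platforms where multiprocessing starts children by spawning a fresh interpreter, that call belongs under an `if __name__ == "__main__":` guard. Because every worker has its own port, each one would also need to be punctured separately.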
@qstokkink sounds like an interesting design. I'm curious to know how this works out. I'm a huge supporter of splitting Tribler in various processes, however, we should be careful that we are not overcomplicating communication between processes since I think it's important that (external) developers can easily modify the communication protocol used between the Dispersy processes. Another advantage of splitting it up in processes is separation of concerns: developers can focus on a single part of the system (and in the future, even implement a small part of the system in another language). With this design, we can utilise all these additional cores in the CPU 👍 |
@devos50 If anything, this should even simplify communications. The three messages being transferred are the following:
That's it. EDIT: Sorry, there is a fourth: the notifications. |
morning, sooo.... |
@synctext Correct, each subprocess will need to be punctured separately. If this is really a problem, the design can also use a single port: some very nasty code in endpoint.py already takes care of this. Do note that using the single port will already hit the performance pretty hard. |
soooo v2... |
Finished the comments/style corrections and sanity check. Almost there.. I might even have the PR done by tonight. The TODO list has become quite short. |
@qstokkink looking forward to it! Note that you don't have to squash everything into a single commit. Please make logical units of work that make sense and make sure your commit history is clean (consists of distinct changes, so no fixes on fixes). |
Woops, not happening tonight, I accidentally merged in some |
Repeating: #2106 (comment)
Together: First priority: PooledTunnelCommunity stable 👏 |
Further future (by Yawning Angel): #1066 (comment) |
Related work for the thesis, from MIT: "Riffle: An Efficient Communication System With Strong Anonymity". It uses central servers and lacks an incentive mechanism. |
@qstokkink nice results, so you are using hidden seeding and not plain seeding? |
@devos50 This is plain seeding, the y-axis is me being too lazy to edit the label. I also ran this on the 48-core machine (ergo 48 processes per peer) with the same amount of peers (8) and the results are pretty interesting (ignore the 2-hops, which fell back to non-anonymous downloading): EDIT: By the way, is bbq still O.K.? The above experiment had 1152 circuits and 384 processes running concurrently on the same machine. |
18 MByte/sec. Our users are going to love that. Really strange scaling to 48 cores. Great thesis material. |
@synctext Well, from the point of view of the 2 seeders, they have to serve 6 times as many downloads/leechers. Because of the download mechanism's overhead, it is better for a seeder to serve one leecher at high speed than multiple leechers at lower speed. This problem should disappear if the number of seeders scales up with the number of deployed proxies. |
:-) How do you lock a 64GByte, 48-core machine? What resource ran out? |
@synctext I have no way of knowing what resource ran out (probably either CPU or sockets ran out). Since it hasn't come back online yet, I can only assume it was the CPU and the building has burned down. And, yes we might have to put a warning symbol above certain settings, or along a slider. |
@synctext @qstokkink I will probably go to EWI tomorrow to reboot the thing :) |
@qstokkink the building did not burn down and bbq is up again :) (took me some effort to get access since my card was not working correctly...) |
@devos50 Thanks! I'll run a less flamboyant experiment from now on (which I know it can handle). |
latest thesis results: MSc_Thesis_v2.pdf |
Cardinal & Closing graph of thesis: fast anon download. or extra chapter with multi-core Javascript + fancy homomorphic crypto math! When using the 48-core BBQ and entire DAS5 together it should be possible to show nice anon download speed graphs. Goal is to have performance towards 48x on our 48-core download machine. DAS5 then acts as a dedicated seeding and relaying cluster. Showstopper (as always) : Gumby |
After some additional runs and behavior analysis, it seems like organising the DAS5-bbq experiment will require some fundamental code changes in the pooled tunnel code. Therefore, I have decided to definitively pull the plug on that and focus on the fancy crypto chapter instead. |
In science, formulas get more respect than running code.. |
MSc_Thesis_v4.pdf
|
(First) Release candidate: |
Good story flow!
|
Fixed in second release candidate: P.S. Apparently differnt passes the spelling checker 😕 |
Good, 6 pages with the math fundamentals. |
This 2016 ticket did not yet focus on latency and the uTP protocol; #2620 is dedicated to this. This 2016 ticket documents important ideas on multi-core and crypto CPU load. Note: we still have stuff open. Closing. |
The CPU seems to be the reason for slow <1 MByte/sec anonymous downloads.
possible problem
Running crypto on the Twisted thread blocks all other Tribler activity. It is unclear whether we need 256-bit GCM mode. Anything that checks a signature, decrypts a message, etc. needs to be traced.
possible long-term solution
Separate thread for the tunnel community or even an isolated thread for each community.
Low hanging fruit: parallel execution of relaying, make it multi-threaded. Real threads: Twisted reactor versus import multiprocessing...
goal
Benchmark raw OpenSSL C code for GCM throughput (MByte/sec).
Create a minimal benchmark comparing current situation in Tribler with alternatives. Not re-using Tribler code, but a performance test processing 10.000 UDP
EDIT: use 10k UDP packets through exit node as benchmark.
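A minimal benchmark along these lines, not re-using Tribler code, could look like the sketch below (Python 3, standard library only). Everything in it is an assumption made for illustration: `run_relay` is a stand-in for the exit node, `benchmark` is a hypothetical driver, and the `transform` callable is a placeholder where the per-packet crypto under test (for example a GCM decrypt) would be plugged in. The default is an identity function, so out of the box it measures only the relaying overhead.

```python
import socket
import threading
import time


def run_relay(relay_sock, sink_addr, transform, stop_event):
    """Forward every datagram to sink_addr after applying transform()."""
    relay_sock.settimeout(0.2)  # wake up regularly to check stop_event
    while not stop_event.is_set():
        try:
            data, _ = relay_sock.recvfrom(65535)
        except socket.timeout:
            continue
        relay_sock.sendto(transform(data), sink_addr)


def benchmark(num_packets=10000, payload_size=1024, transform=lambda d: d):
    """Push packets through a loopback relay; return (received, seconds)."""
    relay = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sink = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    relay.bind(("127.0.0.1", 0))
    sink.bind(("127.0.0.1", 0))
    sink.settimeout(2.0)  # give up instead of hanging on a lost packet
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_relay,
        args=(relay, sink.getsockname(), transform, stop_event),
    )
    thread.start()

    payload = bytes(payload_size)
    received = 0
    start = time.perf_counter()
    try:
        for _ in range(num_packets):
            # Lock-step: wait for each packet to arrive before sending the
            # next, so loopback buffers never overflow and drop packets.
            sender.sendto(payload, relay.getsockname())
            sink.recvfrom(65535)
            received += 1
    except socket.timeout:
        pass
    elapsed = time.perf_counter() - start

    stop_event.set()
    thread.join()
    for sock in (relay, sink, sender):
        sock.close()
    return received, elapsed


if __name__ == "__main__":
    count, seconds = benchmark()
    print("%d packets in %.2f s (%.0f packets/s)" % (count, seconds, count / seconds))
```

The lock-step loop measures per-packet processing cost rather than peak pipelined throughput, which keeps runs repeatable. Comparing the identity transform against a real crypto transform would show how much of the per-packet time is spent on crypto.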