Scalability of Dispersy channels #2106
A small weekly progress report:
Thanks for the update, curious. If you have time, please insert a picture of VisualDispersy here, for easy smartphone viewing.
In other news: I supercharged Dispersy with some selective multithreading. This brought the 10 nodes, 10 messages-per-node experiment down from 6 minutes to 10 seconds on my localhost, which was so fast that it crashed my VisualDispersy tool. The 3 nodes, one node with 10000 messages experiment was brought down from over 2 hours to 4 minutes. Note that this uses the fastest settings; it can still be slowed down at will to conserve system resources. One kind-of-big problem which remains in the synchronization department is the message delivery to the communities. It seems some kind of buffer, in between the endpoint and the community overlay, is holding the messages and then delivering them all in one shot. For example, for a node the first 9400/10000 messages will suddenly pour in 2 minutes after the first packet was received, and it will then continue to receive messages at a steady pace. This behavior is the real killer for the 1M messages community. Once I have finished polishing up the code (properly logging/handling exceptions, adding comments, hunting down thread id asserts, etc.) I will share it.
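For readers who want a feel for the kind of selective multithreading meant here, below is a minimal sketch of the idea: fan the CPU-heavy per-packet work out over a small thread pool, then hand the results back for order-sensitive bookkeeping. Names such as `decode_and_verify` and the pool size are assumptions for illustration, not the actual Dispersy patch.

```python
# Minimal sketch of offloading per-packet work to a thread pool; this is an
# illustration of the idea, not the actual Dispersy change.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)  # shrink to conserve system resources

def decode_and_verify(raw_packet):
    # Placeholder for the expensive part: packet decoding + signature checks.
    return raw_packet.decode("utf-8")

def on_incoming_packets(raw_packets, deliver):
    # Fan the expensive work out over the pool, then hand the decoded
    # messages back in order for the order-sensitive bookkeeping.
    futures = [pool.submit(decode_and_verify, p) for p in raw_packets]
    for future in futures:
        deliver(future.result())

# Example: process three fake packets and collect the decoded results.
received = []
on_incoming_packets([b"a", b"b", b"c"], received.append)
print(received)
```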
Message batching is that killer!
Wow, that is a solid performance evaluation + dramatic improvement. Impressive thesis content! Looking forward to seeing a demo on Monday. Message batching was introduced as a possible measure to improve performance: the idea was to process a few messages at once, which would reduce context switches and use IO more efficiently...
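For reference, the batching idea boils down to buffering incoming messages briefly and handing them to the processing code as a group. A minimal sketch of that pattern follows; the batch size and flush interval are arbitrary illustration values, not Dispersy's actual batch configuration.

```python
# Minimal sketch of message batching: buffer messages and process them in
# groups to amortise per-message overhead.
import time

class MessageBatcher(object):
    def __init__(self, process_batch, max_size=64, max_delay=1.0):
        self.process_batch = process_batch  # callback receiving a list of messages
        self.max_size = max_size
        self.max_delay = max_delay
        self.buffer = []
        self.first_arrival = None

    def add(self, message):
        if not self.buffer:
            self.first_arrival = time.time()
        self.buffer.append(message)
        if (len(self.buffer) >= self.max_size or
                time.time() - self.first_arrival >= self.max_delay):
            self.flush()

    def flush(self):
        if self.buffer:
            self.process_batch(self.buffer)
            self.buffer = []

# Example: batches of at most 3 messages are handed to the callback at once.
batcher = MessageBatcher(lambda batch: print(len(batch), "messages"), max_size=3)
for i in range(7):
    batcher.add(i)
batcher.flush()  # drain the remainder
```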
I did disable the batch feature (with the flag), so the buffering is occurring somewhere it shouldn't be. Right now I assume it to be in the horrible spaghetti of community.on_messages.
self._dispersy._delay() interesting... that might be from the time when peers did not always include their full public key in messages, just the hash. Messages then got delayed until the full public key was known. That really can be refactored out.
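For context, the delay mechanism described here amounts to parking messages under the sender's key hash and releasing them once the full public key arrives. A rough sketch of that pattern is shown below; the class and method names are hypothetical and not Dispersy's actual `_delay()` implementation.

```python
# Rough sketch of the "delay until full public key is known" pattern.
from collections import defaultdict

class DelayedMessageStore(object):
    def __init__(self):
        self.waiting = defaultdict(list)   # key hash -> parked messages
        self.known_keys = {}               # key hash -> full public key

    def on_message(self, key_hash, message, process):
        if key_hash in self.known_keys:
            process(message)
        else:
            # Park the message until the full key shows up.
            self.waiting[key_hash].append(message)

    def on_public_key(self, key_hash, public_key, process):
        # Record the key and release everything that was waiting on it.
        self.known_keys[key_hash] = public_key
        for message in self.waiting.pop(key_hash, []):
            process(message)
```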
Continuing with the conversion refactor: I also have a Gist for Protocol Buffers. Some observations:
All in all (also to save @whirm some work managing packages), Protocol Buffers might actually be the best choice. EDIT:
Small milestone: the standalone serialization/conversion has reached 100% unit test coverage and is ready for integration into Tribler.
Interesting, so what is the gain from using (capnp vs) protobuf vs. struct.pack? I was thinking of also including this in the tunnel community on Tribler. If protobuf is in the official Debian and Ubuntu repos then it is indeed the more favorable choice.
The gain of capnp and protobuf over struct is partly the readability of the message structures and partly easy backward compatibility. Note that these approaches may in some (rare) cases waste a few bytes compared to hardcore manual struct definitions (although this is usually covered by smart message packing/compression). As an example of the readability, the votecast message is currently defined as: '!20shl'. In Protocol Buffers you would define the exact same thing as:
The goal is to (eventually) have all communities use this.
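To make the readability comparison concrete, here is a small sketch: the `'!20shl'` format string packs three unnamed fields, whereas a schema-based format names and types each one. The field names and the .proto shape below are assumptions for illustration, not the actual Tribler definitions.

```python
import struct

# Current wire format: one opaque format string; the field names below are
# assumptions for illustration (they are not spelled out by '!20shl' itself).
VOTECAST_FORMAT = "!20shl"  # 20-byte string, 2-byte signed short, 4-byte signed long

def pack_votecast(infohash, vote, timestamp):
    return struct.pack(VOTECAST_FORMAT, infohash, vote, timestamp)

def unpack_votecast(data):
    return struct.unpack(VOTECAST_FORMAT, data)

# A schema-based format names and types every field explicitly.  A
# hypothetical Protocol Buffers definition of the same message could read:
#
#   message VoteCast {
#       bytes infohash = 1;
#       int32 vote = 2;
#       int64 timestamp = 3;
#   }

print(unpack_votecast(pack_votecast(b"\x00" * 20, 1, 1234567890)))
```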
And performance-wise? Your protobuf wrapper really looks useful.
Solid progress.
@lfdversluis Cap'n Proto is faster because it actually stores objects serialized in memory. Protocol Buffers does not, however, which makes it slower (https://capnproto.org/ has a nice bar plot of the two).
Yeah, I am familiar with the infinite-speed thingy. I was wondering if you had done any comparison and stress testing :D
After extensive whiteboard work, I think I have a new community design everyone can be happy with. The design allows both backward compatibility and forward compatibility, without having to change Dispersy. Note that this immediately adopts the single walker for all communities. Here is the class overview, which I will explain below:

Backward compatibility / old communities
The old communities will continue to exist (for the time being), but solely to forward messages to and from the new community. This allows for phasing out the old communities without loss of data. The new CommunityManager will exist alongside these old communities, behaving like any other Dispersy community.

The CommunityManager
This will serve as the mediator between all of the new communities and Dispersy. It will handle sharing/sending all Protocol Buffers message definitions from the new communities. The advantage of this single-community-in-the-middle behavior is that (1) only a single walker is used, (2) all new communities would use a

New communities
New communities will no longer have to deal with Dispersy directly. Instead of

If anyone has any critiques, questions and/or feedback: please share them.
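As a rough Python sketch of the mediator idea described above (class and method names are assumptions, not the actual implementation): a single Dispersy-facing manager owns the one walker and dispatches decoded payloads to the registered new-style communities, which never talk to Dispersy themselves.

```python
# Rough sketch of the CommunityManager mediator idea; names are assumptions
# for illustration, not the actual design's API.

class NewCommunity(object):
    """A new-style community: knows its own messages, never talks to Dispersy."""
    name = "example-community"

    def on_payload(self, message_name, payload):
        print(self.name, "received", message_name, payload)

class CommunityManager(object):
    """The single Dispersy-facing community: owns the one walker and
    multiplexes traffic to the registered new-style communities."""

    def __init__(self):
        self.communities = {}

    def register(self, community):
        self.communities[community.name] = community

    def on_dispersy_message(self, community_name, message_name, payload):
        # Called from the (single) Dispersy community; routes the decoded
        # payload to the right new-style community.
        community = self.communities.get(community_name)
        if community is not None:
            community.on_payload(message_name, payload)

manager = CommunityManager()
manager.register(NewCommunity())
manager.on_dispersy_message("example-community", "vote-cast", {"vote": 1})
```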
I am already quite convinced about your protobuf wrapper, although I am not sure how much time we gain by using it (I know it is faster than struct.pack, which we use in all conversion.py files now). If we decide to take this approach, I would be happy to make the anon-tunnels use this system, but to what extent do you think this will require rewriting? I cannot spend months on the tunnels, so an indication would be nice. If @synctext likes this approach and gives the green light to take this path, then I hope you can explain your plan to us all in more detail :)
@lfdversluis this can work alongside the 'old' channels, so there is no need to switch immediately. To answer your question: assuming the CommunityManager exposes all of the required functionality correctly, you could probably switch the community code itself in a day (+- 6 hours). However, changing all of the unit tests as well, getting the code peer reviewed and running into unexpected issues will probably make the process take 2 weeks.
Epic weekend work! Please note that backward compatibility is not needed. If we can release a new Tribler with a fresh, lighter, and faster AllChannel, that is all OK; we reset all the votes. Another release could break and upgrade search, tunnel, etc. If we can avoid breaking compatibility with little work, that is obviously preferred.
@qstokkink we currently have no dedicated unit tests for each individual community. Since the (big fat) wx tests will be removed soon, the coverage in the

I don't think I really understood the idea of the

Also, are you planning to refactor the (old) communities one by one, or are we going to change all communities to adopt your new design immediately? Other than that, I like the design and I definitely look forward to more stable and easy-to-use communities 👍
@devos50 Not having any unit tests definitely speeds up the adoption process of this new scheme. Furthermore, writing new tests should be a lot easier now. The old communities are for backward compatibility, so that in the transition period between Tribler versions, communities do not get torn in two. This would happen because of the switch in wire protocol, which would make it impossible for new versions to enter an old version's community and vice versa. By keeping support for the old protocol for a bit, you can perform the switch between old communities and new communities more gracefully. The added benefit of being able to cope with the old communities without breaking the new ones is that you can indeed switch over the communities one by one. On the other hand, it might make more sense to handle the port in a single pull request. I am not sure what the best approach would be.
Alright, I finished compiling the list of current Tribler wire-format messages, containing data/member/field types and aliases, just in case anyone wants to know what a particular message looks like on the wire right now. This will be the base for the Protocol Buffers definitions, so that transitioning will be as painless as possible. One particular thing that caught my eye is that some communities are overriding the introduction request and response. This will have to change if (or when) a single walker is used in Tribler. EDIT: I finished porting the AllChannel messages; here is the real-world example of how the Serializer would work. EDIT 2: All of the .proto definitions have been finished (see https://github.com/qstokkink/TriblerProtobufSerialization/tree/triblermessages). Moving on to integration with Tribler and porting communities.
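As a sketch of how such a Serializer might be used: messages are registered by name with pack/unpack callables, so community code never touches format strings directly. The API below is hypothetical (the real one lives in the TriblerProtobufSerialization repository and is backed by generated Protocol Buffers classes); plain struct is used here only to keep the sketch self-contained.

```python
# Hypothetical usage sketch of a name-based message serializer; the real API
# in TriblerProtobufSerialization may differ.
import struct

class Serializer(object):
    def __init__(self):
        self.packers = {}    # message name -> pack callable
        self.unpackers = {}  # message name -> unpack callable

    def register(self, name, pack, unpack):
        self.packers[name] = pack
        self.unpackers[name] = unpack

    def pack(self, name, *fields):
        return self.packers[name](*fields)

    def unpack(self, name, data):
        return self.unpackers[name](data)

# Register the (assumed) vote-cast layout and round-trip a message.
serializer = Serializer()
serializer.register(
    "vote-cast",
    lambda infohash, vote, timestamp: struct.pack("!20shl", infohash, vote, timestamp),
    lambda data: struct.unpack("!20shl", data),
)
wire = serializer.pack("vote-cast", b"\x00" * 20, 1, 1234567890)
print(serializer.unpack("vote-cast", wire))
```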
Something threw a wrench in the works, (very likely) preventing a pull request from being available this Monday: the communities use some of the Dispersy routing information in their logic. Out of the 37 header fields, 8 are currently being used inside the communities, 1 is deprecated due to the switch to Protocol Buffers and 1 is a duplicate field. I expect this to delay the refactoring by 1 or 2 days. EDIT: Actually, here is the new base community class. This is as good as it will get without performing some major refactoring inside the Dispersy project and the Tribler communities. We should discuss details and where the code should live next Monday.
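A rough sketch of what such a base class could look like: the community subclass only ever sees a reduced header plus the decoded payload. The field selection and names below are assumptions for illustration; the actual class is in the linked branch.

```python
# Rough sketch of a new-style base community that exposes only a reduced
# subset of the Dispersy routing header to its subclasses.  The chosen
# fields and names are assumptions, not the real selection of 8 fields.
from collections import namedtuple

# The handful of header fields communities actually need, instead of all 37.
ReducedHeader = namedtuple("ReducedHeader",
                           ["source_address", "global_time", "member_id"])

class BaseCommunity(object):
    def on_dispersy_message(self, full_header, payload):
        # Strip the Dispersy header down to the fields communities rely on
        # and hand the rest of the work to the subclass.
        header = ReducedHeader(
            source_address=full_header.get("source_address"),
            global_time=full_header.get("global_time"),
            member_id=full_header.get("member_id"),
        )
        self.on_message(header, payload)

    def on_message(self, header, payload):
        raise NotImplementedError("subclasses handle their own messages")
```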
Some somewhat exciting news: the first functional community port (allchannel) runs its gumby experiment without producing errors. For convenience, here is the port checklist:
And the backward-compatibility checklist:
@whirm How do you want to review this? One round of comments this week, or one giant PR when done? devel...qstokkink:protobufserialization_rc
Thesis material:
Together: First priority: PooledTunnelCommunity stable 👏
Thesis storyline: Dispersy is just a use case, now fast & usable. 48 cores == scalability? Key target for the final thesis experiment: 1 anonymous file download on a 16- or 48-core machine.
@egbertbouman
Issues moved to: #1150 (comment) |
This performance analysis work seeks to understand the effectiveness of the sync mechanism.
Linked to: #2039.
For the thesis work, first explain that sync to 10 peers failed for the example Flood community :-)
Repeat with a minimal example community. Use DAS4 or DAS5 for 10..1000 peer examples.
Repeat with the actual channel community with torrent collection.
Redo this with the Q-algorithm, sync of magnet links only, scale to 1M (small) items.