Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tribler code review #3646

Closed
ichorid opened this issue May 24, 2018 · 31 comments
Closed

Tribler code review #3646

ichorid opened this issue May 24, 2018 · 31 comments
Assignees

Comments

@ichorid
Copy link
Contributor

ichorid commented May 24, 2018

This is the big list of things I found out during the review.

TODO:

  • run_tribler.py: __main__ runs twice, refactor in 2-3 separate files.
  • Session object should be split up into several smaller objects and remixed with TriblerLaunchMany. There are several completely dead methods that are never used anywhere (except unit tests).
  • DBUpgrader should be de-Twistified and moved to run before Twisted reactor.
  • LaunchManyCore should be completely refactored by splitting into the series of smaller objects/methods. Its torrent-handling functionality should be moved to the torrent manager.
  • APIImplementation dir should be scrapped completely after refactoring LaunchManyCore.
  • RemoteTorrentHandler.py contains almost completely dead code. EDIT: it used to collect torrents in the background for anonymity and quality of service, but was broken a long time ago. Not gonna fix, should just redo this in AllChannel2.0.
  • SqliteCacheDBHandler.py is full of awful old code that should basically be completely gone when we move to AllChannel2.0. For example, PeerDB is only written, and is never read. We should keep SQL usage to a minimum to be as stateless as possible.
  • Notifier.py - subjects should not be pre-set in Notifier. Instead, they should be defined dynamically by users of Notifier object, in their respective modules.
  • Our SQL access routines are scattered all over the code, implementing the same functionality many times. Maybe we should switch to a unified modern solution, like ORM.
  • Libtorrent managing code: 💥 💣 💣 💣 💣 💣 💣 💣 💣 😱
    2k lines of code in 2(two!) objects! And some more high-level stuff is spread out through the rest of the codebase. There is a LOT of valuable experience in this code, so refactoring should be done carefully.
  • TorrentDef.py - includes dumb TorrentDefNoMetainfo object that forces everyone to check the object type. Should be refactored to become one with TorrentDef. Which should be moved into TorrentMgr module.
  • TriblerStatistics object should be used as a parent class for Session, instead of being wrapped around.
  • I stopped after half an hour of reading Dispersy code. My sanity is still precious to me...
  • Reviewing every file in SOCKS proxy is not required, everything is fine there.
  • Various flavors of Torrent object are added again and again by different subsystems...
  • In general, the most common problem is stuff being spread all over the place. It breaks isolation and greatly impedes understanding the code.
Subsystem Overall state Outstanding problems Solution
Startup Complete mess Spread through several big objects and files. Rearrange objects and methods into a single module.
Torrent handling 19-th century cookbook: lots of cool recipes, funky language, no structure. There are several types of distinct torrent objects with repeating functionality; all torrent-managing code is in 2 giant objects; no clear sync/async separation; pieces of code are scattered all around the codebase; Step 1: move everything into a single module; Step 2: unify torrent objects; Step 3: refactor to no end;
RemoteTorrentHandler Evolutionary atavism Does not work Delete and forget.
SQL handling Ancient magic Everyone do their own magic. Gradually shift to ORM, while writing new and refactoring old code.
Core utility code (Tribler.Modules.*) A collection of random pages torn from cookbooks of different centuries. Lack of structure. Gradually move everything in its place.
Dispersy The Necronomicon Cthulhu Fhtagn! Reimplement Search, Channels, AllChannel with IPv8, scalable local gossip algorithm and DHT.
TFTP module Minion of the Old Gods Obsolete Delete and forget.
SOCKS proxy Neat and tidy Is in the Tribler, instead of IPv8 TunnelCommunity Move to IPv8 TunnelCommunity.
REST API Neat and tidy Everything is fine, almost no cruft there. Praise and care.
Torrent Checker Neat enough Code is fine, though the algorithm is not effecient. Enhance the algorithm.
Category filter, Credit mining OK Could use some cleaning, but basically OK. Keep clean
Tribler Config Telephone book Ok, but a lot of boilerplate code and copy-paste make the potential programmer prone to error. Use Python .dict trick to reduce code size 3 times. And make individual subsystems add options on their own to common object.
TriblerChain adapter A love note to IPv8 No problems. At all. Praise and care.
TriblerTunnel Another love note to IPv8, much more passionate and longer. Some methods are too long and not grouped. Otherwise, everything is OK. Create member objects for subtasks and split some methods into more readable form.
Market community Writings of a heretical sect of madmen who split from the Church of the Old Gods to establish a cult of their own. Total overengineering. I don't know.
@ichorid ichorid self-assigned this May 24, 2018
@ichorid
Copy link
Contributor Author

ichorid commented May 25, 2018

Monitor-bezel-crushing image of Tribler module imports.

  • On the left side, you could see the Gratutious Spaghetti Pasta of Dispersy. Basically, everything is calling everything there...
  • Tribler market and tunnel community are connected to it by a feeble umbilical cord of import is_valid_address and import call_on_reactor_thread respectively.
  • Tribler.community.market is in the center. It shows some signs of hierarchy.
  • IPv8-related imports show clear hierarchy, though its a little bit flat.

packages_tribler out

Zoomed in.
Dispersy:
disp_pack

Market:
market_pack

IPv8:
ipv8

(p.s. used pyreverse from pylint to generate these. Had to remove all decorators, since it can't handle different decorators with the same name, or when a decorator has the same name as the function.)

@ichorid
Copy link
Contributor Author

ichorid commented May 25, 2018

It seems RemoteTorrentHandler contains completely dead code. Like, it's not used anywhere, or never ran. And when you break it, nothing happens, Tribler works OK. It seems to only write torrents DB, and never read it.

Guys, could someone comment on this issue?

@qstokkink
Copy link
Contributor

It should be used here in the SearchCommunity:

self._rtorrent_handler = tribler_session.lm.rtorrent_handler

You should be able to trigger this through searching for a torrent.

@ichorid
Copy link
Contributor Author

ichorid commented May 26, 2018

The only place where something _rtorrent_handler is used is:

self._rtorrent_handler.download_torrent(candidate, infohash, priority=LOW_PRIO_COLLECTING,

But it is never run!

infohashes_to_collect = [infohash for infohash in to_collect_dict

is always empty. You could literally comment out .download_torrent, and Tribler GUI works as usual. Channels, search, anonymous downloads, mining - everything works.

So, what the rtorrent_handler subsystem is intended to do?

@ichorid
Copy link
Contributor Author

ichorid commented May 26, 2018

Basically, message.candidate is always empty here:

to_collect_dict.setdefault(infohash, []).append(message.candidate)

@qstokkink
Copy link
Contributor

@egbertbouman can you say anything about this?

@synctext
Copy link
Member

It should collect 50k of torrents slowly in the background. The TFTP binary transfer of torrents never worked smoothly. You found another skeleton in our collection... We have an open issue for magnet links only Tribler operation. Discussion started 3 years ago #1150

@devos50
Copy link
Contributor

devos50 commented May 26, 2018

I can probably answer your question about the remote torrent handler module.

Another purpose of the module, besides what @synctext says, is to speed up the process of fetching torrent metadata. In particular, when clicking on a torrent in Tribler (i.e. a search result or a torrent inside a channel), it would try to fetch the torrent from other candidates. The torrent file itself would be sent using the TFTP protocol and the data was sent over the Dispersy endpoint (since this endpoint was already NAT punctured by Dispersy). I even think there is a separate prefix for this.

It used to be this way in the old GUI. When I built the new GUI, I added support for fetching torrent metadata using HTTP/UDP trackers and the DHT. Since (re)adding support for the TFTP transfer of torrent metadata was not very high on my priority list, this feature was never added back to the GUI. The code still exists in the original code base, together with a few tests (which are not really unit tests...). I highly doubt whether we want to re-add this feature again. For now, I would be fine if you remove the TFTP module and code that manages fetching torrents from others. Let's discuss this next week 👍

@egbertbouman
Copy link
Member

@devos50 The RemoteTorrentHandler used to do a lot more than just TFTP torrent collecting, it's also capable of collecting from magnet/DHT/candidates. All this functionality is still in there, but I guess we stopped using this while creating the new GUI.

As far as TFTP goes, I don't think this has ever worked properly. If you look at the code you'll see the that it will only try to collect torrents that we already have 😢 This means that collect-requests are still coming in, but we never actually discover a new torrent.

@ichorid
Copy link
Contributor Author

ichorid commented May 28, 2018

I talked to everyone involved, and it seems the consensus is: "we used this stuff in the old days, but now its irrelevant, so it should go". So, its removal is on the list.

@synctext
Copy link
Member

Well, before it is deleted... We need to know the consequences. No full .torrent collecting anymore? Have measurements proven that magnet-based BEP9 collecting is now fast enough? Is there a design for anonymous downloads and getting the .torrent?

@ichorid
Copy link
Contributor Author

ichorid commented May 28, 2018

@synctext, I've just ran Tribler with this stuff completely disabled, and it worked fine. Search, channels, downloads - everything ran fine. Torrent collecting is redundant now.

@egbertbouman
Copy link
Member

egbertbouman commented May 28, 2018

@synctext I looked into the TFTP torrent collecting, and at the moment it's not working at all. That means that we are only collecting torrents from the DHT/trackers after some user interaction (e.g., downloading a torrent). For testing purposes, I did a 1-line fix, and Tribler magically started collecting TFTP torrents. So, if we are OK with the amount of torrents we are collecting right now, we could remove it.

Note that because we are not currently collecting torrents in the background, we are essentially gossiping all our downloads within the SearchCommunity...

@synctext
Copy link
Member

Thank you for finding this out!
Quality of search result has been going down in Tribler, in my direct experience. That explains it, the torrent collected has gone down. How many torrents do people have on average? How many people have the healthy 50k torrents? Every Tribler client used to broadcast these health stats. How many dead torrents in the cache? We need a single dev for the content health I believe.

@ichorid
Copy link
Contributor Author

ichorid commented May 28, 2018

@synctext, we're gonna scrape all of this in favor of AllChannel2.0, the Heir to Dispersy. We definitely want to have some mechanism for background torrent metadata exchange, at least for anonymity purposes. However, there is no point in reclaiming this old code.
For "Dispersy v1.9 refactored" release we could do away with the simple mechanism of returning more than one torrent on search/update request. That would be just enough to support anonymity and quality of service.

@synctext
Copy link
Member

51% attacks are now real

We first need a spam solution, protection against an attack in AllChannels mechanisms with merely $1000 worth of cloud servers. With solid foundation we can focus on fancy features. Please focus on linking Trustchain non-cheap identities to content discovery. Perfect if we can all do that in 1 upgrade, but I prefer to first remove the SyncAll broadcast in AllChannel community, upgrade to IPv8, and discard all channels votes with Trustchain > 2 hops (and a future Max-Edge-Flow > 1 GByte).

That is a lot of features to get flawlessly deployed, correct?

@ichorid
Copy link
Contributor Author

ichorid commented May 29, 2018

Trustchain non-cheap identities to content discovery. Perfect if we can all do that in 1 upgrade, but I prefer to first remove the SyncAll broadcast in AllChannel community, upgrade to IPv8, and discard all channels votes with Trustchain > 2 hops (and a future Max-Edge-Flow > 1 GByte).

- that is exactly what I meant by "Dispersy v1.9 refactored"!
To put to use the years of Tribler experience, we have to keep the minimal core of successful code/algorithms from AllChannel/Dispersy. To remove obstacles to future developments, we have to port it to IPv8. To make it secure enough for MarketCommunity, we have to add Trustchain linking.
No one talks about fancy features ))

@ichorid
Copy link
Contributor Author

ichorid commented May 29, 2018

Each object in our codebase has its own SQL access methods...
Maybe we should discuss using ORM?

It would greatly simplify and unify all SQL-related stuff.

@synctext
Copy link
Member

Yes, ORM is also on my 2019 wish list.

Our running old-IPv8 ORM code:
http://svn.tribler.org/abc/branches/giorgos/datasync/sqlalchemy/schema.py

Please take the time to read about our operational 2010 work with this code: https://repository.tudelft.nl/islandora/object/uuid:52b586ea-6144-4b4e-a5a1-b05255ce493a

Plus IPv8 in 2010 with ORM:
http://kayapo.tribler.org/trac/wiki/IPv8

@synctext
Copy link
Member

There is a LOT of valuable experience in this code, so refactoring should be done carefully.

Impressive how fast you comprehend our code base.

@ichorid
Copy link
Contributor Author

ichorid commented May 30, 2018

@synctext, I have to, to be able to keep up with the pace of development...

@ichorid
Copy link
Contributor Author

ichorid commented Jun 7, 2018

IPv8 issues:
Boilerplate code:

auth = BinMemberAuthenticationPayload(self.my_peer.public_key.key_to_bin()).to_pack_list()
dist = GlobalTimeDistributionPayload(self.claim_global_time()).to_pack_list()
payload = MyMessage(self.BLABLABLA).to_pack_list()

is a classic example of a boilerplate code that is used in every message in every community.
@qstokkink , is there some reason why this is not moved to IPv8 _ez_pack, and the programmer have to do it by hand every time?

@qstokkink
Copy link
Contributor

@ichorid the GlobalTimeDistributionPayload is deprecated and I want everyone making a message to be discouraged from putting it in there, by forcing them to add this extra line.

@ichorid
Copy link
Contributor Author

ichorid commented Jun 11, 2018

@qstokkink, but it is in your wiki tutorial...
What should we use instead of it?

@qstokkink
Copy link
Contributor

@ichorid good point, I'll remove that.

@ichorid
Copy link
Contributor Author

ichorid commented Aug 17, 2018

I've done some quick check for duplicates in Triber with Clonedigger.
Market community is the first victim:

Source files: 33
Click here to show/hide file names

Clones detected: 71

753 of 3066 lines are duplicates (24.56%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

(p.s. the analysis does not include unit tests.)

market_duplicates.pdf

@devos50 , I guess this kind of analysis could be useful for your efforts with Market.

@ichorid
Copy link
Contributor Author

ichorid commented Aug 17, 2018

Looks like IPv8 has some duplicates too:

Source files: 8

Clones detected: 138

872 of 6993 lines are duplicates (12.47%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

ipv8_duplicates.pdf

@ichorid
Copy link
Contributor Author

ichorid commented Aug 17, 2018

Bear in mind, though, that Clonedigger's algorithm is language-agnostic and prone to false positives. So, take these stats with a proper grain of salt.

@ichorid
Copy link
Contributor Author

ichorid commented Aug 18, 2018

For the whole Tribler/next, without Dispersy and Test directories, and without submodules:

Source files: 272

Clones detected: 2822

4892 of 30018 lines are duplicates (16.30%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

tribler_duplicates.pdf

@qstokkink qstokkink added this to the Backlog milestone Nov 27, 2018
@ichorid
Copy link
Contributor Author

ichorid commented Feb 23, 2019

Anyone willing to continue this work - welcome!
Closing it for now.

@ichorid ichorid closed this as completed Feb 23, 2019
@ichorid
Copy link
Contributor Author

ichorid commented Jun 17, 2019

After a year of refactoring, Dispersy is dead and Market is moved to a submodule. Here are the updated diagrams:
image

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants