Vadim's testament #6481

Closed
ichorid opened this issue Oct 21, 2021 · 6 comments · Fixed by #7726
Labels
type: memo Stuff that can't be solved

Comments

@ichorid
Contributor

ichorid commented Oct 21, 2021

Related to #143

Preface

Over the last few years, we successfully solved every major technical problem inherited from previous Tribler dev generations. We radically updated the Tribler codebase, established a robust code architecture and brought code support up to industry standards. Also, we've cut down on every non-essential feature, focusing development on two things only: metadata delivery and anonymous downloads.

During this journey, we identified a number of technical and scientific problems that block the Tribler project from reaching its goals. We tended not to tackle those for fear of bogging down our understaffed team, focusing on "low-hanging fruits" instead.

Gentlemen! 🚬
I inform you that all the low-hanging fruits are gone, only the hard ones remain. Here is the list.

Network-level problems

Solutions for these are obvious, but we never put enough effort into these, because we never had enough qualified manpower.

🚫 DHT calls blocked over tunnels 🚫

BitTorrent uses Mainline DHT to find nodes that seed an infohash. Mainline DHT is susceptible to various types of attacks, including DDoS. To counter these, BitTorrent libraries use spam control methods, blocking peers that send too many requests. Different clients employ different criteria for detecting DDoS attempts. The problem is that Tribler DHT requests are all sent through exit nodes, which may look like a single node to DHT peers, triggering spam control. The result is impaired, half-dead torrent info fetching and unreliable health info.
One solution to this problem could be caching or performing DHT requests on exit nodes on behalf of tunnel users. However, this could result in non-technical problems ©️ 👮‍♂️
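
To illustrate why tunnelling can trip spam control, here is a minimal sketch of a per-source rate limiter. The `PerSourceRateLimiter` class, the limit of 10 requests per second and the traffic pattern are all assumptions made up for this example; real DHT clients use their own (different) criteria.

```python
from collections import defaultdict


class PerSourceRateLimiter:
    """Toy spam control: at most `limit` requests per source per time window."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.history = defaultdict(list)  # source address -> accepted timestamps

    def allow(self, source, now):
        # Keep only the requests that are still inside the sliding window.
        recent = [t for t in self.history[source] if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.history[source] = recent
        return allowed


def count_blocked(limiter, requests):
    """Feed (source, timestamp) pairs to the limiter and count rejections."""
    return sum(0 if limiter.allow(src, t) else 1 for src, t in requests)


# 50 users each send one lookup at t=0.0 and one at t=0.5 (hypothetical load).
direct = [(f"user-{u}", t) for t in (0.0, 0.5) for u in range(50)]
# The same lookups, but all of them leave the network through one exit node.
tunnelled = [("exit-node", t) for _, t in direct]

blocked_direct = count_blocked(PerSourceRateLimiter(limit=10, window=1.0), direct)
blocked_tunnelled = count_blocked(PerSourceRateLimiter(limit=10, window=1.0), tunnelled)
print(blocked_direct, blocked_tunnelled)
```

The same 100 lookups pass untouched when they come from 50 addresses, but most of them are rejected once they share a single source address.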

🐌 Slow tunnels performance 🐌

Our anonymous tunnels code is implemented in Python (though the crypto library is low-level). Performance is very, very bad compared to VPNs: we do 5 MBytes/s at best and 0.5 MBytes/s on average for an overseeded torrent on a fast modern PC, while a typical VPN (such as WireGuard) will use the whole bandwidth available to the host (about 20-80 MBytes/s). One reason was that Python copies strings on slicing, resulting in a useless waste of memory and CPU cycles (solved). The remaining reason is slow data exchange between Python and lower-level libraries. The solution is to implement a shim between IPv8 tunnels and the SOCKS endpoint in a lower-level language.

The problem is exacerbated by Libtorrent's bad performance when using uTP.
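
For reference, the slicing issue mentioned above (already solved) can be shown with plain Python. This is a generic sketch, not IPv8 code: the 16-byte header length is an arbitrary assumption.

```python
HEADER_LEN = 16  # arbitrary header size, for illustration only

packet = bytes(1_000_000)

# Slicing a bytes object allocates a new object and copies the payload.
payload_copy = packet[HEADER_LEN:]

# Slicing a memoryview only creates a new view onto the same buffer.
payload_view = memoryview(packet)[HEADER_LEN:]

print(len(payload_copy), payload_view.nbytes)
print(payload_view.obj is packet)  # the view still refers to the original buffer
```

Avoiding such copies removes one source of overhead, but it does not address the cost of moving data between Python and lower-level libraries.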

2️⃣ Support for BitTorrent 2.0 2️⃣

LibTorrent 2.0 is actively pushing the BitTorrent 2.0 standard, which moves to a more secure, 32-byte SHA-256 hash. Supporting it will touch every part of the Tribler codebase, as the 20-byte hash size is hardcoded and expected everywhere.
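
A minimal sketch of what "not hardcoding 20 bytes" could look like. The helper `infohash_version` is hypothetical and not part of Tribler; it only relies on the digest sizes of SHA-1 and SHA-256.

```python
import hashlib

V1_HASH_LEN = hashlib.sha1().digest_size    # 20 bytes
V2_HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def infohash_version(infohash: bytes) -> int:
    """Hypothetical helper: guess the hash flavour from the raw infohash length."""
    if len(infohash) == V1_HASH_LEN:
        return 1
    if len(infohash) == V2_HASH_LEN:
        return 2
    raise ValueError(f"unexpected infohash length: {len(infohash)}")


print(infohash_version(hashlib.sha1(b"demo").digest()))
print(infohash_version(hashlib.sha256(b"demo").digest()))
```

Every database column, message format and validation check that assumes a fixed 20-byte value would need a similar length-aware treatment.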

Token economy

Currently, our token economy does not do anything useful, but instead just perplexes the users and annoys the developers. Here are the reasons:

🕳️ The exitnode blackhole problem🕳️

When Tribler downloads something through an exit node, the user pays it the corresponding amount of bandwidth tokens. The problem is, 99.99% of seeds are non-Tribler, meaning that exit nodes pay no one. Essentially, exit nodes play the role of super-seeds for the network, but they never spend their tokens. The result is, exit nodes become "supermassive black holes" of the Tribler economy, constantly dragging regular users to a negative balance. And negative numbers piss off people, incentivizing them to either stop using Tribler, or just regularly delete their identities to whitewash their balance.
There is another problem adding more complexity to the issue: when someone from outside the Tribler network exchanges traffic with a Tribler hidden seeder, or just an anonymous downloader, the traffic leaving the Tribler network will never be paid back, ultimately making the economy deflationary.
And no, the simple solution of prioritizing Tribler peers will destroy performance: instead of a hundred fast peers, we would use a single slow one.
One possible solution is to stop showing balances altogether and instead, show the user's relative ranking.
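
A minimal sketch of the relative-ranking idea, assuming the client knows the balances of some set of peers. The function name and the "fraction of peers with a lower balance" definition are assumptions for this example, not an existing Tribler API.

```python
from bisect import bisect_left


def relative_rank(balance, known_balances):
    """Fraction of known peers whose balance is strictly lower (0.0 to 1.0)."""
    ordered = sorted(known_balances)
    if not ordered:
        return 0.0
    return bisect_left(ordered, balance) / len(ordered)


# Hypothetical balances in MBytes: three regular users and one exit node.
balances = [-2000, -1000, -500, 10_000]
print(relative_rank(-500, balances))
```

With this presentation, a user at -500 sees "better than half of the peers you know" instead of a negative number.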

👎 Hidden seeding is ~broken~ useless 👎

Receiving UDP over SOCKS is broken in Libtorrent.
When a Tribler instance starts seeding a torrent in "hidden seeding mode", the torrent will only be available through exit nodes. Unexpectedly, the seeding ratio of the hidden torrent will always be near-zero. The reason is: the BitTorrent protocol prefers the 🏎️fastest🏎️ seeds, but hidden seeding is always 🐢slower🐢 than direct seeding.
The result is, hidden seeding does not help with recuperating the bandwidth tokens the user spent on anonymous downloads, further breaking the Tribler token economy. This is a vicious circle: Tribler users don't seed because no one pays them for seeding, because there is no incentive for Tribler users to download from other Tribler users.

⬆️ Prioritization of users is semi-functional ⬇️

Currently, there is just a single mechanism of prioritizing users on exit nodes: if the user's balance goes too low, the user will have a lower probability of getting service. (@devos50 , correct me if I'm wrong) The mechanism is very primitive, barely working, and is trivial to circumvent by whitewashing.

🍾 The exitnode bottleneck and inflection to self-sufficiency 🍾

A simple back-of-the-envelope calculation shows that for the Tribler network to become self-sufficient, there should be about **one million** Tribler users online at every given moment. The inflection point is 500 000 users: after that, most of the traffic will be served from inside the "Tribler bubble":
(image: inf)

However, even then all the traffic will have to go through the exit nodes. The reason is, in the current architecture of the Tribler anonymization network there is no such thing as "hidden-seeding-only" exit nodes. I.e. if there are a million users and no exit nodes, hidden seeders will not be able to connect.
The trivial solution is to add a special class of "pseudo-exit-nodes" that only allow connections to hidden seeders. (@egbertbouman correct me on this if I am wrong.)

💰 Credit mining fiasco 💰

In an attempt to bootstrap Tribler into a self-sufficient ecosystem, we tried to implement a "Credit mining" system that should have allowed users to "invest" some disc space and traffic into seeding Tribler torrents to get token rewards. Unfortunately, the only thing it provided to the users is a constant stream of lost tokens and disappointment. Eventually, we removed this feature.
The reasons why it failed are multiple (e.g. the hidden seeding problem described above), but the ultimate one is: BitTorrent is a non-zero-sum game. If everyone is using the same algorithm and downloading the same popular torrent in hopes to profit from it, every megabyte of the torrent they provide to others plays against them. In fact, the simplest analysis shows that the series of "wins" for a torrent starting from a single peer resembles a harmonic series, which grows incredibly slowly because of diminishing returns.
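
To get a feeling for how slowly a harmonic series grows, here is a small numeric sketch. Mapping the k-th "win" to exactly 1/k is an assumption made for illustration; the text above only says the series resembles a harmonic one.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def harmonic(n):
    """Partial sum H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


for n in (10, 1_000, 1_000_000):
    # H_n is close to ln(n) + gamma, i.e. it grows only logarithmically.
    print(n, round(harmonic(n), 3), round(math.log(n) + EULER_GAMMA, 3))
```

Under that assumption, going from a thousand to a million peers less than doubles the total "win".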
The solution to this problem is two-fold:

  1. devise different token prices for different torrents
  2. stop trying to replicate the money economy and instead come up with a social rating system

🤑 Deanonymization by payouts 🤑

In the current architecture, we do payouts immediately. In combination with an open ledger, this could allow deanonymizing people by their traffic patterns. Solution: implement [deferred (and possibly fuzzy) payouts](https://github.com//issues/4255).

🤝 Unite the economies of Metadata, Anonymity, and Seeding 🤝

If Tribler is ever going to reach its goal of creating an attack-resilient economy for media, users must be able to provide value and trade different kinds of services in it. There are three primary ways a user can benefit the media-economy:

  1. provide anonymization services (being an exit node or an intermediary peer)
  2. provide seeding services (storing data for rare torrents)
  3. provide metadata enrichment services (create and maintain channels, categorize metadata, add tags, etc.)

All three kinds of activities must reside in the same, single economic space. This means either using a single token to reward all three kinds of activities, or creating three different systems of tokens that can be traded on a free market.

Content layer

Due to numerous social engineering mistakes, design errors and architectural miscalculations, the current Tribler content layer "Channels 2.0" failed to reach its goal of becoming a decentralized alternative to web-based BitTorrent trackers.

💫 BitTorrent is unsuitable as the Channels backend 💫

Big torrent collections cannot be created or maintained by a single user. Therefore, the bigger the collection, the more people are required to maintain it and edit it simultaneously. If the data is stored as a torrent, each change results in creating a new infohash for the whole collection, eventually leading to swarm fragmentation. Thus, collaborating on, or just regularly copying data from a large external source becomes nearly impossible. Clearly, BitTorrent is unsuitable as a collaboration platform backend.
Also, using torrents as a backend involves pretty complex logic for packing the data into append-only files and dealing with asynchronous events from an external entity (Libtorrent).
The solution to this (and other) architectural problems would be abandoning BitTorrent as the Channels backend altogether and instead fetching data on-demand from other peers. Migrating to BitTorrent 2.0 will not help, because the problem is not storing many small files per se, but the rate of change being proportional to the number of contributors of a channel (e.g. quadratic if every user is a contributor).
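
The fragmentation effect can be shown with a toy model. The `toy_infohash` function below is an assumption for illustration: it hashes a length-prefixed list of entries with SHA-1 and is not the real bencoded info-dict hashing, but it shares the relevant property that any change to the content changes the identifier of the whole collection.

```python
import hashlib


def toy_infohash(entries):
    """Toy stand-in for an infohash: SHA-1 over length-prefixed entries."""
    digest = hashlib.sha1()
    for entry in entries:
        digest.update(len(entry).to_bytes(4, "big"))
        digest.update(entry)
    return digest.hexdigest()


channel_v1 = [b"ubuntu-22.04.iso", b"debian-12.iso"]
channel_v2 = channel_v1 + [b"fedora-39.iso"]  # one contributor adds one entry

print(toy_infohash(channel_v1))
print(toy_infohash(channel_v2))  # a different swarm, although most data is shared
```

Peers still seeding the first version and peers seeding the second one end up in separate swarms, even though nearly all their data is identical.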

✏️ No crowdsourcing instruments ✏️

The ability for users to author metadata cooperatively is a critical requirement for Tribler Project to reach its goals.
The Channels 2.0 system was initially designed as permissioned, except for the top-level federated channels list (the "Discovered" tab). The plan was to begin from permissioned and then add permissionless crowdsourcing elements, such as the ability for users to create "pull requests" to others' channels. Unfortunately, we became distracted by gigantomania and non-essential features.

The upcoming tags system is a step in the right direction, although its decision to not use the Channels 2.0 backend poses the question of how those two systems are going to be integrated in the future.

Also, users must be able to communicate with each other to discuss information organization. No crowdsourcing system could exist without discussion between participants.

🔍 Channels search is lame 🔎

Our Channels search algorithm is very simple and inefficient: ask five random neighbours for some results on a keyword. That's all. No dynamic deepening of the search, no walks, no deduplication, no popularity concerns, no indexing. Just 5️⃣ random hosts 🤷 This is a clear obstacle for the Tribler content layer to become usable. One thing that probably saves us at the moment is the proliferation of Free-For-All (FFA) entries due to local caching when users search for popular keywords.

Some related issues: #2250 #2547
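
As a sketch of the current approach plus the most basic missing piece (deduplication), consider the following. Modelling each neighbour as a plain callable that returns `(infohash, title)` pairs is an assumption for this example and does not reflect the actual IPv8 community interface.

```python
import random


def search(keyword, neighbours, fanout=5, rng=random):
    """Ask `fanout` random neighbours and merge their answers by infohash."""
    picked = rng.sample(neighbours, min(fanout, len(neighbours)))
    merged = {}
    for ask_peer in picked:
        for infohash, title in ask_peer(keyword):
            # Keep the first title seen for each infohash, drop duplicates.
            merged.setdefault(infohash, title)
    return merged


def fake_peer(keyword):
    """Hypothetical neighbour that always returns the same two entries."""
    return [("aa" * 20, "ubuntu-22.04.iso"), ("bb" * 20, "debian-12.iso")]


results = search("iso", [fake_peer] * 8)
print(len(results))
```

Deepening the search, ranking by popularity and indexing would all build on top of such a merge step.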

🧟 Brain-dead data transfer: 99.99% of Channels data unused 🧟

Channels 2.0 design is based on transferring the log of signed channel changes and then replaying it on the user machine to put those entries in the user's local DB. The processing is very slow and inefficient for bigger channels and unreliable for smaller channels. An analysis of possible solutions to the transfer problem shows that there is no good design at all, if we are downloading full channels.
The real problem is data integration: there are no databases that allow transferring and merging indexes in sub-linear time. Also, a human being is only interested in a handful of torrents from a million-torrent channel, so 99.99% of the transferred data is wasted. Moving everything around is a brain-dead 🧟 waste of bandwidth, CPU cycles and users' time. Google does not ask the user to download the whole index before usage, really 😉
The solution is to spread the data around in the network and fetch it dynamically, using the network itself as an index.

(image)

(Be warned that transferring complete SQLite DBs is not an option because of security issues.)

🍿 "Popular" suggestions are 95% garbage 🍿

The "Popular" tab is served content by the Popular Community. There are multiple problems with how that works: first of all, the info is not propagated transitively, for fear of spam. Also, the Popular Community uses the push-based gossip model, which still keeps the overlay susceptible to flood-spam attacks, but does not allow for the "initial boost" feature of pull-based gossip (that makes Channels discovery usable). The push-based model also prevents us from doing aggressive walks for research purposes, as we did with TrustChain and Channels. Also, there is no bias for newer torrents in the "Popular" tab. The result is, the "Popular" tab only shows 2-3 torrents that are really popular at the moment: the rest are typically 2-15(!) year old torrents that showed some great number of seeds at some moment, and for some reason (probably due to DHT bugs) still sometimes show a high number of seeds. The problem is exacerbated by the fact that BEP33 checks for seeds are unreliable, and there is basically no way to tell the real number of seeds for a torrent without connecting to the corresponding swarm. In Tribler, these connections always go through exit nodes, which often results in DHT spam filter triggers and unnecessary exit node load.
The problem of popular torrents is a complex one. Basically, we must design a distributed algorithm for collectively checking a set of entries (infohashes) and sorting those entries dynamically based on a dynamic property (number of seeds). And don't forget the exit nodes bottleneck and the constant danger of spam!

Some solutions to this problem could be:

  • cache health data on exit nodes (dangerous due to non-technical problems)
  • establish a separate class of "health-checker supernodes".
  • stop propagating health data and instead go for some relative-popularity, time-based heuristic like VSIDS
  • switch to pull-based gossip and add dynamic boost based on the state of the local database
  • split the health checking work between neighbouring peers in a semi-structured way
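
The VSIDS-like option from the list above can be sketched as "bump on activity, decay over time". The class name, the decay factor of 0.5 and the notion of a "tick" are assumptions for this example, not an existing Tribler design.

```python
class DecayingPopularity:
    """Relative popularity: scores are bumped on activity and decay over time."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.scores = {}  # infohash -> current activity score

    def bump(self, infohash, amount=1.0):
        """Record one observation of interest in a torrent."""
        self.scores[infohash] = self.scores.get(infohash, 0.0) + amount

    def tick(self):
        """Age all scores, so old activity gradually stops counting."""
        for infohash in self.scores:
            self.scores[infohash] *= self.decay

    def top(self, n):
        return sorted(self.scores, key=self.scores.get, reverse=True)[:n]


popularity = DecayingPopularity(decay=0.5)
for _ in range(10):
    popularity.bump("old-hit")   # was very popular a while ago
for _ in range(5):
    popularity.tick()            # five periods pass without new activity
popularity.bump("fresh-release")
print(popularity.top(1))
```

Such a heuristic gives the bias towards newer torrents for free, without having to know the real number of seeds.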
👄📏 Family filter is Victorian-nun-meets-Nazi overzealous 👄📏

Originally, our Family Filter was a quick hack made of bag-of-words-grep from some Dutch porn site. That's early-80s state-of-the-art! We definitely need something smarter, e.g. BERT.

🐈🐈 "Just show 100 most popular torrents" will never work 🐈🐈

Content popularity for file-sharing networks and the Web is fundamentally different: file-sharing is much more flat at the top.
Essentially, there are no "Google" or "Facebook" among torrent files. The reasons are many:

  • movies are available in many languages
  • movies are available in many resolutions
  • movies are fast-lived, popular only for a short amount of time

Basically, file-sharing is about streaming media: games, shows, movies, music - and the media really streams. Torrent popularity is short-lived. Thus, the strategy of "let's just show 1000 most popular things" will never work with torrents: people's taste for movies is much more diverse than their taste for websites.
Also, torrent collections are products of specific communities. Copying content will not copy the associated community: the collection will remain dead 💀. Best case, if some algorithm would be copying contents continuously, Tribler will forever remain just a mirror of those web-based trackers, like the Internet Wayback Machine. Do people use that often? How many people even know of the Internet Archive?
Moreover, copying content from another platform disincentivizes Tribler users from creating functional channels and communities around those. Users just "choke" on those big piles of content, unable to change those or use them in their own projects.

(image)

The way to solve this is to acknowledge that user communities co-evolve with the content they produce and their environment. Instead of focusing on copycatting 🐈 🐈 contents from others, we should focus on developing efficient crowdsourcing tools for the Tribler community. One can spend infinite amounts of gas trying to start a fire - it will never become 🔥self-sufficient🔥 if the logs are 💧wet💧.

User interface

Aside from the usual discussion about using a Web stack instead of Qt, our GUI has a looooooong way to go in regard to style and usability...

:goberserk: UI looks :goberserk:

In general, Tribler UI looks lame and outdated, like a thing made by a schoolkid in early 2000s (which it essentially is 🤦‍♂️ ). First of all, none of the other torrent clients uses a dark scheme. The thing just does not associate well with what torrent clients do - sending files around. It would be very nice if we could provide both light and dark themes with Tribler: unfortunately, this will require refactoring the Qt CSS mess - at the moment, the stylesheets are scattered all around `.ui` files and `.py` files. Some colours are even set up in the code manually. Qt bugs do not help with this task either. In general, the solution should be: "move all the stylesheets into a separate file, leave `.ui` files unstyled". Also, it will require creating position-specific subclasses for many widgets in the GUI.

Second, our UI is just... no eye-candy? Inconsistent? Here is an example of a sleek modern PyQt GUI (PyOneDark):
(image: GUI)

Third, lots of little usability details are missing: the keyboard focus is not there where it is expected, we don't use keyboard shortcuts, we raise dialogs on every occasion, dialogs look boring without any icons, etc. In short, when one opens Tribler for the first time, their first reaction is: "OMG, that's ugly 👹 ". Then they try to use it and it does not disappoint their expectations - the UX is as 💩 as the UI.

With such looks, it will be extremely hard to reach 1 million users in the world of 15-second attention spans, where each app only gets a single chance to prove itself.

The solution is to either hire a professional GUI design team, or move to Web tech and reuse templates that are already there.

:neckbeard: Identifying torrent authors (identicons) :neckbeard:

At the moment, when the user searches for content, they get a flat list related to channels, folders and torrents. The problem is, the user can't see the author or the source channel of those. This makes it impossible for users to identify good sources of information (e.g. channels to subscribe to).

The solution is two-fold:

  1. show full paths for entries in the search results list, either in form of a pop-up hint, or inline
  2. add identicons for public keys/channels
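
A minimal sketch of how an identicon could be derived from a public key: hash the key and use the digest bits to fill a small, horizontally mirrored grid. The 5x5 size, the use of SHA-256 and the text rendering are assumptions for this example; a GUI would paint coloured cells instead.

```python
import hashlib


def identicon(public_key: bytes, size: int = 5):
    """Return a symmetric size x size pattern as a list of text rows."""
    digest = hashlib.sha256(public_key).digest()
    half = (size + 1) // 2  # columns that are actually derived from the hash
    rows = []
    for y in range(size):
        left = [digest[(y * half + x) % len(digest)] % 2 == 1 for x in range(half)]
        # Mirror the left part to get a symmetric, easily recognizable shape.
        cells = left + left[: size // 2][::-1]
        rows.append("".join("#" if cell else "." for cell in cells))
    return rows


for row in identicon(b"hypothetical-public-key"):
    print(row)
```

The same key always produces the same picture, so users can recognize an author across search results without reading a hex string.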
🌳 Add recursive channel search 🌳

Currently, the "filter" input box in channels searches for contents only in the current folder/channel, not diving into the folder/channel child folders. It would be very useful to instead make it work recursively on channel's folders.
To do this with acceptable performance, we'll have to add some accelerator structure to Channels DB, such as materialized paths, transitive closures or matrix encodings.
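
One of the accelerator structures, materialized paths, can be sketched with the standard `sqlite3` module. The `node` table and its columns are hypothetical and unrelated to the actual Channels DB schema: every row stores the full path of ids from the channel root, so a recursive filter becomes a prefix match.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, path TEXT NOT NULL, title TEXT NOT NULL)")
db.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, "/1/", "Linux"),               # top-level folder
        (2, "/1/2/", "Ubuntu"),            # sub-folder of Linux
        (3, "/1/2/3/", "ubuntu-22.04.iso"),
        (4, "/1/4/", "debian-12.iso"),
        (5, "/5/", "Music"),               # unrelated top-level folder
        (6, "/5/6/", "iso-tones.flac"),
    ],
)


def search_subtree(db, folder_path, text):
    """Find entries anywhere below `folder_path` whose title contains `text`."""
    cursor = db.execute(
        "SELECT title FROM node "
        "WHERE path LIKE ? AND path != ? AND title LIKE ? ORDER BY id",
        (folder_path + "%", folder_path, "%" + text + "%"),
    )
    return [row[0] for row in cursor]


print(search_subtree(db, "/1/", "iso"))
```

The trade-off is that moving a folder requires rewriting the paths of all its descendants.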

📜 Downloads table is too wide 📜

See https://github.com//issues/6452

(image)

Also, the table is not very responsive, especially when Tribler starts. A better way to represent the downloads list is to use a hideable list in the right half of the window. Also, that will enable a natural drag'n'drop way of moving torrents to/from Channels.
(image: Home_Layout)

🚫 Can't share files with Tribler, torrent creation broken 🚫

Tribler's primary goal is to enable users to easily share files. Our users asked for a feature to [share a folder of files](https://github.com//issues/4729). At the moment, the torrent creation dialog is [completely broken](https://github.com//issues/4674) (and it has been broken for a couple of years already).

This means two things:

  • torrent creation is too complex
  • users need a simpler way to share files

🏠 Bring back the Home screen 🏠

Home screen was removed because it did not bring any value (it was just a placeholder). Nonetheless, users expect there to be a home screen, something like a personalized dashboard or a news feed. This time, it should provide value to the user.

Some stuff that should be on the home screen:

  • updates to subscribed channels
  • changes to popular torrents list
  • list of active torrents
  • list of the user's channels
  • user's identicon
  • pull requests on the user's channels
  • updates on the user's pull requests
  • new personal messages, etc.
  • Tribler development news and new version notifications

📄 Switch to paginated Channels interface 📄

Implementing Channels interface with QTableView was a big mistake. Yes, it provided the fancy endless scrolling feature good for showing off to journalists, but endless scrolling is useless if all entries in the table are the same height. Google uses endless scrolling just for pictures - its search remains paginated for a reason. Moving to paginated QListView will result in the following upsides:

  • better navigation - pages help with that
  • rich, robust and easy to change entries representation (e.g. with thumbnails)
  • much simpler code (the whole "index to delegate" thing with QTreeView is pure horror)
  • enable us to create a united model for downloads and torrent entries in Channels - this will remove all the synchronization issues between the Downloads list and Channels contents.

@ichorid ichorid self-assigned this Oct 21, 2021
@qstokkink
Contributor

> The reason is Python copies strings on slicing, resulting in useless waste of memory and CPU cycles.

I agree with the conclusion that implementing a dedicated tunnel endpoint is the next logical step. However, I disagree that this is mainly due to string slicing.

Pretty much all of the slicing was removed from IPv8 packet handling a year ago. We even went so far as to replace bytes with bytearray. However, we didn't get the magic performance increase we were hoping for.

What we found (you can also see a hint of this in this breakdown by @ichorid and this breakdown by @egbertbouman) is that the pain lies in exchanging data between Python and <<your low-level language of choice here>>. For example, ciphers.py, which makes the calls to the OpenSSL backend. At some point I even implemented my own C endpoint which would transfer about 800 MB/s and when fed to Python was brought down to roughly 80 MB/s (excluding crypto). Therefore, never feeding this data into Python seems like a logical choice.

@ichorid
Contributor Author

ichorid commented Oct 22, 2021

> never feeding this data into Python seems like a logical choice.

Excellent point! The most logical thing to do would be combining SOCKS proxy with a minimal UDP endpoint, so data never leaves low-level domain.

@egbertbouman
Member

egbertbouman commented Oct 23, 2021

I'm no longer part of the team, but since I was heavily involved in the anonymity stuff I'll respond anyway.

> The result is half-dead DHT performance and unreliable health info.

Every time I suspected DHT nodes blocking Tribler, the real reason turned out to be something else (e.g., 43c2599, 3f54b1d, c5de5d7). Sure, anonymized DHT lookups are slower than normal lookups and DHT nodes can temporarily block Tribler, but I don't think it's as bad as you make it out to be. Of course, it could be that this has changed recently.

> The most logical thing to do would be combining SOCKS proxy with a minimal UDP endpoint, so data never leaves low-level domain.

Agreed, the solution lies in not using Python at all when processing tunnel data. It would be interesting to see what this increase in speed does to the relay/exit nodes. Since there is no bandwidth throttling implemented, you may end up trading one issue for another 😉

> Hidden seeding is broken (and useless)

The hidden seeding test appears to be broken in 2 ways:

Regarding the comment that hidden seeding is useless, that's just because the Tribler network is small. So, the chance of success is really low and therefore you'll rarely find a hidden seeder. That's pretty much what you would expect and what we experienced in the past with libswift.

@synctext
Member

Morning, All paths lead to Arvid version 1.2.13..
Thnx for fixing the tunneltest after being broken/unused for so long! Lot of input to process for roadmap development.

Especially the DHT spam experiments with a bombardment of 1000 UDP messages. Very insightful, we should reproduce those again. A moderate client monoculture is emerging I believe. Exit nodes are not load balancing, we talking 100% versus 0.5% load. Strange. Do we have proof that exit nodes are not the main download speed constraint?

At this point I believe the DHT peer discovery is just one point in the chain. No measurements have been conducted to demonstrate what is going on. My current opinion is we might be close to big performance boost (mere fixes) or not (non-Python) 😜

@devos50
Contributor

devos50 commented Nov 5, 2021

> Exit nodes are not load balancing, we talking 100% versus 0.5% load. Strange.

This might be a symptom of a related component (e.g., caching of exit node network addresses can bias the network towards a subset of exit nodes). Unfortunately, this is hard to tell since we have very little insight into the end-to-end behaviour of our anonymous downloading stack.

> No measurements have been conducted to demonstrate what is going on.

This job checks if hidden seeding is "working" (binary check) and, as a next step, could be adapted to get more insights into the load balancing amongst exit nodes. Additionally, we could extend that job to do a DHT spam experiment and see what happens.

@ichorid ichorid added type: Epic type: memo Stuff that can't be solved labels Nov 5, 2021
@ichorid ichorid pinned this issue Nov 6, 2021
@ichorid ichorid removed their assignment Nov 7, 2021
@drew2a drew2a changed the title ~Usability Roadmap Ticket~ Vadim's testament Nov 12, 2021
@drew2a drew2a unpinned this issue Jan 20, 2022
@drew2a
Contributor

drew2a commented Mar 16, 2022

The issue dedicated to libtorrent 2.0 support: #5556


6 participants