GigaChannel: BitTorrent is not enough #4677
Comments
We must use duplicates and simplicity until we have exceeded 1 million users. Please don't future-engineer this stuff before we see actual wasting of terabyte hard disks with Tribler 9. The above assumption on crowdsourcing needs to be validated in the real world first. Linus is a single person managing thousands of crowdsourcers. Complex systems evolve into unexpected solutions with remarkable efficiency.
(As background, I dislike DHTs and pub/sub with religious passion due to their fundamental incentive misalignment.)
One reason why Git became so successful is that Linus designed it with separation of influence in mind, so developers can merge their work into the main tree without interference from others. I do not insist on using DHT or any specific technology at all. I merely point out that:
Regarding the database size: User adoption of a social platform is like a nuclear reactor: there are catalysts increasing reactivity (e.g. useful information provided by the system), and inhibitors decreasing it (e.g. bad UI, etc). To "blow up", the reaction should become self-sustainable, meaning that the rate of the catalysis should prevail over the inhibition. Sometimes, when the system is on the threshold of becoming self-sustainable, it only needs a small "push". One analogy is how a nuclear bomb works: enriched uranium is (relatively) stable by itself, but if you compress it with a small explosion, it gets to a critical density and the fission reaction becomes self-sustainable. Buying ads for a start-up social platform can be seen as this kind of "push". Applying this to Channels:
It is quite possible that we will never reach the "critical density" of information, because of the inhibition caused by the database size growth rate and its side effects. The "hard" inhibition threshold of database size that will repel 99% of potential users could lie much lower than said 1TB (say, at a 100GB DB size). However, the self-sustainable level of the catalyst (useful information) could require 1TB databases with our current technology. In that case, no ads, features, or performance tweaks will ever get us to 1TB real-world databases. We are then stuck in a vicious circle: no 1TB databases, so no improvement of database density; no improvement of database density, so no userbase big enough to generate content; no content, so no 1TB databases. That would be a pretty sad scenario.
Is low latency really required, though? We could use one master mutable torrent that only mutates when a new channel is created. That way, peers only get updates when they view the channel list and there is a new channel. It would also be nice to include an index.html in each channel.
Well, two years ago we had a discussion with @synctext about using mutable torrents and cross-torrent swarms. For some reason, he dismissed the idea of using anything but the vanilla BitTorrent protocol 🤷 One way or another, we can't use "one swarm to rule them all", for the obvious reason that someone (us) would have to maintain it. This is pure centralization, and it is exactly the thing we are trying to fight by developing Tribler. Also, this will never scale, for the same reason Bitcoin does not scale. The current system of gossiping around subscribed channels is doing well enough to spread popular channels. Having said all that, it would be very nice to eventually have some channel swarms share common metadata elements, like pictures, etc. @Dmole, could you please explain further what "per file hashes" are?
Glad it's working out.
I understand the general desire for compatibility but normal v1 torrents can be used inside mutable torrents just for channels. Avoiding centralization would require trusting peers to not poison the list of channels (every client would have the pk)... which may not be a practical issue.
Nice, Arvid even cites Tribler as the source of Merkle tree inspiration! 😄
https://github.com/Tribler/tribler/discussions/5721 describes the solution.
Long story short: metadata requires crowdsourcing, which requires a low-latency delivery system, which BitTorrent can't provide. So, in addition to BitTorrent, we need something like a DHT-based pub-sub.
Motivation
Our primary objective is to provide relevant information to humans, in a form they can process efficiently. Human information-processing capabilities are very limited: people can't take in more than a few dozen lines of text at once. No one ever looks beyond the 3rd page of Google. Therefore, we must limit the information that we show to the user. There are three primary ways to do this:
All three ways work together nicely, complementing each other. For example, one can search for a word "foo" in a local database, sort the results by recency, click on the most promising entry, and then browse the collection holding it to look for similar entries.
Knowledge creation
People add information to the system. The information comes in the form of independent entries, possibly organized into collections. Every person has their own domain of information, enforced by the public key infrastructure, so no one has (direct) power over others' creations.
A person can either add some original information (e.g., a personal podcast) or copy it from others. Humans can't meaningfully produce more than a dozen original entries per day. Therefore, the influx of truly original content per person is minimal. However, when a person copies entries into their collection and shares it, they effectively produce new information. The act of selection is an information-producing event. We will call this information the grouping information.
The problem of duplicates
When one can't browse personal channels, there is no grouping information. Therefore, if entry E comes from both peers A and B, it makes no sense to store it twice, and the second copy can be dropped when it arrives at the local DB. As soon as we start to account for any kind of grouping information, cutting entries means losing or distorting this information. Of course, we can store database relationships instead of the duplicate entries themselves. This storage scheme will help with the indexing, but will still result in the same linear, O(n), storage requirements. Thus, we must drop grouping information based on some criteria, lest it overwhelm the system.
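To make the storage argument concrete, here is a minimal sketch in Python. The `Store` class and its field names are hypothetical and are not Tribler's actual schema. Each entry body is kept once, keyed by its content hash, while every (channel, entry) membership is kept as a separate relation.

```python
import hashlib


class Store:
    """Hypothetical deduplicating store; not Tribler's real database schema."""

    def __init__(self):
        self.entries = {}         # content hash -> entry body, stored once
        self.memberships = set()  # (channel public key, content hash) pairs

    def add(self, channel_pk: str, body: bytes) -> str:
        digest = hashlib.sha256(body).hexdigest()
        self.entries.setdefault(digest, body)       # duplicate bodies collapse
        self.memberships.add((channel_pk, digest))  # grouping information is kept
        return digest


store = Store()
for peer in ("A", "B", "C"):
    store.add(peer, b"entry E")

print(len(store.entries))      # number of stored bodies
print(len(store.memberships))  # number of stored relations
```

The bodies deduplicate, but one relation is still stored for every channel that holds the entry, so the total still grows linearly with the number of copies.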
Ontology tree balancing
When users can only produce single-level channels, it is impossible to build an ontology that includes more than about 100^2 entries, even if all users cooperate on this. A human user can't look through more than about a hundred channels, and can't look through more than a couple hundred entries in a channel. A perfect ontology is a balanced tree (or even better, a perfect encoding). Thus, users must be able to create multi-level channels, which means more grouping information.
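The 100^2 bound can be checked with a short calculation. This is only a sketch: the function name is made up for illustration, and it assumes a perfectly balanced tree in which a user can scan about 100 items at every level.

```python
def ontology_capacity(fanout: int, levels: int) -> int:
    """Entries reachable through `levels` nested channel levels.

    Assumes a balanced tree: every channel lists at most `fanout` children,
    and every leaf channel holds at most `fanout` entries.
    """
    return fanout ** (levels + 1)


print(ontology_capacity(100, 1))  # single-level channels
print(ontology_capacity(100, 2))  # one extra level of sub-channels
```

Under these assumptions, single-level channels top out at 10,000 entries, and each additional level multiplies the reachable entries by the fanout.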
Group selection
Once users figure out that their grouping information is dropped when it clashes with that of others, they will start organizing into communities to coordinate their efforts. This organization will require some instruments for collective authoring and communication.
Scaling
At some point, there will be channels with thousands of collective authors and sub-channels (collections). These collections will be updated a few times per day, which will result in several updates per second for the root channel. This is essentially the same problem that is faced by the Bitcoin ledger. Therefore, there should be only loose connections between the root channel and the sub-channels.
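As a back-of-the-envelope check, the update rate seen by the root channel can be estimated from the number of sub-channels and how often each one changes. The figures below are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def root_updates_per_second(sub_channels: int, updates_per_day: float) -> float:
    """Average rate at which a root channel changes if every sub-channel
    update has to be reflected in the root."""
    return sub_channels * updates_per_day / SECONDS_PER_DAY


# Assumed scenarios: each collection is updated 3 times per day.
print(round(root_updates_per_second(5_000, 3), 2))
print(round(root_updates_per_second(100_000, 3), 2))
```

With these assumed numbers, a few thousand collections already change the root every few seconds, and on the order of a hundred thousand collections push it to several updates per second. Either rate is far more frequent than the daily re-publishing of a channel torrent proposed later in this issue.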
The BitTorrent problem
The power of torrents comes from exploiting "the network effect": one can start sharing the data as soon as one has at least one piece of the torrent. This is made possible by hashing the data as a whole, piece by piece, up front, so anyone can check any piece for correctness and immediately share it. Even with "mutable torrents" support, sharing a single collectively-authored channel means either:
a. seeding thousands of tiny torrents for each sub-channel;
b. downloading the updated version several times per second, while simultaneously serving thousands of previous versions.
As a system, BitTorrent was never intended for this kind of usage: low-latency updates and low-volume transfers.
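The piece-checking property, and why it clashes with frequent updates, can be illustrated with a simplified sketch. This is not the real BitTorrent wire format: `content_id` is a stand-in for the infohash (which in BitTorrent v1 is the SHA-1 of the bencoded info dictionary), and the piece size is unrealistically small.

```python
import hashlib

PIECE_SIZE = 4  # tiny on purpose; real torrents use far larger pieces


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """SHA-1 digest of every fixed-size piece, as in a v1 torrent's piece list."""
    return [
        hashlib.sha1(data[i:i + piece_size]).digest()
        for i in range(0, len(data), piece_size)
    ]


def verify_piece(index: int, piece: bytes, hashes: list) -> bool:
    """A peer can check one piece without having the rest of the data."""
    return hashlib.sha1(piece).digest() == hashes[index]


def content_id(hashes: list) -> str:
    """Simplified stand-in for an infohash: a digest over all piece hashes."""
    return hashlib.sha1(b"".join(hashes)).hexdigest()


old = b"channel metadata v1"
new = b"channel metadata v2"
old_hashes = piece_hashes(old)

print(verify_piece(0, old[:PIECE_SIZE], old_hashes))       # a genuine piece
print(verify_piece(0, b"evil", old_hashes))                # a tampered piece
print(content_id(old_hashes) == content_id(piece_hashes(new)))  # same swarm?
```

Any single piece can be verified in isolation, which is what lets peers share immediately. The flip side is that even a one-byte change produces a different identifier, so every update effectively starts a new swarm.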
The solution
BitTorrent should only be used for bulk transfers of big channels. The channel infohashes should be updated automatically, say, daily. For the online updates, we should employ something like a DHT-based pub-sub system (e.g. PolderCast). Then, the two systems will together cover all the bases: