Tag based content discovery and crowdsourcing #6214
Comments
Comments regarding the tag-based organisation and why we need this in the first place: Why tag-centric organisation?
What is tag-centric, even? Tag-centric organisation covers people, content, and metadata.
The vision for Tribler:
|
This issue re-does work that was already conducted 14 years ago, but failed at the time. Our 'recent' tag work dates from only 10 years ago. High-performance implementation of tag-based discovery: https://dl.acm.org/doi/abs/10.1145/2063576.2063852 |
Draft Wireframes for tags (updated 10.09.21): Source: https://drew2a.notion.site/Tags-c1567365a1c94ce78271257f0aa19b06 |
Solid progress. |
Trying out something new here. When it comes to interaction design, it is common to work with user stories that help to think about the user. Even after writing the simple user stories below, I feel that they are very helpful in getting rid of the bias towards the developer that we most likely all have. Feel free to improve/give feedback on the following user stories.
Personas
I can think of two personas:
Epic
Within this issue, we focus on persona 2 since the goals of persona 1 are addressed by different components. Tribler currently has a subpar user experience when it comes to finding and recommending content. As discussed before, we want to see if tags are able to improve this situation. As an overarching goal, Tribler should be extended with the following functionality:
User stories
For a minimal version of tags, I see the following three user stories:
Note: each user story should be clear, feasible, and testable, also see this article. |
After some discussions and mock-ups, here's a GUI preview of the resolution of user story 1: Note that I use the "GUI test mode" for prototyping so the tags/titles do not make sense yet. I'm hovering over the 'edit' button in the first row. Clicking on the pencil will bring up a dialog where a user can suggest/remove tags (we reached majority consensus on using a dialog), but that dialog is not ready yet. Color scheme/margins/paddings/sizes have not been finalized yet. |
We made a few design decisions:
|
Something to think about: do we want to build a community of "taggers" after the launch of 7.11? Or should we let our users know via TorrentFreak after a few more months of iterations and improvements?
|
Building a community would be great, but would probably require more work beyond a minimal version (e.g., making the contributions of a particular user visible). So let's first iterate and improve the current system. |
Design decisions behind the DB:
class TorrentTagOp(db.Entity):
    id = orm.PrimaryKey(int, auto=True)
    torrent_tag = orm.Required(lambda: TorrentTag)
    peer = orm.Required(lambda: Peer)
    operation = orm.Required(int)
    time = orm.Required(int)
    signature = orm.Required(bytes)
    updated_at = orm.Required(datetime.datetime, default=datetime.datetime.utcnow)
    orm.composite_key(torrent_tag, peer)

class TorrentTag(db.Entity):
    ...
    added_count = orm.Required(int, default=0)
    removed_count = orm.Required(int, default=0)

class TorrentTag(db.Entity):
    ...
    local_operation = orm.Optional(int)

cc: @kozlovsky |
@kozlovsky , wouldn't using a separate DB make it impossible to do complex queries involving both Metadata store data and Tags data? |
@ichorid I think that with a separate tag database the development of an initial version of the tag-based system may be easier. Regarding queries, with our current approach for FTS search, it should be no difference between a single database and two separate databases. If it would be necessary, we can combine databases later, or even just attach the tag database to the metadata store DB. |
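The "attach" option mentioned above can be sketched with SQLite's `ATTACH DATABASE`, assuming both stores are plain SQLite files. The table and column names below (`torrents`, `tags`, `infohash`, `title`, `tag`) are hypothetical and chosen only for illustration; they are not Tribler's actual schema.

```python
import os
import sqlite3
import tempfile


def build_demo_databases(directory):
    """Create two small SQLite files with a hypothetical schema; return their paths."""
    metadata_path = os.path.join(directory, "metadata.db")
    tags_path = os.path.join(directory, "tags.db")

    conn = sqlite3.connect(metadata_path)
    conn.execute("CREATE TABLE torrents (infohash TEXT PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO torrents VALUES (?, ?)",
        [("ih1", "Ubuntu 22.04"), ("ih2", "Debian 11"), ("ih3", "Some Movie")],
    )
    conn.commit()
    conn.close()

    conn = sqlite3.connect(tags_path)
    conn.execute("CREATE TABLE tags (infohash TEXT, tag TEXT)")
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?)",
        [("ih1", "linux"), ("ih2", "linux"), ("ih3", "movie")],
    )
    conn.commit()
    conn.close()
    return metadata_path, tags_path


def titles_with_tag(metadata_path, tags_path, tag):
    """Join the metadata DB with the attached tag DB in a single query."""
    conn = sqlite3.connect(metadata_path)
    try:
        # Attach the separate tag database under the schema name "tagsdb".
        conn.execute("ATTACH DATABASE ? AS tagsdb", (tags_path,))
        rows = conn.execute(
            "SELECT t.title FROM torrents AS t "
            "JOIN tagsdb.tags AS g ON g.infohash = t.infohash "
            "WHERE g.tag = ? ORDER BY t.title",
            (tag,),
        ).fetchall()
        return [title for (title,) in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    metadata_db, tags_db = build_demo_databases(demo_dir)
    print(titles_with_tag(metadata_db, tags_db, "linux"))
```

With this approach the two databases stay separate files on disk, yet a query can still join across them when needed.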
Tag Reinforcement
Not sure if the suggestion below is applicable/suitable for the first version, but it is open for discussion.
Problem: To address the most trivial poisoning attacks, we decided that a particular tag will only be displayed when two identities have suggested it (thresholding). However, the chance that two users independently come up with the same tags for the same content is rather low. Even with a threshold of 2, I predict that much content will remain visibly untagged.
Potential solution: We can help the user by showing tags that have been suggested by other users but don't have enough support yet (i.e., haven't reached the threshold). This indication (e.g., "Suggestions: X, Y, Z" or "Suggested by others: A, B, C") should be part of the dialog where a user can add/remove tags, for example, below the input field. To prevent visual clutter, we should limit the number of suggestions shown. |
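A minimal sketch of the thresholding and suggestion idea described above. The threshold of 2 comes from the discussion; the suggestion limit of 3, the function name, and the `(peer_id, tag)` input shape are assumptions made for illustration only.

```python
from collections import defaultdict

SHOW_THRESHOLD = 2   # from the discussion: a tag is shown once two identities suggested it
MAX_SUGGESTIONS = 3  # assumption: cap on suggestions shown to limit visual clutter


def split_tags(suggestions, threshold=SHOW_THRESHOLD, max_suggestions=MAX_SUGGESTIONS):
    """Split suggested tags into (visible, suggested_by_others).

    `suggestions` is an iterable of (peer_id, tag) pairs for one torrent.
    """
    supporters = defaultdict(set)
    for peer_id, tag in suggestions:
        # A set ensures that one identity counts only once per tag.
        supporters[tag].add(peer_id)

    visible = sorted(tag for tag, peers in supporters.items() if len(peers) >= threshold)
    pending = [tag for tag, peers in supporters.items() if len(peers) < threshold]
    # Best-supported pending tags first, alphabetical order as a tie-breaker.
    pending.sort(key=lambda tag: (-len(supporters[tag]), tag))
    return visible, pending[:max_suggestions]


if __name__ == "__main__":
    demo = [("a", "linux"), ("b", "linux"), ("a", "iso"), ("c", "ubuntu")]
    print(split_tags(demo))
```

Tags in the second list would be the ones offered in the add/remove dialog as "Suggested by others".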
If you need inspiration, there have been some academic works that look at the tag reinforcement of user-generated tags in the Steam Tags system (e.g., http://dx.doi.org/10.1145/3377290.3377300). |
Dataset for tagging: https://github.com/MTG/mtg-jamendo-dataset |
The current version of tags has been running successfully for a few months now, and we have seen several tags that have been created by different users. As the next step, we want to use these tags and our existing infrastructure to improve the search experience. Concretely, our first goal is to identify and bundle torrents that describe similar content. Our upcoming improvements are also a key step towards readying our infrastructure to build and maintain a global knowledge graph. This knowledge graph can act as a fundamental primitive for upcoming science in the domain of content search, content navigation, and eventually content recommendation. |
library science knowledge - related work. The manifestation versus item abstraction plus tagging. |
"Justin Bieber is gay" scientific problem - tag spam
MeritRank is needed to fix this spam issue in the Tribler future. The fans and fame of artists also attract Internet trolls. We have in the past cofounded the Musicbrainz music metadata library. This crowdsourcing library has a unique dataset of votes on tags with explicit spam. See the
Bieber has a profile page. The next step in our semantic search roadmap is modelling the split between concept and materialisation. The knowledge graph should contain both types of entries. See the 1994 early scientific beginnings of a solution: gossip, signals, and reputation. Simple central reputation system of central profiles. Publication venue: https://www.frontiersin.org/research-topics/19868/human-centered-ai-crowd-computing or |
Dev meeting brainstorm outcome: Martijn has/had a crawler running with Tag-crowdsourcing. Check status @drew2a and 1-day dataset analysis with live "remove tag" within Tribler 7.13 release? |
To describe the current state of the
Database
The full schema is available at It describes tribler/src/tribler/core/components/key/key_component.py Lines 24 to 26 in 26b0be8
In the tribler/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py Lines 60 to 65 in 76de562
Where tribler/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py Lines 32 to 57 in 26b0be8
Statement examples:
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash1', predicate=ResourceType.TAG, object='tag1')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash2', predicate=ResourceType.TAG, object='tag2')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash3', predicate=ResourceType.CONTENT_ITEM, object='content item')
Due to the inherent lack of trust in peers, we cannot simply replace an existing statement with a newly received one. Instead, we store all operations. There are two operations available for peers: tribler/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py Lines 26 to 29 in 26b0be8
All operations are recorded in the database, allowing for the calculation of the final score of a specific operation based on the cumulative actions taken by all peers. This approach enables a comprehensive assessment of each operation's overall impact within the network. Currently, a simplistic approach is employed, which involves merely summing all the 'add' operations (+1) and subtracting the 'remove' operations (-1) across all peers. This method is intended to be replaced by a more sophisticated mechanism, the tribler/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py Lines 98 to 100 in 26b0be8
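The simple summation described above can be sketched as follows. The `Operation` values, the `(peer_id, operation, clock)` tuple shape, and the rule that only each peer's latest operation (highest clock) counts are assumptions for illustration, loosely suggested by the `clock` field and the per-peer uniqueness in the schema; this is not the actual Tribler implementation.

```python
from enum import IntEnum


class Operation(IntEnum):
    # Hypothetical numeric values for the two operations available to peers.
    ADD = 1
    REMOVE = 2


def statement_score(ops):
    """Return (number of adds) - (number of removes) over each peer's latest operation.

    `ops` is an iterable of (peer_id, operation, clock) tuples for one statement.
    """
    latest = {}
    for peer_id, operation, clock in ops:
        # Keep only the most recent operation per peer, judged by its clock.
        if peer_id not in latest or clock > latest[peer_id][1]:
            latest[peer_id] = (operation, clock)

    added = sum(1 for operation, _ in latest.values() if operation == Operation.ADD)
    removed = sum(1 for operation, _ in latest.values() if operation == Operation.REMOVE)
    return added - removed


if __name__ == "__main__":
    history = [
        ("peer1", Operation.ADD, 1),
        ("peer2", Operation.ADD, 1),
        ("peer3", Operation.REMOVE, 1),
    ]
    print(statement_score(history))
```

Because every peer contributes exactly +1 or -1, a single identity cannot inflate a statement's score by repeating the same operation.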
ER diagram
erDiagram
Peer {
int id PK "auto=True"
bytes public_key "unique=True"
datetime added_at "Optional, default=utcnow()"
}
Statement {
int id PK "auto=True"
int subject_id FK
int object_id FK
int added_count "default=0"
int removed_count "default=0"
int local_operation "Optional"
}
Resource {
int id PK "auto=True"
string name
int type "ResourceType enum"
}
StatementOp {
int id PK "auto=True"
int statement_id FK
int peer_id FK
int operation
int clock
bytes signature
datetime updated_at "default=utcnow()"
bool auto_generated "default=False"
}
Misc {
string name PK
string value "Optional"
}
Statement }|--|| Resource : "subject_id"
Statement }|--|| Resource : "object_id"
StatementOp }|--|| Statement : "statement_id"
StatementOp }|--|| Peer : "peer_id"
|
The next chapter is dedicated to the community itself.
Community
The algorithm of the community's operation:
tribler/src/tribler/core/components/knowledge/community/knowledge_payload.py Lines 8 to 18 in 44e2235
tribler/src/tribler/core/components/knowledge/community/knowledge_community.py Lines 126 to 128 in 44e2235
tribler/src/tribler/core/components/knowledge/community/knowledge_community.py Lines 119 to 124 in 44e2235
Autogenerated Knowledge
In addition to the user-added knowledge statements, there are also auto-generated statements. The KnowledgeRulesProcessor was developed for the automatic generation of knowledge: it analyzes the records in the database and generates knowledge based on predefined regex patterns found in them. For example, here is a definition of auto-generated tags:
This is a definition of Ubuntu, Debian and Linux Mint content items.
Auto-generation of knowledge occurs through two mechanisms:
Auto-generated knowledge is not gossiped across the network. |
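To illustrate the regex-based generation described above, here is a minimal sketch. The patterns, function names, and output format are hypothetical and written only for this example; they are not the rule definitions used by the KnowledgeRulesProcessor.

```python
import re

# Hypothetical rule: text inside square brackets (3 to 50 characters) becomes a tag.
TAG_RULE = re.compile(r"\[([^\[\]]{3,50})\]")

# Hypothetical rules: a distribution name followed by a version becomes a content item.
CONTENT_ITEM_RULES = {
    "Ubuntu": re.compile(r"ubuntu[-._\s]?(\d{1,2}\.\d{2})", re.IGNORECASE),
    "Debian": re.compile(r"debian[-._\s]?(\d{1,2}(?:\.\d{1,2})*)", re.IGNORECASE),
    "Linux Mint": re.compile(r"linux\s?mint[-._\s]?(\d{1,2}(?:\.\d{1,2})*)", re.IGNORECASE),
}


def extract_tags(title):
    """Return lower-cased tags found in a torrent title."""
    return [match.group(1).strip().lower() for match in TAG_RULE.finditer(title)]


def extract_content_items(title):
    """Return content item names such as 'Ubuntu 22.04' found in a torrent title."""
    items = []
    for name, pattern in CONTENT_ITEM_RULES.items():
        for match in pattern.finditer(title):
            items.append(f"{name} {match.group(1)}")
    return items


if __name__ == "__main__":
    print(extract_tags("Some Movie [1080p] [x265]"))
    print(extract_content_items("ubuntu-22.04.1-desktop-amd64.iso"))
```

Statements produced this way would be marked as auto-generated (compare the `auto_generated` column in the ER diagram).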
The third paragraph is dedicated to the UI.
UI
Three changes have been made to the UI:
Also, a feature for searching by tags was added, but this feature hasn't been introduced to the users yet. |
Tags have now been implemented and even documented (above). With that, this issue is complete. |
After a thorough discussion, we came to the following architecture for the Tags system: