Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The relevance ranking algorithm when searching in Tribler needs to be redesigned #2250

Closed
devos50 opened this issue Jun 2, 2016 · 16 comments
Assignees

Comments

@devos50
Copy link
Contributor

devos50 commented Jun 2, 2016

When searching in Tribler, search results (which can consist of torrents and channels) are ranked according to several criteria. Channels and torrents are sorted separately from each other. The associated channels of the torrents are also determined and in case a torrent is present in more than one channel, the channel with the highest number of votes is taken.

Channels are sorted based on the number of torrents in the channel. The algorithm to sort torrents is more complex. First, an array of five scores are constructed. These values are determined as follows:

  1. The number of matching keywords in the name of the torrent
  2. The number of matching keywords in the name of the files the torrent contains
  3. The position of the lowest matching keyword in the torrent name. For instance, when searching for ‘Away From Keyboard’ and ‘of’ is a matching keyword, this position will be 2.
  4. The number of matching keywords in the extensions of the file names of the torrent.
  5. A score that is based on several (normalised) attributes of the torrent. This score is determined after the set of local search results are constructed. The following formula is used: score = 0.8 * normalized_num_seeder - 0.1 * normalized_num_negative_votes + 0.1 * normalized_num_subscriptions. Note that the number of negatives votes and number of subscriptions do not exist anymore and will always be 0.

After the torrents and channels are sorted, the results are put in a list. Every five torrent results, a channel result is inserted. This is to make sure that channels are also showing up in the search results (and more important, are visible to the user).

Also, I think it’s better to not be dependent on other search results to make streaming torrents from the database possible (like we do now when normalising). In other words, the final relevance score in the new algorithm should only be dependent on the data that this torrent object has.

I will start thinking about a new (more simple) algorithm that we could use for this and add the proposal in this issue.

Possible improvements:

  • Improve the relevance algorithm by taking the number of matches inside the torrent name/torrent file names in consideration. For instance, when searching for 'pioneer one', a torrent named 'pioneer one collection' probably has a higher relevance than a torrent named 'pioneer one - episode 3, season 4'. This will prevent situations like the following, when I expect 'test_tribler.txt' to show up higher than 'test.tribler.upc.iso'.
    9167a65ab13479c028a35e1a9625f2c6
  • I think the lowest matching term criteria should be removed or get lower priority since I doubt that it's always true that terms that are matching earlier in torrent A than in torrent B makes torrent A more relevant. For instance, torrent A = "game of thrones" and torrent B = "thrones in someplace". If a user searches for 'thrones', torrent B will be marked as more relevant which feels weird.
  • Users are making mistakes when typing. For instance, if I type "triler", I might mean "Tribler" or "triller" so I would like to see results that are matching these queries. This can be achieved by calculating the levenshtein distance with know keywords. One issue here is performance. Fortunately, sqlite contains support for levenshtein collation.
    4c590b44f5fabf0762f4fc29c01f7fa8
  • The sorting of the channels is currently only dependent on the number of torrents in the channel. If I'm not mistaken, it is not even sorted on the matching in the name! See the following example for instance, where I expect the channel with 'pirate' in the name, to show up higher:
    f29b7edfd88b07d83b8d69d10746db87
  • When building the FTS3 index in the sqlite database, duplicate words are removed. This means that when searching for 'years', a torrent named 'best years' will be ranked equal to a torrent named 'years and years'. However, 'years and years' should have a higher relevance since the word 'years' occurs twice. Another example is when searching for 'iso'. A torrent file that contains 100 iso files is ranked equivalent to a torrent file that only has one iso file.
  • By removing the number of seeders criteria when sorting, less popular torrent have a better chance of coming up higher in the lists, increasing the chance the user will download these and seed the torrent afterwards.
  • Our current algorithm only returns results that matches all given keywords. So when searching for 'pirate audio', only torrents returned that have both terms in it, are returned. It might be better to show the user also torrents matching 'pirate' and matching 'audio' (while still giving more relevance to torrents that matches 'pirate audio').

Proposal for a new algorithm:

The new algorithm should take the matching with keywords better into account and be less dependent on other (confusing) variables such as number of seeders and number of torrents in the channel.

  • The first step should be to adjust our FTS3 index to not filter out duplicate keywords in the torrent name, file names and file extensions. This will have consequences on the amount of storage required and requires rebuilding the FTS3 index in the upgrader.
  • Slightly change the FTS matching query to account for partial term results.
  • Use the BM25F algorithm to determine the relevance of the query in the torrent name, file names and file extensions. Note that we have to calculate the BM25 score three times. After that, we can take a weighted average of these scores. The score of the torrent name matching should have the highest priority. The relevance scores are sent to the GUI over the REST API.
  • Channel name matching is done in the same way, however, the only difference is that we are not using FTS for the channels. We also calculate the BM25 score of the channel name matching. These relevance scores are sent to the GUI over the REST API.
  • If after several seconds there are no incoming results, fetch the keyword(s) with the closest Levenshtein distance from the database and display them to the user with a message similar to the following: "No results found so far. Did you mean to search for ..."?

The above algorithm takes all of the improvements into account I outlined above.

@whirm whirm added this to the Backlog milestone Jun 3, 2016
@devos50
Copy link
Contributor Author

devos50 commented Jun 3, 2016

@whirm I need this for the new GUI and I almost have a proposal ready. Can you adjust the milestone and assign the issue to me?

@whirm whirm modified the milestones: New GUI, Backlog Jun 3, 2016
@devos50
Copy link
Contributor Author

devos50 commented Jun 5, 2016

A first implementation of the algorithm as outlined above can be found here: https://gist.github.com/devos50/71b44339914259c82515f21846ee8173. Some minor tweaking might have to be done to make relevance ranking even better.

I will not implement this for the old GUI but rather make the REST API use the new algorithm.

@synctext
Copy link
Member

synctext commented Jun 6, 2016

nice!
please try to write text and document this in a thesis.tex manner, instead of less formal issue rockets. saves work..

@ghost
Copy link

ghost commented Sep 9, 2016

It might be worth considering that names aren't the only important thing a keyword search may be looking for - a 'deep search' of the content of files, too, used to be possible in earlier p2p networks (especially within special parts of files, like ID3 tags) and may yet be an important feature, long term to keep our eyes on.

Another thing worth thinking about is whether something like 'pagerank' could be relevant - whereas we have categories/channels that contain torrents, if you search for, for example 'thrones'
if there's a channel of thrones it might be worth giving some kind of priority, even if a small amount, to all the torrents on that channel --
Channel: thrones
S01E01.theora.ogg.torrent.magnet
S01E02.theora.ogg.torrent.magnet
S01E03.theora.ogg.torrent.magnet
etc

@synctext
Copy link
Member

synctext commented Sep 10, 2016

@themusicgod1 Yes, we need accurate rich metadata, proper pageranking/itemranking, and good ID3 stuff.

btw. Worked on this myself 16 years ago, sadly never got solved (http://cd.ro.nu/hypermail/1046.html). Maybe an extra 16 years:-)

@synctext
Copy link
Member

synctext commented Apr 26, 2021

Re-opening this issue again. To also show this have getting attention in past years. Last update was 5 years ago.
image
ToDo: do a maximum 1-week only improvement sprint with basic heuristics. We will learn along the way, evaluate and come up with better design. Learn-by-doing methodology. Each matching swarm gets a score to determine the ranking. Basic score is the swarm size. (e.g. 2000 peers swarm get higher ranking than 10 peers). Heuristics:

  1. Biggest swarm first: Biggest swarm matching the keyword nicely should come first.
  2. One word matching rule: If you search for the creative commons movie "Sintel", the shortest matching swarm should be promoted. Matches such as the.script.from.the.movie.Sintel.pdf get lower score. Penalty if you have too many non-matching words.
  3. First word is the best rule: searching for "Sintel", the match "Sintel-Part-II" gets higher score than "Part-of-Sintel".
  4. Maximum matching rule. Search for multiple terms, like open source movie: "The Internet's Own Boy: The Story of Aaron Swartz". Penalty if you do not match all original search terms.
  5. No extra in-between words rule. Searching for "Internet's Own Boy" places results such as "Internet's very Own Boy" and "Internet's very special Boy person" lower than the exacter match: "Internet's Own Boy - Aaron Swartz"

Just mix in matching channels names for now. Treat them equally as matching swarms (with a translated popularity).
Repeating: the goal is not to fully fix the "relevance ranking" problem, but to fix it 80% of the way and help us come up with a 97%-ish solution. Hence the time-boxed approach of 1 week of effort only please.

@synctext
Copy link
Member

Tribler Search Engine - channel expansion

We can finally start doing relevance ranking. The popularity community is producing results and channels should be getting real-time browsing after reset! 🎊 First priority is getting the above 5 heuristics implemented. Next heuristics, more invasive to QT:
6. When a channel matches exactly the name of the "keyword search terms" it is shown in expanded state. See example
7. When multiple channels exactly match the search, only the strongest is expanded.
8. Adding thumbnails. Our users could be exposed to obscene thumbnails. Solution: only if the "family filter" is disabled by the user we display the thumbnail of the channel.
Tribler_search_engine_design_v1

@ichorid
Copy link
Contributor

ichorid commented May 28, 2021

  1. When a channel matches exactly the name of the "keyword search terms" it is shown in expanded state. See example

Yep. Users were even asking for something like that.

@devos50 devos50 removed their assignment Jun 30, 2021
@qstokkink qstokkink modified the milestones: New GUI, Backlog Sep 27, 2021
@kozlovsky
Copy link
Contributor

kozlovsky commented Jun 16, 2022

During the last three weeks, I was finally able to work on the search quality improvements and have some interesting results. I can do a live demo tomorrow.

First, I implemented a new torrent_rank ranking function that completely replaces the Okapi BM25 function inside SQL queries:

torrent_rank(query, title, seeders, freshness) -> rank

rank is a float number in the range [0, 1). The better the match, the higher the rank. In my opinion, the new function does almost precisely what @synctext expects from it. Let's look at some examples:

The new function rates exact title match relatively high, but not the highest without the information about seeders and freshness:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81

When the number of seeders is known, the rank is higher, even better if the torrent creation date is known:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100) -> 0.855
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768

It is better for a torrent to have more seeders and to be recent:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=200, freshness=100*DAYS) -> 0.8923076923076921
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=100*DAYS) -> 0.8999999999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=30*DAYS) -> 0.9262499999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=10*DAYS) -> 0.9506249999999999

If a torrent has creation date in the future and its freshness is negative, the freshness is ignored:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=10*DAYS) -> 0.8775
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=-10*DAYS) -> 0.81

Now, let's concentrate on the title ranking. The exact match is good:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81

If the title contains some excess words at the end, the rank is slightly worse:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299

The more excess words we have in the title, the lower is the rank:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny Movie Creation History') -> 0.7917293233082708

If the title contains in-between words, the rank is lower. The closer extra words are to the start of the title, the bigger the penalty. The biggest penalty is when extra words appear at the start of the title:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299
torrent_rank('Big Buck Bunny', 'Big Buck Brown Bunny') -> 0.7506109979633401
torrent_rank('Big Buck Bunny', 'Big Bad Buck Bunny') -> 0.7424206815511163
torrent_rank('Big Buck Bunny', 'Boring Big Buck Bunny') -> 0.7312500000000001

The more extra in-between words the title has, the lower the rank:

click to expand
torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81

torrent_rank('Big Buck Bunny', 'Big Buck Bunny a') -> 0.8038167938931299
torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b') -> 0.7977272727272727
torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b c') -> 0.7917293233082708

torrent_rank('Big Buck Bunny', 'Big Buck a Bunny') -> 0.7506109979633401
torrent_rank('Big Buck Bunny', 'Big Buck a b Bunny') -> 0.6993358633776092
torrent_rank('Big Buck Bunny', 'Big Buck a b c Bunny') -> 0.6546181172291297

torrent_rank('Big Buck Bunny', 'Big a Buck Bunny') -> 0.7424206815511163
torrent_rank('Big Buck Bunny', 'Big a b Buck Bunny') -> 0.6852494577006508
torrent_rank('Big Buck Bunny', 'Big a b c Buck Bunny') -> 0.6362537764350453

torrent_rank('Big Buck Bunny', 'a Big Buck Bunny') -> 0.7312500000000001
torrent_rank('Big Buck Bunny', 'a b Big Buck Bunny') -> 0.6664556962025318
torrent_rank('Big Buck Bunny', 'a b c Big Buck Bunny') -> 0.6122093023255814

The wrong order of words in the title imposes a penalty:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Bunny Buck') -> 0.7476923076923077

The penalty is very high if some words are missed from the title. The closer the missed word to the beginning of the query, the bigger the penalty. Several missed words lead to a huge penalty:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck') -> 0.47250000000000003
torrent_rank('Big Buck Bunny', 'Big Bunny') -> 0.44181818181818183
torrent_rank('Big Buck Bunny', 'Buck Bunny') -> 0.405
torrent_rank('Big Buck Bunny', 'Bunny') -> 0.2858823529411765

If the query string or title string contains duplicates, the ranking function correctly takes them into account. If duplicates are in the query string, they should also be in the title, or the title receives a major penalty. If duplicates are in the title but not in the query, it gives a slight penalty to the result (and the order is also important):

torrent_rank('Run Bunny Run', 'Run Bunny Run') -> 0.81
torrent_rank('Run Bunny', 'Run Bunny Run') -> 0.8033057851239669  # excess word in the title
torrent_rank('Run Bunny Run', 'Run Run Bunny') -> 0.7476923076923077  # wrong order
torrent_rank('Run Bunny Run', 'Run Bunny') -> 0.47250000000000003  # missed word in the title

When title rank is not ideal, having many seeders and the recent torrent creation date is still useful. The torrent with a non-exact matching title can be higher in the search result list if it has a huge number of seeders. But if the title match is bad, having a huge number of seeders does not help:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=5, freshness=365*DAYS) -> 0.8211573236889693

torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=5, freshness=365*DAYS) -> 0.8148889471722596
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=1000, freshness=30*DAYS) -> 0.934177654406662

torrent_rank('Big Buck Bunny', 'Bunny', seeders=5, freshness=365*DAYS) -> 0.28982023189022443
torrent_rank('Big Buck Bunny', 'Bunny', seeders=1000, freshness=30*DAYS) -> 0.33224598930481275

Now some more examples that @synctext gives:

  1. Biggest swarm first: Biggest swarm matching the keyword nicely should come first.
torrent_rank('Sintel', 'Sintel') -> 0.81
torrent_rank('Sintel', 'Sintel', seeders=100) -> 0.855
torrent_rank('Sintel', 'Sintel', seeders=1000) -> 0.8918181818181817
  1. One word matching rule: If you search for the creative commons movie "Sintel", the shortest matching swarm should be promoted. Matches such as the.script.from.the.movie.Sintel.pdf get lower score. Penalty if you have too many non-matching words.
torrent_rank('Sintel', 'Sintel') -> 0.81
torrent_rank('Sintel', 'the.script.from.the.movie.Sintel.pdf') -> 0.5210526315789475
  1. First word is the best rule: searching for "Sintel", the match "Sintel-Part-II" gets higher score than "Part-of-Sintel".
torrent_rank('Sintel', 'Sintel-Part-II') -> 0.7955357142857143
torrent_rank('Sintel', 'Part-of-Sintel') -> 0.6649253731343284
  1. Maximum matching rule. Search for multiple terms, like open source movie: "The Internet's Own Boy: The Story of Aaron Swartz". Penalty if you do not match all original search terms.
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Own Boy: The Story of Aaron Swartz") -> 0.81
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Boy: The Story of Aaron Swartz") -> 0.4984615384615385
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Story of Aaron Swartz") -> 0.1915720319099015
  1. No extra in-between words rule. Searching for "Internet's Own Boy" places results such as "Internet's very Own Boy" and "Internet's very special Boy person" lower than the exacter match: "Internet's Own Boy - Aaron Swartz"
torrent_rank("Internet's Own Boy", "Internet's Own Boy") -> 0.81
torrent_rank("Internet's Own Boy", "Internet's Own Boy - Aaron Swartz") -> 0.7985915492957747
torrent_rank("Internet's Own Boy", "Internet's Very Own Boy") -> 0.7509933774834436
torrent_rank("Internet's Own Boy", "Internet's Very Special Boy Person") -> 0.43531669865642997

As the ranking function is integrated directly into the SQL query, it works fast. For me, the difference with the previous search algorithm looks very significant - now, the query to the local database returns precisely the result I expect it to give me. Remote queries will start receiving improved results after a version of Tribler with new ranking functions is officially released and installed by users.

I hope to have a chance to demonstrate the proposed search algorithm tomorrow at the lab (with some UI improvements).

@kozlovsky
Copy link
Contributor

kozlovsky commented Jun 16, 2022

Regarding the ideal search UI, I think it should satisfy the following guidelines:

  1. As multiple memes said, nobody goes to the second page of Google search. For me, it means that it is more important to display the most relevant results at the top of the list than to display the most extended result list possible. And this, in turn, means that pagination of the result list is not necessary. It is enough to display a single result page with several hundred torrents, but these torrents should be the best match for a query, with the most relevant results at the top.
  2. The search should be fast. As Google reports, even a slight delay in displaying search results affects user experience. For me, it means that the Tribler search UI should not add artificial delays to integrate local and remove search results. It is better to provide local results as fast as possible and display remote results later.
  3. The difference between local and remote results is a technical detail of Tribler implementation. What was remote before the search becomes local right after, so it is confusing to display remote and local results on separate tabs, and they should be displayed in a single list (as we have now).
  4. On the other side, displaying the remote results should not lead to situations when a user attempts to click on a torrent and a new remotely received torrent is inserted into the same position at that moment, which leads to the opening of a wrong torrent. So, displaying remote results should not be confusing to the user.
  5. The search result list should not display multiple torrents with the same title, as it is hard for a user to find a difference between them. With the current UI, we have results for some searches where the entire first page is filled with precisely the same title without any distinction between result rows. In my opinion, it is better to collapse multiple results with the identical title into a single group and expand it later on to user demand.

I'll try to demonstrate the UI prototype that satisfies these guidelines.

@kozlovsky
Copy link
Contributor

kozlovsky commented Jun 17, 2022

The screenshot demonstrates how the proposed search UI looks. It is very simple and does not contain any complicated UI elements. After a user types in a query string, it immediately sees local results (within seconds). If later remote results bring some new torrents, a search title starts displaying the number of remotely received torrents and updates this number dynamically. The user can click on the title of the search page to immediately add received torrents to the result list. In that case, new torrents are sorted according to the rank and inserted in the proper places. In the current draft implementation, these new torrents have the * symbol added to the beginning of the title so that it is possible to find new torrents in the list for debugging purposes quickly. If we decide to use this concept of the simplified search UI, then instead of the prefix, we can temporarily highlight remotely received torrents with a bright color.

2022-06-17_07-50-06

@synctext
Copy link
Member

synctext commented Jun 22, 2022

progress since 6 years! let not be afraid of a single refresh, users hate to click on things for them to be useful. Search: display local results and don't wait for remote search. Wait maximum 2000 ms for remote search to come in?

Data driven:

  • measure typical remote search response time?
  • build a workload model?
  • try to find root cause of 1% slow queries (Max. 1 week)

@synctext
Copy link
Member

synctext commented Sep 12, 2022

Reminder, in the future we would like to do live searches.
We have some screenshots taking feature within Application tester: https://jenkins-ci.tribler.org/job/Deployment-Tests/job/Run_Tribler_Release/job/Run_Win10_64bit/ws/screenshots/screenshot_1661867305.jpg
Currently we don't use the live network, in order not to make it unhealthy with possible faulty code! Should we create an isolated testnet for 10 peers which are queried using the remote search?

@synctext
Copy link
Member

synctext commented Oct 11, 2022

Solid progress, now we have dark SQL magic 🥳
Search is now much better with this upcoming PR. One issue still pending with performance. There has been a huge variance in showing results in the GUI (100 ms upto 3 seconds, with exceptional outliers of 10 seconds). Still unknown; race condition?? (still guessing)

We don't have any theoretically grounded algorithms to implement, first hire new phd student on this.

ToDo document magic constructs like term_weight = 5 / (5 + i). We currently still have hard coded values in the search ranking code torrent_rank('Big Buck Bunny', `Big Buck Brown Bunny`) == pytest.approx(0.75061099)

For 2022 and 2023 we want to improve search sophistication. We perceive 6 distinct levels (currently moving from level 1 to level 2). However, understanding what is going on in a distributed system composed of strangers with abusive spammers is known to be the hard problem. Reliable reproducing is essential.

  1. current Tribler with 30 years old BM25 algorithm
  2. ranking heuristics
  3. ranking with perfect metadata (1. tags and 2. thumbnails) image
  4. fuzzy learn-to-rank
  5. Row bundling
  6. perfect personalised learn-to-rank with perfect metadata and micro bubbles of similar superfans (e.g. Tiktok state-of-the-art)

@qstokkink qstokkink removed this from the Backlog milestone Aug 23, 2024
@synctext synctext self-assigned this Sep 5, 2024
@synctext
Copy link
Member

synctext commented Sep 5, 2024

@mg98 old 2016 issue, relevant for phd research.

@synctext
Copy link
Member

synctext commented Sep 5, 2024

Shifting this issue to phd student @mg98 😃

Epic issue: #7290

@synctext synctext closed this as completed Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants