Improved ranking for search results and updated search UI without the artificial delay at the loading screen #7025
Conversation
(branch force-pushed from 62799fb to 416f917)
Nice improvement! I added a couple of comments for the core logic.
I will check the GUI side immediately after the bugfix is completed.
src/tribler/core/components/metadata_store/remote_query_community/settings.py
@@ -1,13 +1,18 @@
 """
 Search utilities.

-Author(s): Jelle Roozenburg, Arno Bakker
+Author(s): Jelle Roozenburg, Arno Bakker, Alexander Kozlovsky
:)
We can remove the authors line altogether. I added the name just because the people originally mentioned here no longer have a relation to 85% of the code in this file :)
@drew2a thanks, I was able to reproduce it on Mac. On Windows, it does not appear.
(branch force-pushed from 63c86d8 to f30beae)
The ambition level is to wrap this up by the upcoming Friday.
(branch force-pushed from f981f8c to 879e7b1)
@drew2a I fixed the bug you mentioned. While doing this, I changed the way the dynamically added search result items are highlighted. In the UI, I want to highlight dynamically added results with a slightly different color so the user does not become disoriented when new rows suddenly shift the previous content of the search result list. Initially, I painted just the title column with a slightly different background color for dynamically added results. As your screenshot shows, it looks pretty ugly when the row is selected. To fix it, I first tried to use Qt roles for model data items.
Unfortunately, it turns out that all these Qt mechanics with roles do not work if a Qt widget or its parent has a CSS style. With any CSS style specified, Qt completely ignores all roles from models. As we extensively use CSS styling in Tribler, we cannot use roles to paint different search result items in different ways and cannot specify different background colors for different table rows. As a workaround, I switched to activating the hovered state for a short period for new items. That looks like the only way to specify a different background for some items. And in practice, the resulting highlighting looks quite good to me.
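For context, here is a minimal sketch (not Tribler code; it assumes PyQt5, and the model class and is_remote flag are hypothetical) of the role-based approach described above that had to be abandoned: the model returns a different background for remote rows via Qt.BackgroundRole, which the view ignores as soon as a stylesheet is applied.

# Minimal sketch of the abandoned role-based highlighting (assumption: PyQt5;
# SearchResultsModel and the is_remote flag are hypothetical, not Tribler code).
from PyQt5.QtCore import Qt, QAbstractListModel, QModelIndex
from PyQt5.QtGui import QBrush, QColor


class SearchResultsModel(QAbstractListModel):
    def __init__(self, items):
        super().__init__()
        self.items = items  # list of (title, is_remote) tuples

    def rowCount(self, parent=QModelIndex()):
        return len(self.items)

    def data(self, index, role=Qt.DisplayRole):
        title, is_remote = self.items[index.row()]
        if role == Qt.DisplayRole:
            return title
        if role == Qt.BackgroundRole and is_remote:
            # This brush is ignored by the view once a stylesheet is set on the
            # view or any of its parents, which is why the approach was dropped.
            return QBrush(QColor('#3a3a55'))
        return None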
@kozlovsky the bug has been fixed. Confirmed. The UI looks good to me. Note that dynamically added items are not grouped into the snippet:
@drew2a I suggest grouping remote results in a separate PR.
def calculate_rank(query: List[str], title: List[str]) -> float:
    """
    Calculate the similarity of the title to the query as a float value in the range [0, 1].
    """
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if
        # there are some title words before it, and then rotate the skipped title words to the end of the title.
        # This way, the least penalty goes to a title that has the query words in the proper order at the beginning
        # of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add a penalty for the skipped title words before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that were not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)
I am wondering... could `calculate_rank` be replaced by something simpler?
After some quick research, I prepared a test function that satisfies the @synctext requirements posted here:
def test_calculate_rank():
    # rule 2
    assert torrent_rank(query='Sintel', title='Sintel') > \
           torrent_rank(query='Sintel', title='the.script.from.the.movie.Sintel.pdf')

    # rule 3
    assert torrent_rank(query='Sintel', title='Sintel-Part-I') > \
           torrent_rank(query='Sintel', title='Part-of-Sintel')

    # rule 4
    assert torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy") > \
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story") > \
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story of Aaron Swartz")

    # rule 5
    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very Own Boy")
    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very special Boy person")
And came up with a simple `calculate_rank` function that is probably good enough for the first iteration of the search improvement.
The new version:
from difflib import SequenceMatcher
from typing import List


def calculate_rank(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size
The old version:
def calculate_rank(query: List[str], title: List[str]) -> float:
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if
        # there are some title words before it, and then rotate the skipped title words to the end of the title.
        # This way, the least penalty goes to a title that has the query words in the proper order at the beginning
        # of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add a penalty for the skipped title words before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that were not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)
def find_term(term: str, title: Deque[str]) -> Tuple[bool, int]:
    try:
        skipped = title.index(term)
    except ValueError:
        return False, 0

    title.rotate(-skipped)  # rotate skipped words to the end
    title.popleft()  # remove found word
    return True, skipped
There is a big practical difference in the results of these two functions. I will show it with practical examples a week later, when I return from the courses.
First, the suggested function does not work as expected.
Consider the usage of the original `calculate_rank` function.
>>> calculate_rank(["The", "Big", "Buck", "Bunny"], ["The", "Big", "Buck", "Bunny"])
1.0
>>> calculate_rank(["Bunny"], ["The", "Big", "Buck", "Bunny"])
0.7534246575342466
It is a bit easier to use the wrapping function `title_rank`, which is a lightweight wrapper around `calculate_rank` and returns the same values but accepts strings instead of lists:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny") # the same rank as above, but easier to read
1.0
>>> title_rank("Bunny", "The Big Buck Bunny")
0.7534246575342466
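For illustration, a minimal sketch of such a wrapper (an assumption; the actual Tribler implementation may differ) could simply split both strings into words and delegate to `calculate_rank`:

# Hypothetical sketch of the title_rank wrapper described above (not necessarily
# the exact Tribler implementation): split the strings and delegate the ranking.
def title_rank(query: str, title: str) -> float:
    return calculate_rank(query.split(), title.split())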
The function in the PR returns a value in the range [0..1], where 1.0 means an exact match, and lower values mean a non-exact match.
If I rewrite the `calculate_rank` function as you are suggesting, then the results will be:
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("Bunny", "The Big Buck Bunny")
2.1739130434782608
The result does not satisfy the requirements. The rank for the exact match is not 1.0 but some arbitrary big value:
>>> title_rank2("The Big Buck Bunny foo bar baz", "The Big Buck Bunny foo bar baz")
30.0
So, the meaning of the returned value is unclear to me here.
So, while I agree that it is possible to have an alternative function for rank calculation, the suggested one requires significant changes to be useful. The function implemented in the PR was already tested, and it works well.
Regarding the possibility of using `difflib.SequenceMatcher` as a base for implementing the rank function: it is possible, but the result will not be as good as for the function implemented in the PR. The implemented algorithm takes into account that different words have different weights in the final rank depending on their position. For torrent titles, this is crucial and should significantly impact the resulting order of sorted torrents. It is very common for a torrent title to have some additional words at the end that were not mentioned in the search phrase, like '1080', 'MP4', 'AVI', 'Eng', '2010', 'MKV', 'ZIP', '24FPS', etc. On the other hand, if there are additional words at the beginning or in the middle of the title, it usually means that the torrent is not a good match for the query, and its rank should receive a significant penalty.
For example:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny") # exact match
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Mp4 Zip 24FPS")
0.9790209790209791 # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Making of The Big Buck Bunny")
0.823529411764706 # higher penalty for excess words at the beginning of the title
Now, to be able to compare with the suggested simpler algorithm based on `difflib.SequenceMatcher`, I put the same words at the beginning and at the end of the title. The following is the result of the algorithm implemented in the PR:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny")
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
0.9790209790209791 # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
0.7567567567567567 # higher penalty for excess words at the beginning of the title
And this is the result for the suggested function:
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
13.5 # same value if the excess title words are at the beginning or at the end
>>> title_rank2("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
13.5 # same value if the excess title words are at the beginning or at the end
You can see that the suggested function not only returns a dubious rank but also does not take into account the difference between excess words at the beginning and at the end of the title; the rank is the same value of 13.5 in both cases.
Based on this, I stand by the function implemented in the PR and believe that it solves the task adequately.
If I understood you correctly, you have the following concerns about the suggested function:
- The output is not normalized to [0..1].
- It works worse for one corner case: "excess words at the beginning vs. at the end of the title".
About (1) -- yes, it is a draft. The output could easily be normalized; I didn't do it because it is not necessary for a prototype.
About (2) -- yes, but the suggested solution works pretty well for most cases and is extremely easy in terms of implementation.
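As a side note on (1), a hypothetical way to normalize the SequenceMatcher-based rank to [0..1] (an illustration, not part of the PR) would be to scale the longest-match length by the query length instead of multiplying by its absolute size:

from difflib import SequenceMatcher
from typing import List


# Hypothetical normalized variant (not from the PR): quick_ratio() is already in
# [0..1], and longest.size / len(query_str) is too, so the product stays in
# [0..1] and is 1.0 for an exact match.
def calculate_rank2_normalized(query: List[str], title: List[str]) -> float:
    query_str = ' '.join(query)
    title_str = ' '.join(title)
    if not query_str:
        return 1.0
    matcher = SequenceMatcher(None, query_str, title_str, autojunk=False)
    longest = matcher.find_longest_match(0, len(query_str), 0, len(title_str))
    return matcher.quick_ratio() * longest.size / len(query_str)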
Pros and cons in a single table:

|                               | Original Solution | Suggested Solution |
|-------------------------------|-------------------|--------------------|
| Johan Test                    | ✅                | ✅                 |
| Speed                         | 1.4               | 20.8               |
| LOC                           | 100               | 6                  |
| Kozlovsky Test (corner cases) | ✅                | ❌                 |
The speed test was performed with a benchmark similar to the one you used before:
import timeit
from difflib import SequenceMatcher
from typing import List


def calculate_rank2(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size


t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from __main__ import calculate_rank2\n")

print('For the test, time 10k invocation of function, repeat 5 times')

function1_results = t1.repeat(repeat=5, number=10000)
print(f'Original function speed: {function1_results}')

function2_results = t2.repeat(repeat=5, number=10000)
print(f'Suggested function speed: {function2_results}')

# Use the fastest result from each benchmark as it is the least affected by other parallel tasks and so the most precise
slowdown = min(function2_results) * 100 / min(function1_results) - 100
print(f'The slowdown is {slowdown} percent')
To summarize the above, my opinion is that we should keep your solution despite its complexity and size 🤝
Actually, the performance test requires a small fix. On my machine, the test shows that the function based on `SequenceMatcher` is 1579% slower than the solution from the PR. But the `calculate_rank` function actually expects lists of words, not strings, in its arguments. So, the correct timer definitions should look like this:
t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny'.split())",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny'.split())",
    "from __main__ import calculate_rank2\n")
With this change, the function based on `SequenceMatcher` is still slower, but only by 849%.
> the suggested solution works pretty well for most cases
Actually, I consider the remaining cases quite significant and not so rare. It is quite possible that Tribler has an almost perfect title match in the database, but it receives a heavy penalty for additional words at the end of the title and, as a result, is not included in the search result list. With the function implementation from the PR, such results will not be accidentally penalized, and users will see them. I expect this to be important for around 10% of search queries.
In other words, if a user searches for "The Big Buck Bunny 1080p" and the title in the database is "The Big Buck Bunny MP4 1080p 30FPS Eng", it should receive almost no penalty for the non-perfect match, and the algorithm from the PR takes that into account.
(branch force-pushed from 879e7b1 to feb9b04)
(branch force-pushed from 8e513e5 to f3881bf)
…ad of hardcoded values
…he `find_term` function
…emovePeers strategy
…ders_rank()` formula
…nk_range, test_torrent_rank_range
(branch force-pushed from 2c94374 to f5468f0)
(branch force-pushed from f5468f0 to c3df3a5)
This PR implements improved search with better item ranking at the DB level, as described in #2250 (comment). Elements with relevant titles and many seeders are placed closer to the top of the search result list.
The UI shows local search results immediately, without any artificial delay. Remote items appear automatically with a slight delay to avoid result flickering and have a highlighted background to prevent user confusion when remote results appear on the screen.
The search UI has a progress bar that shows the progress of remote searches. The tooltip of the progress bar displays the number of completed remote requests and the total number of remote requests. It is possible to click the progress bar to stop processing remote results if they take too long.
Remote requests have a 20-second timeout to be incorporated into the search results in the UI, and the actual progress of the progress bar takes into account both this hard timeout and the current progress of receiving remote results; the more results we have, the closer the progress bar is to the finish.
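As an illustration of that last point, a hypothetical sketch (not the actual Tribler GUI code) of a progress calculation that combines the elapsed fraction of the hard timeout with the fraction of completed remote requests could look like this:

# Hypothetical progress calculation (an illustration, not the Tribler implementation):
# the bar is driven by whichever is further along, the elapsed share of the 20-second
# hard timeout or the share of completed remote requests.
def search_progress(elapsed_seconds: float, completed_requests: int,
                    total_requests: int, timeout_seconds: float = 20.0) -> float:
    time_fraction = min(elapsed_seconds / timeout_seconds, 1.0)
    request_fraction = completed_requests / total_requests if total_requests else 1.0
    return max(time_fraction, request_fraction)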