
Improved ranking for search results and updated search UI without the artificial delay at the loading screen #7025

Merged
32 commits merged on Nov 4, 2022

Conversation

Contributor

@kozlovsky kozlovsky commented Aug 31, 2022

This PR implements improved search with better item ranking at the DB level, as described in #2250 (comment). Elements with relevant titles and many seeders are placed closer to the top of the search result list.

The UI shows local search results immediately, without any artificial delay. Remote items appear automatically with a slight delay to avoid result flickering and have a highlighted background to prevent user confusion when remote results appear on the screen.

The search UI has a progress bar that shows the progress of remote searches. The progress bar's tooltip displays the number of completed remote requests and the total number of remote requests. It is possible to click on the progress bar to stop processing remote results if they take too long.

Remote requests have a timeout of 20 seconds to be incorporated into the search results in the UI. The progress bar takes into account both this hard timeout and the current progress of receiving remote results: the more results we have, the closer the progress bar is to completion.
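
A rough illustration of how such a combined progress value could be computed (a hypothetical sketch with made-up names, not the actual Tribler implementation):

REMOTE_SEARCH_TIMEOUT = 20.0  # seconds, the hard timeout mentioned above

def remote_search_progress(elapsed: float, completed_requests: int, total_requests: int) -> float:
    # Progress driven by the hard timeout: reaches 1.0 when 20 seconds have passed.
    time_progress = min(elapsed / REMOTE_SEARCH_TIMEOUT, 1.0)
    # Progress driven by completed remote requests.
    request_progress = completed_requests / total_requests if total_requests else 1.0
    # Whichever is further along drives the bar, so it advances with incoming
    # results and is guaranteed to finish when the timeout expires.
    return max(time_progress, request_progress)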

@kozlovsky kozlovsky force-pushed the search_improvements branch 3 times, most recently from 62799fb to 416f917 on September 30, 2022 05:19
@kozlovsky kozlovsky marked this pull request as ready for review September 30, 2022 07:07
@kozlovsky kozlovsky requested review from a team and drew2a and removed request for a team September 30, 2022 07:07
@kozlovsky kozlovsky changed the title from "WIP improved search" to "Improved ranking for search results and updated search UI without the artificial delay at the loading screen" Sep 30, 2022
Contributor

drew2a commented Sep 30, 2022

I guess I've found a Qt-related bug:

  1. Open Tribler
  2. Search for "ubuntu"
  3. Click on a row

[screenshot]

@xoriole xoriole self-requested a review September 30, 2022 10:28
Contributor

@drew2a drew2a left a comment


Nice improvement! I added a couple of comments on the core logic.

I will check the GUI side right after the bugfix is completed.

@@ -1,13 +1,18 @@
 """
 Search utilities.
 
-Author(s): Jelle Roozenburg, Arno Bakker
+Author(s): Jelle Roozenburg, Arno Bakker, Alexander Kozlovsky
Contributor


:)

Contributor Author


We can remove the line with authors altogether. I added the name just because the original people mentioned here no longer have any relation to 85% of the code in this file :)

Contributor Author

I guess I've found a Qt-related bug:

@drew2a thanks, I was able to reproduce it on Mac. On Windows, it does not appear.

@synctext
Member

Ambition level is to wrap this up by the upcoming Friday.

@kozlovsky kozlovsky force-pushed the search_improvements branch 4 times, most recently from f981f8c to 879e7b1 on October 24, 2022 02:38
@kozlovsky
Contributor Author

@drew2a I fixed the bug you mentioned. While doing this, I changed the way the dynamically added search result items are highlighted.

In the UI, I want to highlight dynamically added results with a slightly different color so the user does not become disoriented when new rows suddenly shift the previous content of the search result list.

Initially, I painted just the title column with a slightly different background color for dynamically added results. As your screenshot shows, it looks pretty ugly when the row is selected.

To fix it, I first tried to use Qt roles for model data items. Qt::BackgroundRole should allow specifying different background brushes for different table rows, and it sounded like the right way to solve the task.

The QStyledItemDelegate that we use in Tribler has the initStyleOption method, which should allow the background brush to be customized.

Unfortunately, it turns out that all this Qt machinery with roles does not work if a Qt widget or its parent has a CSS style. With any CSS style specified, Qt completely ignores all roles from models. As we use CSS styling extensively in Tribler, we cannot use roles to paint different search result items differently and cannot specify different background colors for different table rows.

As a workaround, I switched to activating the hovered state for a short period for new items. That looks like the only way to specify different backgrounds for some items. In practice, the resulting highlighting looks quite good to me.
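
For reference, a minimal sketch of the role-based approach described above (PyQt5; the model class and attribute names are hypothetical, not the actual Tribler code):

from PyQt5.QtCore import Qt, QAbstractTableModel
from PyQt5.QtGui import QBrush, QColor

class SearchResultsModel(QAbstractTableModel):
    def __init__(self, rows, remote_rows, parent=None):
        super().__init__(parent)
        self.rows = rows                # list of lists with cell text
        self.remote_rows = remote_rows  # set of row indices that were added dynamically

    def rowCount(self, parent=None):
        return len(self.rows)

    def columnCount(self, parent=None):
        return len(self.rows[0]) if self.rows else 0

    def data(self, index, role=Qt.DisplayRole):
        if role == Qt.DisplayRole:
            return self.rows[index.row()][index.column()]
        if role == Qt.BackgroundRole and index.row() in self.remote_rows:
            # Qt is supposed to use this brush for dynamically added rows,
            # but the value is ignored once a CSS stylesheet is applied
            # to the view or any of its parents, as described above.
            return QBrush(QColor(66, 66, 80))
        return None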

@kozlovsky kozlovsky requested a review from drew2a October 24, 2022 03:34
Contributor

drew2a commented Oct 24, 2022

@kozlovsky the bug has been fixed. Confirmed. The UI looks good to me.

Note that dynamically added items are not grouped into the snippet:

[screenshot]

Contributor Author

kozlovsky commented Oct 24, 2022

@drew2a I suggest grouping remote results in a separate PR.

Comment on lines 112 to 213
def calculate_rank(query: List[str], title: List[str]) -> float:
    """
    Calculate the similarity of the title to the query as a float value in range [0, 1].
    """
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if there
        # are some title words before it, and then rotate the skipped title words to the end of the title. This way,
        # the least penalty got a title that has query words in the proper order at the beginning of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add penalty for skipped words in title before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that was not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)
Contributor


I am wondering... could calculate_rank be replaced by something simpler?

After some quick research, I prepared a test function that satisfies the @synctext requirements posted here:

def test_calculate_rank():
    # rule 2
    assert torrent_rank(query='Sintel', title='Sintel') > \
           torrent_rank(query='Sintel', title='the.script.from.the.movie.Sintel.pdf')

    # rule 3
    assert torrent_rank(query='Sintel', title='Sintel-Part-I') > \
           torrent_rank(query='Sintel', title='Part-of-Sintel')

    # rule 4
    assert torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy") > \
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story") >\
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story of Aaron Swartz")

    # rule 5
    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very Own Boy")

    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very special Boy person")

And I came up with a simple calculate_rank function that is probably good enough for the first iteration of the search improvement.

The new version:

def calculate_rank(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size

The old version:

def calculate_rank(query: List[str], title: List[str]) -> float:
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if there
        # are some title words before it, and then rotate the skipped title words to the end of the title. This way,
        # the least penalty got a title that has query words in the proper order at the beginning of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add penalty for skipped words in title before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that was not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)


def find_term(term: str, title: Deque[str]) -> Tuple[bool, int]:
    try:
        skipped = title.index(term)
    except ValueError:
        return False, 0

    title.rotate(-skipped)  # rotate skipped words to the end
    title.popleft()  # remove found word
    return True, skipped
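
For illustration, here is how find_term behaves on a small example (the values are derived from the code above; note that the function mutates the deque):

>>> from collections import deque
>>> title = deque(['making', 'of', 'big', 'buck', 'bunny'])
>>> find_term('big', title)  # 'big' is found after skipping two title words
(True, 2)
>>> title  # the skipped words were rotated to the end, and the found word removed
deque(['buck', 'bunny', 'making', 'of'])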

Contributor Author


There is a big practical difference in the results of these two functions. I will show it with practical examples a week later, when I return from my courses.

Contributor Author

@kozlovsky kozlovsky Nov 1, 2022


First, the suggested function does not work as expected.

Consider the usage of the original calculate_rank function.

>>> calculate_rank(["The", "Big", "Buck", "Bunny"], ["The", "Big", "Buck", "Bunny"])
1.0
>>> calculate_rank(["Bunny"], ["The", "Big", "Buck", "Bunny"])
0.7534246575342466

It is a bit easier to use the wrapping function title_rank, which is a lightweight wrapper around calculate_rank and returns the same values but accepts strings instead of lists:

>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny")  # the same rank as above, but easier to read
1.0
>>> title_rank("Bunny", "The Big Buck Bunny")
0.7534246575342466
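
For clarity, title_rank is roughly a wrapper of the following shape (a sketch of the idea; the actual implementation may tokenize the strings differently):

def title_rank(query: str, title: str) -> float:
    # Split both strings into word lists and delegate to calculate_rank;
    # the real function may use a smarter tokenizer, but the idea is the same.
    return calculate_rank(query.split(), title.split())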

The function in the PR returns a value in the range [0..1], where 1.0 means an exact match, and lower values mean a non-exact match.

If I rewrite the calculate_rank function as you are suggesting, then the results will be:

>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("Bunny", "The Big Buck Bunny")
2.1739130434782608

The result does not satisfy the requirements. The rank for the exact match is not 1.0 but some arbitrary large value:

>>> title_rank2("The Big Buck Bunny foo bar baz", "The Big Buck Bunny foo bar baz")
30.0

So, the meaning of the returned value is unclear to me here.

So, while I agree that it is possible to have an alternative function for rank calculation, the suggested one requires significant changes to be useful. The function implemented in the PR was already tested, and it works well.

Regarding the possibility of using difflib.SequenceMatcher as a base for implementing the rank function: it is possible, but the result will not be as good as with the function implemented in the PR. The implemented algorithm takes into account that different words have different weights in the final rank depending on their position. For torrent titles, this is crucial and significantly impacts the resulting order of sorted torrents. It is very common for a torrent title to have additional words at the end that were not mentioned in the search phrase: '1080', 'MP4', 'AVI', 'Eng', '2010', 'MKV', 'ZIP', '24FPS', etc. On the other hand, if there are additional words at the beginning or in the middle of the title, it usually means that the torrent is not a good match for the query, and its rank should receive a significant penalty.

For example:

>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny")  # exact match
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Mp4 Zip 24FPS")
0.9790209790209791  # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Making of The Big Buck Bunny")
0.823529411764706  # higher penalty for excess words at the beginning of the title

Now, to be able to compare with the suggested simpler algorithm based on difflib.SequenceMatcher, I put the same words at the beginning and the end of the title. The following is the result of the algorithm implemented in the PR:

>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny")
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
0.9790209790209791  # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
0.7567567567567567  # higher penalty for excess words at the beginning of the title

And this is the result for the suggested function:

>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
13.5  # same value if the excess title words are at the beginning or at the end
>>> title_rank2("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
13.5  # same value if the excess title words are at the beginning or at the end

You can see that the suggested function not only returns a dubious rank but also does not distinguish between excess words at the beginning and at the end of the title; the rank is the same value of 13.5 in both cases.

Based on this, I stand by the function implemented in the PR and believe that it solves the task adequately.

Contributor


If I understand you correctly, you have the following concerns about the suggested function:

  1. The output is not normalized to [0..1]
  2. It works worse for one corner case: "the excess words at the beginning and the end of the title".

About (1): yes, it is a draft. The output could easily be normalized (one possible normalization is sketched below); I didn't do it because it is not necessary for the prototype.

About (2): yes, but the suggested solution works pretty well for most cases and is extremely easy in terms of implementation.
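
One possible normalization of the suggested function to the [0..1] range, assuming a non-empty query (a sketch, not a tested implementation):

from difflib import SequenceMatcher
from typing import List

def calculate_rank2_normalized(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    # longest.size <= len(query) and quick_ratio() <= 1, so the result is in [0, 1];
    # an exact match gives quick_ratio() == 1.0 and longest.size == len(query), i.e. 1.0.
    return matcher.quick_ratio() * longest.size / len(query)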

Pros and cons in a single table:

                                  Original Solution    Suggested Solution
Johan Test
Speed                             1.4                  20.8
LOC                               100                  6
Kozlovsky Test (corner cases)

The speed test was performed with a test similar to the one you did before:

import timeit
from difflib import SequenceMatcher
from typing import List


def calculate_rank2(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size


t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from __main__ import calculate_rank2\n")

print('For the test, time 10k invocation of function, repeat 5 times')

function1_results = t1.repeat(repeat=5, number=10000)
print(f'Original function speed: {function1_results}')

function2_results = t2.repeat(repeat=5, number=10000)
print(f'Suggested function speed: {function2_results}')

# Use the fastest result from each benchmark as it is the least affected by other parallel tasks and so the most precise
slowdown = min(function2_results) * 100 / min(function1_results) - 100
print(f'The slowdown is {slowdown} percent')

To summarize the above, my opinion is that we should keep your solution despite its complexity and size 🤝

Contributor Author


Actually, the performance test requires a small fix. On my machine, the test shows that the function based on SequenceMatcher is 1579% slower than the solution from the PR. But the calculate_rank function actually expects lists of words, not strings, in its arguments. So, the correct timer definitions should look like this:

t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny'.split())",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny.split()')",
    "from __main__ import calculate_rank2\n")

With this change, the function based on SequenceMatcher is still slower, but only 849% slower.

Contributor Author

@kozlovsky kozlovsky Nov 2, 2022


the suggested solution works pretty well for most cases

Actually, I consider the remaining cases quite significant and not so rare. It is quite possible that Tribler has an almost perfect title match in the database, but it receives a heavy penalty for additional words at the end of the title and, as a result, is not included in the search result list. With the function implemented in the PR, such results will not be accidentally penalized, and users will see them. I expect this to be important for around 10% of search queries.

Contributor Author


In other words, if a user searches for "The Big Buck Bunny 1080p", and the title in the database is "The Big Buck Bunny MP4 1080p 30FPS Eng", the title should barely be penalized for the non-exact match, and the algorithm from the PR takes that into account.

@kozlovsky kozlovsky force-pushed the search_improvements branch from 879e7b1 to feb9b04 on November 1, 2022 08:56
@kozlovsky kozlovsky requested a review from drew2a November 1, 2022 17:29
@kozlovsky kozlovsky force-pushed the search_improvements branch from 8e513e5 to f3881bf on November 2, 2022 10:44
@kozlovsky kozlovsky force-pushed the search_improvements branch 3 times, most recently from 2c94374 to f5468f0 on November 4, 2022 19:14
@kozlovsky kozlovsky force-pushed the search_improvements branch from f5468f0 to c3df3a5 on November 4, 2022 19:30