Improved ranking for search results and updated search UI without the artificial delay at the loading screen #7025
Conversation
(branch force-pushed from 62799fb to 416f917)
Nice improvement! I added a couple of comments for the core logic.
I will check the GUI side immediately after the bugfix is completed.
src/tribler/core/components/metadata_store/remote_query_community/settings.py
@@ -1,13 +1,18 @@
 """
 Search utilities.

-Author(s): Jelle Roozenburg, Arno Bakker
+Author(s): Jelle Roozenburg, Arno Bakker, Alexander Kozlovsky
:)
We can remove the authors line altogether. I added the name just because the people originally mentioned here no longer have a relation to 85% of the code in this file :)
@drew2a thanks, I was able to reproduce it on Mac. On Windows, it does not appear.
(branch force-pushed from 63c86d8 to f30beae)
The ambition level is to wrap this up by the upcoming Friday.
(branch force-pushed from f981f8c to 879e7b1)
@drew2a I fixed the bug you mentioned. While doing this, I changed the way the dynamically added search result items are highlighted. In the UI, I want to highlight dynamically added results with a slightly different color so the user does not become disoriented when new rows suddenly shift the previous content of the search result list. Initially, I painted just the title column with a slightly different background color for dynamically added results. As your screenshot shows, it looks pretty ugly when the row is selected. To fix it, I first tried to use Qt roles for model data items.
Unfortunately, it turns out that all these Qt mechanics with roles do not work if a Qt widget or its parent has a CSS style. With any CSS style specified, Qt completely ignores all roles from models. As we extensively use CSS styling in Tribler, we cannot use roles to paint different search result items in different ways and cannot specify different background colors for different table rows. As a workaround, I switched to activating the hovered state for a short period for new items. That looks like the only way to specify a different background for some items. And in practice, the resulting highlighting looks quite good to me.
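For context, here is a minimal sketch (not Tribler code; it assumes PyQt5, and the model class and is_remote flag are hypothetical) of the role-based approach described above that had to be abandoned: the model returns a different background for remote rows via Qt.BackgroundRole, which the view ignores as soon as a stylesheet is applied.

# Minimal sketch of the abandoned role-based highlighting (assumption: PyQt5;
# SearchResultsModel and the is_remote flag are hypothetical, not Tribler code).
from PyQt5.QtCore import Qt, QAbstractListModel, QModelIndex
from PyQt5.QtGui import QBrush, QColor


class SearchResultsModel(QAbstractListModel):
    def __init__(self, items):
        super().__init__()
        self.items = items  # list of (title, is_remote) tuples

    def rowCount(self, parent=QModelIndex()):
        return len(self.items)

    def data(self, index, role=Qt.DisplayRole):
        title, is_remote = self.items[index.row()]
        if role == Qt.DisplayRole:
            return title
        if role == Qt.BackgroundRole and is_remote:
            # This brush is ignored by the view once a stylesheet is set on the
            # view or any of its parents, which is why the approach was dropped.
            return QBrush(QColor('#3a3a55'))
        return None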
@kozlovsky the bug has been fixed. Confirmed. The UI looks good to me. Note that dynamically added items are not grouped into the snippet:
@drew2a I suggest grouping remote results in a separate PR.
def calculate_rank(query: List[str], title: List[str]) -> float:
    """
    Calculate the similarity of the title to the query as a float value in the range [0, 1].
    """
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if
        # there are some title words before it, and then rotate the skipped title words to the end of the title.
        # This way, the least penalty goes to a title that has the query words in the proper order at the beginning
        # of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add a penalty for the skipped title words before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that were not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)
I am wondering... could `calculate_rank` be replaced by something simpler?
After some quick research, I prepared a test function that satisfies the @synctext requirements posted here:
def test_calculate_rank():
    # rule 2
    assert torrent_rank(query='Sintel', title='Sintel') > \
           torrent_rank(query='Sintel', title='the.script.from.the.movie.Sintel.pdf')

    # rule 3
    assert torrent_rank(query='Sintel', title='Sintel-Part-I') > \
           torrent_rank(query='Sintel', title='Part-of-Sintel')

    # rule 4
    assert torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy") > \
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story") > \
           torrent_rank(query="The Internet's Own Boy", title="The Internet's Own Boy: The Story of Aaron Swartz")

    # rule 5
    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very Own Boy")
    assert torrent_rank(query="Internet's Own Boy", title="Internet's Own Boy - Aaron Swartz") > \
           torrent_rank(query="Internet's Own Boy", title="Internet's very special Boy person")
And came up with a simple `calculate_rank` function that is probably good enough for the first iteration of the search improvement.
The new version:
from difflib import SequenceMatcher
from typing import List


def calculate_rank(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size
The old version:
def calculate_rank(query: List[str], title: List[str]) -> float:
    if not query:
        return 1.0

    if not title:
        return 0.0

    title = deque(title)
    total_error = 0
    for i, term in enumerate(query):
        # The first word is more important than the second word, and so on
        term_weight = POSITION_COEFF / (POSITION_COEFF + i)

        # Read the description of the `find_term` function to understand what is going on. Basically, we are trying
        # to find each query word in the title words, calculate the penalty if the query word is not found or if
        # there are some title words before it, and then rotate the skipped title words to the end of the title.
        # This way, the least penalty goes to a title that has the query words in the proper order at the beginning
        # of the title.
        found, skipped = find_term(term, title)
        if found:
            # if the query word is found in the title, add a penalty for the skipped title words before it
            total_error += skipped * term_weight
        else:
            # if the query word is not found in the title, add a big penalty for it
            total_error += MISSED_WORD_PENALTY * term_weight

    # a small penalty for excess words in the title that were not mentioned in the search phrase
    remainder_weight = 1 / (REMAINDER_COEFF + len(query))
    remained_words_error = len(title) * remainder_weight
    total_error += remained_words_error

    # a search rank should be between 1 and 0
    return RANK_NORMALIZATION_COEFF / (RANK_NORMALIZATION_COEFF + total_error)
def find_term(term: str, title: Deque[str]) -> Tuple[bool, int]:
    try:
        skipped = title.index(term)
    except ValueError:
        return False, 0

    title.rotate(-skipped)  # rotate skipped words to the end
    title.popleft()  # remove found word
    return True, skipped
There is a big practical difference in the results of these two functions. I will show it with practical examples a week later, when I return from the courses.
First, the suggested function does not work as expected.
Consider the usage of the original `calculate_rank` function.
>>> calculate_rank(["The", "Big", "Buck", "Bunny"], ["The", "Big", "Buck", "Bunny"])
1.0
>>> calculate_rank(["Bunny"], ["The", "Big", "Buck", "Bunny"])
0.7534246575342466
It is a bit easier to use the wrapping function `title_rank`, which is a lightweight wrapper around `calculate_rank` and returns the same values but accepts strings instead of lists:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny") # the same rank as above, but easier to read
1.0
>>> title_rank("Bunny", "The Big Buck Bunny")
0.7534246575342466
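For illustration, a minimal sketch of such a wrapper (an assumption; the actual Tribler implementation may differ) could simply split both strings into words and delegate to `calculate_rank`:

# Hypothetical sketch of the title_rank wrapper described above (not necessarily
# the exact Tribler implementation): split the strings and delegate the ranking.
def title_rank(query: str, title: str) -> float:
    return calculate_rank(query.split(), title.split())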
The function in the PR returns a value in the range [0..1], where 1.0 means an exact match, and lower values mean a non-exact match.
If I rewrite the `calculate_rank` function as you are suggesting, then the results will be:
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("Bunny", "The Big Buck Bunny")
2.1739130434782608
The result does not satisfy the requirements. The rank for the exact match is not 1.0 but some arbitrary big value:
>>> title_rank2("The Big Buck Bunny foo bar baz", "The Big Buck Bunny foo bar baz")
30.0
So, the meaning of the returned value is unclear to me here.
So, while I agree that it is possible to have an alternative function for rank calculation, the suggested one requires significant changes to be useful. The function implemented in the PR was already tested, and it works well.
Regarding the possibility of using `difflib.SequenceMatcher` as a base for implementing the rank function: it is possible, but the result will not be as good as for the function implemented in the PR. The implemented algorithm takes into account that different words have different weights in the final rank depending on their position. For torrent titles, this is crucial and should significantly impact the resulting order of sorted torrents. It is very common for a torrent title to have some additional words at the end that were not mentioned in the search phrase, like '1080', 'MP4', 'AVI', 'Eng', '2010', 'MKV', 'ZIP', '24FPS', etc. On the other hand, if there are additional words at the beginning or in the middle of the title, it usually means that the torrent is not a good match for the query, and its rank should receive a significant penalty.
For example:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny") # exact match
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Mp4 Zip 24FPS")
0.9790209790209791 # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Making of The Big Buck Bunny")
0.823529411764706 # higher penalty for excess words at the beginning of the title
Now, to be able to compare with the suggested simpler algorithm based on `difflib.SequenceMatcher`, I put the same words at the beginning and at the end of the title. The following is the result of the algorithm implemented in the PR:
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny")
1.0
>>> title_rank("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
0.9790209790209791 # low penalty for excess words at the end of the title
>>> title_rank("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
0.7567567567567567 # higher penalty for excess words at the beginning of the title
And this is the result for the suggested function:
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny")
18.0
>>> title_rank2("The Big Buck Bunny", "The Big Buck Bunny Foo Bar Baz")
13.5 # same value if the excess title words are at the beginning or at the end
>>> title_rank2("The Big Buck Bunny", "Foo Bar Baz The Big Buck Bunny")
13.5 # same value if the excess title words are at the beginning or at the end
You can see that the suggested function not only returns a dubious rank but also does not take into account the difference between excess words at the beginning and at the end of the title; the rank is the same value of 13.5 in both cases.
Based on this, I stand by the function implemented in the PR and believe that it solves the task adequately.
If I understood you correctly, you have the following concerns about the suggested function:
- The output is not normalized to [0..1].
- It works worse for one corner case: "excess words at the beginning vs. at the end of the title".
About (1) -- yes, it is a draft. The output could easily be normalized; I didn't do it because it is not necessary for a prototype.
About (2) -- yes, but the suggested solution works pretty well for most cases and is extremely easy in terms of implementation.
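As a side note on (1), a hypothetical way to normalize the SequenceMatcher-based rank to [0..1] (an illustration, not part of the PR) would be to scale the longest-match length by the query length instead of multiplying by its absolute size:

from difflib import SequenceMatcher
from typing import List


# Hypothetical normalized variant (not from the PR): quick_ratio() is already in
# [0..1], and longest.size / len(query_str) is too, so the product stays in
# [0..1] and is 1.0 for an exact match.
def calculate_rank2_normalized(query: List[str], title: List[str]) -> float:
    query_str = ' '.join(query)
    title_str = ' '.join(title)
    if not query_str:
        return 1.0
    matcher = SequenceMatcher(None, query_str, title_str, autojunk=False)
    longest = matcher.find_longest_match(0, len(query_str), 0, len(title_str))
    return matcher.quick_ratio() * longest.size / len(query_str)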
Pros and cons in a single table:

|                               | Original Solution | Suggested Solution |
|-------------------------------|-------------------|--------------------|
| Johan Test                    | ✅                | ✅                 |
| Speed                         | 1.4               | 20.8               |
| LOC                           | 100               | 6                  |
| Kozlovsky Test (corner cases) | ✅                | ❌                 |
The speed test was performed with a benchmark similar to the one you used before:
import timeit
from difflib import SequenceMatcher
from typing import List


def calculate_rank2(query: List[str], title: List[str]) -> float:
    title = ' '.join(title)
    query = ' '.join(query)
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    longest = matcher.find_longest_match(0, len(query), 0, len(title))
    return matcher.quick_ratio() * longest.size


t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny', 'Making of The Big Buck Bunny')",
    "from __main__ import calculate_rank2\n")

print('For the test, time 10k invocation of function, repeat 5 times')

function1_results = t1.repeat(repeat=5, number=10000)
print(f'Original function speed: {function1_results}')

function2_results = t2.repeat(repeat=5, number=10000)
print(f'Suggested function speed: {function2_results}')

# Use the fastest result from each benchmark as it is the least affected by other parallel tasks and so the most precise
slowdown = min(function2_results) * 100 / min(function1_results) - 100
print(f'The slowdown is {slowdown} percent')
To summarize the above, my opinion is that we should keep your solution despite its complexity and size 🤝
Actually, the performance test requires a small fix. On my machine, the test shows that the function based on `SequenceMatcher` is 1579% slower than the solution from the PR. But the `calculate_rank` function actually expects lists of words, not strings, in its arguments. So, the correct timer definitions should look like this:
t1 = timeit.Timer(
    "calculate_rank('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny'.split())",
    "from tribler.core.utilities.search_utils import calculate_rank\n")

t2 = timeit.Timer(
    "calculate_rank2('The Big Buck Bunny'.split(), 'Making of The Big Buck Bunny'.split())",
    "from __main__ import calculate_rank2\n")
With this change, the function based on `SequenceMatcher` is still slower, but only by 849%.
> the suggested solution works pretty well for most cases
Actually, I consider the remaining cases quite significant and not so rare. It is quite possible that Tribler has an almost perfect title match in the database, but it receives a heavy penalty for additional words at the end of the title and, as a result, is not included in the search result list. With the function implementation from the PR, such results will not be accidentally penalized, and users will see them. I expect this to be important for around 10% of search queries.
In other words, if a user searches for "The Big Buck Bunny 1080p" and the title in the database is "The Big Buck Bunny MP4 1080p 30FPS Eng", it should receive almost no penalty for the non-perfect match, and the algorithm from the PR takes that into account.
(branch force-pushed from 879e7b1 to feb9b04)
(branch force-pushed from 8e513e5 to f3881bf)
…ad of hardcoded values
…he `find_term` function
…emovePeers strategy
…ders_rank()` formula
…nk_range, test_torrent_rank_range
(branch force-pushed from 2c94374 to f5468f0)
(branch force-pushed from f5468f0 to c3df3a5)
This PR implements improved search with better item ranking at the DB level, as described in #2250 (comment). Elements with relevant titles and many seeders are placed closer to the top of the search result list.
The UI shows local search results immediately, without any artificial delay. Remote items appear automatically with a slight delay to avoid result flickering and have a highlighted background to prevent user confusion when remote results appear on the screen.
The search UI has a progress bar that shows the progress of remote searches. The tooltip of the progress bar displays the number of completed remote requests and the total number of remote requests. It is possible to click the progress bar to stop processing remote results if they take too long.
Remote requests have a 20-second timeout to be incorporated into the search results in the UI, and the actual progress of the progress bar takes into account both this hard timeout and the current progress of receiving remote results; the more results we have, the closer the progress bar is to the finish.
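As an illustration of that last point, a hypothetical sketch (not the actual Tribler GUI code) of a progress calculation that combines the elapsed fraction of the hard timeout with the fraction of completed remote requests could look like this:

# Hypothetical progress calculation (an illustration, not the Tribler implementation):
# the bar is driven by whichever is further along, the elapsed share of the 20-second
# hard timeout or the share of completed remote requests.
def search_progress(elapsed_seconds: float, completed_requests: int,
                    total_requests: int, timeout_seconds: float = 20.0) -> float:
    time_fraction = min(elapsed_seconds / timeout_seconds, 1.0)
    request_fraction = completed_requests / total_requests if total_requests else 1.0
    return max(time_fraction, request_fraction)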