Commit

Add a comment to the calculate_rank function and a description to the `find_term` function
kozlovsky committed Oct 13, 2022
1 parent 6ae61c0 commit f30beae
Showing 1 changed file with 68 additions and 0 deletions.
68 changes: 68 additions & 0 deletions src/tribler/core/utilities/search_utils.py
@@ -125,6 +125,10 @@ def calculate_rank(query: List[str], title: List[str]) -> float:
# The first word is more important than the second word, and so on
term_weight = POSITION_COEFF / (POSITION_COEFF + i)

# See the docstring of the `find_term` function for details. In short, we look for each query word among the
# title words, add a penalty if the query word is not found or if some title words precede it, and then rotate
# the skipped title words to the end of the title. This way, a title that has the query words in the proper
# order at its very beginning receives the smallest penalty.
found, skipped = find_term(term, title)
if found:
# if the query word is found in the title, add penalty for skipped words in title before it
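
The position weighting in this hunk can be illustrated with a small sketch. The value of `POSITION_COEFF` below is an assumption chosen for illustration, since the diff does not show its definition, and `term_weights` is a hypothetical helper, not part of the commit.

```python
from typing import List

# Assumed value for illustration only; the diff does not show how POSITION_COEFF is defined.
POSITION_COEFF = 5


def term_weights(query_length: int) -> List[float]:
    """Weight of each query word by its 0-based position i, using the formula from the hunk."""
    return [POSITION_COEFF / (POSITION_COEFF + i) for i in range(query_length)]


# The first word gets weight 1.0, and each later word gets a slightly smaller weight.
weights = term_weights(3)
```

With this assumed coefficient the weights decrease slowly, so later query words still contribute but count for less than earlier ones.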
@@ -146,6 +150,70 @@ def find_term(term: str, title: Deque[str]) -> Tuple[bool, int]:
"""
Finds the query word in the title.
Returns whether it was found or not and the number of skipped words in the title.
This is a helper function to efficiently answer the question of how close a query string and a title string are,
taking into account the ordering of words in both strings.
The `term` parameter is a word from a search string. It is called `term` and not `word` because it can also be
a stemmed version of the word if the comparison algorithm implemented in the top-level `torrent_rank` function
works with stemmed words. The ability to work with stemmed words was added to `torrent_rank` and then removed,
as it currently does not give significant benefits, but it can be added again in the future.
The `title` parameter is a deque of words from the torrent title. It can also be a deque of stemmed words
if the `torrent_rank` function supports stemming.
The `find_term` function returns a boolean indicating whether the term was found in the `title` deque and
the number of skipped leading terms in the `title` deque. It also modifies the `title` deque in place by
removing the first occurrence of the found term and rotating all leading non-matching terms to the end of the deque.
An example: find_term('A', deque(['X', 'Y', 'A', 'B', 'C'])) returns `(True, 2)`, where True means that
the term 'A' was found in the `title` deque, and 2 is the number of skipped terms ('X', 'Y'). It also modifies
the `title` deque, which becomes deque(['B', 'C', 'X', 'Y']). The found term 'A' was removed, and
the leading non-matching terms ('X', 'Y') were moved to the end of the deque.
The examples below show how the function can be used. Call it once for each word of the query, passing the same
`title` deque each time, and observe:
- how many query words are missed in the title;
- how many excess or out-of-place title words are found before each query word;
- and how many title words are not mentioned in the query.
Example 1, query "A B C", title "A B C":
find_term("A", deque(["A", "B", "C"])) -> (found=True, skipped=0, rest=deque(["B", "C"]))
find_term("B", deque(["B", "C"])) -> (found=True, skipped=0, rest=deque(["C"]))
find_term("C", deque(["C"])) -> (found=True, skipped=0, rest=deque([]))
Conclusion: exact match.
Example 2, query "A B C", title "A B C D":
find_term("A", deque(["A", "B", "C", "D"])) -> (found=True, skipped=0, rest=deque(["B", "C", "D"]))
find_term("B", deque(["B", "C", "D"])) -> (found=True, skipped=0, rest=deque(["C", "D"]))
find_term("C", deque(["C", "D"])) -> (found=True, skipped=0, rest=deque(["D"]))
Conclusion: minor penalty for one excess word in the title that is not in the query.
Example 3, query "A B C", title "X Y A B C":
find_term("A", deque(["X", "Y", "A", "B", "C"])) -> (found=True, skipped=2, rest=deque(["B", "C", "X", "Y"]))
find_term("B", deque(["B", "C", "X", "Y"])) -> (found=True, skipped=0, rest=deque(["C", "X", "Y"]))
find_term("C", deque(["C", "X", "Y"])) -> (found=True, skipped=0, rest=deque(["X", "Y"]))
Conclusion: major penalty for skipping two words at the beginning of the title plus a minor penalty for two
excess words in the title that are not in the query.
Example 4, query "A B C", title "A B X Y C":
find_term("A", deque(["A", "B", "X", "Y", "C"])) -> (found=True, skipped=0, rest=deque(["B", "X", "Y", "C"]))
find_term("B", deque(["B", "X", "Y", "C"])) -> (found=True, skipped=0, rest=deque(["X", "Y", "C"]))
find_term("C", deque(["X", "Y", "C"])) -> (found=True, skipped=2, rest=deque(["X", "Y"]))
Conclusion: average penalty for skipping two words in the middle of the title plus a minor penalty for two
excess words in the title that are not in the query.
Example 5, query "A B C", title "A C B":
find_term("A", deque(["A", "C", "B"])) -> (found=True, skipped=0, rest=deque(["C", "B"]))
find_term("B", deque(["C", "B"])) -> (found=True, skipped=1, rest=deque(["C"]))
find_term("C", deque(["C"])) -> (found=True, skipped=0, rest=deque([]))
Conclusion: average penalty for skipping one word in the middle of the title.
Example 6, query "A B C", title "A C X":
find_term("A", deque(["A", "C", "X"])) -> (found=True, skipped=0, rest=deque(["C", "X"]))
find_term("B", deque(["C", "X"])) -> (found=False, skipped=0, rest=deque(["C", "X"]))
find_term("C", deque(["C", "X"])) -> (found=True, skipped=0, rest=deque(["X"]))
Conclusion: huge penalty for missing one query word plus a minor penalty for one excess title word.
"""
try:
skipped = title.index(term)
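
The diff shows only the first two lines of the `find_term` body. The following is a minimal sketch of the rest, reconstructed from the docstring (remove the found term, rotate the skipped leading terms to the end); it is an assumption and not necessarily the code in the repository. The `trace` helper is hypothetical and exists only to replay the docstring examples.

```python
from collections import deque
from typing import Deque, List, Tuple


def find_term(term: str, title: Deque[str]) -> Tuple[bool, int]:
    """Sketch reconstructed from the docstring; the real function body may differ."""
    try:
        skipped = title.index(term)
    except ValueError:
        # The term is missing: report it and leave the deque untouched.
        return False, 0
    title.rotate(-skipped)  # move the skipped leading words to the end
    title.popleft()  # remove the matched word
    return True, skipped


def trace(query: List[str], title: List[str]) -> Tuple[int, int, int]:
    """Hypothetical helper: (missed query words, total skipped words, leftover title words)."""
    remaining = deque(title)
    missed = 0
    skipped_total = 0
    for term in query:
        found, skipped = find_term(term, remaining)
        if found:
            skipped_total += skipped
        else:
            missed += 1
    return missed, skipped_total, len(remaining)
```

For instance, `trace("A B C".split(), "X Y A B C".split())` gives `(0, 2, 2)`, matching Example 3: no missed query words, two skipped title words, and two leftover title words.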