multi-word exact matches at the start of a title aren't always first in the list of search results #466

Closed
harrisonpim opened this issue May 24, 2022 · 4 comments · Fixed by #475
Assignees
Labels
search relevance Tuning and improving ranking and relevance

Comments

harrisonpim commented May 24, 2022

Hi gang, search ranking question: I was trying to find the book "Information Law" by searching on those words: https://wellcomecollection.org/works?query=information+law. As you can see, the book doesn't show up until about halfway down the second page of hits, even though the words are an exact match for the first two words of the title. Is that to be expected (because the book's works page isn't very popular, or something)? If I add the author name to the search, it shows up as the only hit (the book is https://wellcomecollection.org/works/zkg7xqm7).

harrisonpim commented May 24, 2022

The current query seems to be giving high scores to matches in the search.relations field, which has no positional awareness. We don't currently prioritise proximity to the start of a title (or any other field).

harrisonpim commented May 24, 2022

The 'correct' result for this query can be retrieved in first position by adding a span_first query:

{
  "query": {
    "bool": {
      "should": [
        {
          "span_first": {
            "match": {
              "span_term": {
                "data.title.shingles": "information law"
              }
            },
            "end": 1,
            "boost": 1000
          }
        },
        {
          "multi_match": {
...
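
For anyone wanting to reproduce this from a script, here is a minimal Python sketch of how the span_first clause could be wrapped around an existing query. The helper name `add_title_prefix_boost` and its `existing_query` argument are hypothetical, and the `multi_match` in the usage is only a stand-in, not the prod query (which is elided above). The sketch just builds the request body as a dict; sending it to Elasticsearch is left out.

```python
import json


def add_title_prefix_boost(existing_query, phrase, field="data.title.shingles", boost=1000):
    """Wrap `existing_query` in a bool/should alongside a span_first clause.

    Hypothetical helper: it mirrors the structure of the query above, where
    `end: 1` restricts the span_term match to the first token position of `field`.
    """
    span_first = {
        "span_first": {
            "match": {"span_term": {field: phrase}},
            "end": 1,
            "boost": boost,
        }
    }
    return {"query": {"bool": {"should": [span_first, existing_query]}}}


# Placeholder for the existing query; NOT the real prod query.
placeholder = {"multi_match": {"query": "information law", "fields": ["data.title"]}}
body = add_title_prefix_boost(placeholder, "information law")
print(json.dumps(body, indent=2))
```

One thing to keep in mind: span_term is a term-level query and is not analysed, so the phrase has to match the indexed shingle token exactly (for example, already lowercased if the field's analyser lowercases).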

harrisonpim commented May 24, 2022

I'd like to investigate the effect of

  • turning down the boost on the relations field vs
  • adding the span_first to title.shingles

using the standard set of rank tools before making any changes to the prod query.

that requires:

  • replicating the prod works index into the rank cluster with CCR (cross-cluster replication)
  • adding a new precision test to the rank suite using this example
  • running the rank test suite from CLI on both candidates
  • running the rank speed comparison on both candidates
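
The new precision test could look something like the sketch below. This is not the rank suite's actual API: `passes_precision_test` and the result lists are made up for illustration. Only the query and the expected work ID (zkg7xqm7) come from this issue.

```python
def passes_precision_test(ranked_ids, expected_id, k=1):
    """Return True if `expected_id` appears in the top `k` of `ranked_ids`."""
    return expected_id in ranked_ids[:k]


# The example from this issue: searching "information law" should return
# the work zkg7xqm7 first. The other IDs below are invented placeholders.
EXPECTED_ID = "zkg7xqm7"

results_current = ["aaaaaaaa", "bbbbbbbb", EXPECTED_ID]    # hypothetical: buried in the list
results_candidate = [EXPECTED_ID, "aaaaaaaa", "bbbbbbbb"]  # hypothetical: ranked first

print(passes_precision_test(results_current, EXPECTED_ID))
print(passes_precision_test(results_candidate, EXPECTED_ID))
```

Running the same check against each candidate query's results would show whether turning down the relations boost, adding span_first, or both are enough to put the book in first position.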

@harrisonpim harrisonpim self-assigned this May 24, 2022
@harrisonpim harrisonpim added the search relevance Tuning and improving ranking and relevance label May 24, 2022
@pollecuttn

Issue first reported on Slack at https://wellcome.slack.com/archives/C8X9YKM5X/p1653293886888919

Repository owner moved this from In Progress to Done in Digital platform May 27, 2022
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Jun 10, 2022

2 participants