Research: Elasticsearch relevancy scores #5842

Closed
2 of 4 tasks
patphongs opened this issue May 22, 2024 · 4 comments

@patphongs

patphongs commented May 22, 2024

What we’re after

Research relevancy scores and how they work. Some research has been documented by Mark at GSA that could better inform this effort.

Related ticket(s)

  • Future related tickets when we discuss new tickets for relevancy sort later today.

Action item(s)

  • Look at cloud.gov research
  • Is there any overlap with proximity searches? That's another feature requested by legal users, e.g. within a paragraph, within a sentence, within 5 words, etc. (Elasticsearch: intervals queries)

Completion criteria

  • Research has been documented in this ticket and followup tickets have been created
@JonellaCulmer JonellaCulmer changed the title Elasticsearch relevancy scores Research: Elasticsearch relevancy scores Jul 23, 2024
@JonellaCulmer JonellaCulmer moved this to 📥 Assigned in Website project Jul 23, 2024
@JonellaCulmer JonellaCulmer added this to the 25.6 milestone Jul 23, 2024
@tmpayton

tmpayton commented Jul 31, 2024

From my research I was able to determine that we can implement a relevancy sort and that it would be relatively low effort with no re-index needed.
 
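I discovered that _score was null because we implemented a custom sort: when a search sorts on other fields, Elasticsearch does not compute scores unless _score is part of the sort or track_scores is enabled. My test branch contains the changes that would need to be made.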
 
For documents with equal scores, Elasticsearch falls back to index order (most recent first). However, we can also specify our own secondary sort.
 
The relevance score does not take term proximity into account. To support proximity searches we would have to implement an intervals query.
 

How Relevance Scores Work 

Before scoring, Elasticsearch limits the set of candidate documents by applying a boolean test so that it only includes documents that match the query. After that, a score is calculated for each document in this set using the BM25 algorithm.

The BM25 algorithm considers:
 

  1. Term frequency
    • The more times a search term appears in the searched field of a document, the more relevant that document is.
  2. Inverse document frequency
    • The more documents that contain a search term in the searched field, the less important that term is (that is, whether the term is common or rare).
  3. Document length normalization (field length)
    • A document that contains a search term in a very short field is more likely to be relevant than a document that contains the term in a very long field. This prevents longer documents from being unfairly ranked higher.
  4. Saturation
    • BM25 includes a saturation parameter (k1) that limits the impact of term frequency. After a certain point, additional occurrences of a term have a diminishing effect on the relevance score.
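
The four factors above can be sketched for a single term in a single field. This is a minimal illustration, assuming the formula used by recent Lucene versions (as it typically appears in Explain API output) and the Elasticsearch default parameters k1 = 1.2 and b = 0.75. The function name and arguments are hypothetical and not part of our codebase.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75, boost=1.0):
    """Approximate BM25 score of one term in one field of one document.

    freq           -- occurrences of the term in the field (term frequency)
    doc_len        -- length of the field in this document, in terms
    avg_doc_len    -- average length of the field across the index
    doc_count      -- number of documents that have the field
    docs_with_term -- number of documents whose field contains the term
    """
    # Inverse document frequency: rare terms get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: b controls how strongly long fields are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating term frequency: approaches 1 as freq grows, never exceeds it.
    tf = freq / (freq + length_norm)
    return boost * idf * tf


if __name__ == "__main__":
    # Same term and index statistics, increasing term frequency.
    for freq in (1, 2, 5, 10, 50):
        score = bm25_term_score(freq, doc_len=100, avg_doc_len=100,
                                doc_count=1000, docs_with_term=10)
        print(freq, round(score, 4))
```

Running the loop shows the score rising quickly for the first few occurrences and then flattening out, which is the saturation behavior described in item 4. A multi-term query sums the per-term scores, which is also why there is no fixed maximum.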
       

There is no maximum for a relevance score; however, we can set a min_score to filter out less relevant results.
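
As a rough sketch of how min_score and a relevancy sort with a secondary sort might fit into a search body, the helper below builds the request as a plain Python dict. The helper name, the threshold value, and the tie-breaker field name are assumptions for illustration only; the actual changes are in the test branch.

```python
def build_relevance_search(query, min_score=None, tie_breaker_field=None):
    """Build a search body sorted by relevance score, highest first.

    query             -- any Elasticsearch query clause, as a dict
    min_score         -- optional threshold; hits scoring below it are dropped
    tie_breaker_field -- optional (hypothetical) field used as a secondary sort
    """
    # Sorting on _score explicitly means scores are computed and returned.
    body = {"query": query, "sort": [{"_score": {"order": "desc"}}]}
    if tie_breaker_field:
        # Secondary sort applied only when two hits have the same score.
        body["sort"].append({tie_breaker_field: {"order": "desc"}})
    if min_score is not None:
        body["min_score"] = min_score
    return body


if __name__ == "__main__":
    example = build_relevance_search(
        {"simple_query_string": {"query": "reasons to believe", "fields": ["summary"]}},
        min_score=1.0,                  # assumed threshold, would need tuning
        tie_breaker_field="some_date",  # hypothetical field name
    )
    print(example)
```

Because scores have no fixed upper bound and vary by query, any min_score value would need to be tuned against real queries rather than chosen up front.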
 

How Nested Relevance Scores Work 

Elasticsearch calculates a relevancy score for each nested document that matches the nested query, in much the same way it would score a regular document. After computing the scores for the nested documents, Elasticsearch combines them to produce a final score for the parent document.
 
The score_mode parameter determines how to combine the scores of nested documents. The options include:
 

  • avg (default)
    • The final score is the average score of all matching nested documents.
  • sum
    • The final score is the sum of the scores of all matching nested documents.
  • max
    • The final score is the highest score among the matching nested documents.
  • min
    • The final score is the lowest score among the matching nested documents.
  • none
    • The scores of matching nested documents are not used. The query assigns parent documents a score of 0.
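
To make the options above concrete, here is a small sketch that mimics how each score_mode would combine the scores of the matching nested documents. It only models the arithmetic described in the list; the function name is hypothetical and this is not how Elasticsearch is implemented internally.

```python
def combine_nested_scores(scores, score_mode="avg"):
    """Combine the scores of matching nested documents into a parent score."""
    if not scores:
        raise ValueError("a parent only matches if at least one nested document matches")
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    if score_mode == "none":
        # Nested scores are ignored entirely.
        return 0.0
    raise ValueError(f"unknown score_mode: {score_mode}")


if __name__ == "__main__":
    nested_scores = [1.0, 2.0, 3.0]  # made-up scores for three matching nested documents
    for mode in ("avg", "sum", "max", "min", "none"):
        print(mode, combine_nested_scores(nested_scores, mode))
```

One practical consequence: with sum, a parent with many weakly matching nested documents can outrank a parent with a single strong match, while max ranks parents by their best nested hit.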

Here is an example of using max, avg, and sum score_modes.
 

Examples

 
My branch includes some sample output from queries with multiple words and one with same scores.
 
I tested multi-word queries with and without quotes and saw that the score changes depending on whether the query matches the exact phrase or the individual terms.
 
I also tested what happens when scores are the same, and observed that it correctly sorts them by index order.
 
For reference, my branch is based off of Corinne's branch that switches to simple_query_string from query_string, switches case_respondents to simple_query_string, and adds a second filter q_exclude.
 

Also, when testing locally, we can use the Explain API to determine exactly how a document got its score.
 
Here is an example of the curl request to the explain API:
 

curl -X GET "localhost:9200/ao_index/_explain/2023-01?pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "bool": {
         "must": [
            {
               "term": {
                  "type": "advisory_opinions"
               }
            },
            {
               "bool": {
                  "minimum_should_match": 1
               }
            }
         ],
         "should": [
            {
               "nested": {
                  "path": "documents",
                  "inner_hits": {
                     "_source": false,
                     "highlight": {
                        "require_field_match": false,
                        "fields": {
                           "documents.text": {},
                           "documents.description": {}
                        }
                     },
                     "size": 100
                  },
                  "query": {
                     "bool": {
                        "must": [
                           {
                              "simple_query_string": {
                                 "query": "(\"Reasons to believe\")",
                                 "fields": [
                                    "documents.text"
                                 ]
                              }
                           }
                        ]
                     }
                  }
               }
            },
            {
               "simple_query_string": {
                  "query": "(\"Reasons to believe\")",
                  "fields": [
                     "no",
                     "name",
                     "summary"
                  ]
               }
            }
         ],
         "minimum_should_match": 1
      }
   }
}
'
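
The Explain API response can be deeply nested, so a small helper for reading it may be useful when testing locally. The sketch below assumes the usual response shape, where the top-level "explanation" object has "value", "description", and a "details" list of child objects with the same shape. The sample input is hand-written for illustration and is not real output from our index.

```python
def flatten_explanation(node, depth=0):
    """Turn an explanation tree into indented 'value  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Hand-written sample in the shape of response["explanation"]; values are made up.
    sample = {
        "value": 2.5,
        "description": "sum of:",
        "details": [
            {"value": 1.5, "description": "score from first clause", "details": []},
            {"value": 1.0, "description": "score from second clause", "details": []},
        ],
    }
    print("\n".join(flatten_explanation(sample)))
```

In practice the input would be the "explanation" key of the JSON returned by the curl request above.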

@tmpayton

Here is an example of the scoring algorithm for a MUR

@tmpayton

tmpayton commented Aug 2, 2024

I created this graphic that better explains how the relevancy scores are calculated

[Image: graphic explaining how the relevancy scores are calculated]

@tmpayton

tmpayton commented Aug 2, 2024

If we wanted to sort by the document_hit_score, we would have to use a workaround, as this is not directly supported by Elasticsearch. We could use scripted sorting, for instance:

{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "bool": {
          "must": [
            { "match": { "nested_field.some_field": "value" } }
          ]
        }
      }
    }
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "source": "doc['nested_field.some_numeric_field'].value",
          "lang": "painless"
        },
        "order": "desc"
      }
    }
  ]
} 
