Term Stats documentation
afoucret committed Oct 30, 2024
1 parent c6f7827 commit 4a89e6e
Showing 4 changed files with 110 additions and 22 deletions.
13 changes: 10 additions & 3 deletions docs/reference/query-dsl/script-score-query.asciidoc
@@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`.
===== Use relevance scores in a script

Within a script, you can
{ref}/modules-scripting-fields.html#scripting-score[access]
the `_score` variable which represents the current relevance score of a
document.

[[script-score-access-term-statistics]]
===== Use term statistics in a script

Within a script, you can
{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
the `_termStats` variable, which provides statistical information about the terms used in the child query of the `script_score` query.
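
As a minimal sketch of what such a request can look like, the following Python snippet builds a search body whose script reads `_termStats`. The helper name `term_stats_script_score` and the `title` field are hypothetical and only serve the illustration; the script source reuses the `matchedTermsCount()` function documented for `_termStats`.

```python
import json


def term_stats_script_score(field: str, text: str, source: str) -> dict:
    """Build a search body for a script_score query (hypothetical helper)."""
    return {
        "query": {
            "script_score": {
                # Child query: its field and terms are what _termStats describes.
                "query": {"match": {field: text}},
                # Painless script that reads the _termStats variable.
                "script": {"source": source},
            }
        }
    }


body = term_stats_script_score(
    "title", "quick brown fox", "_termStats.matchedTermsCount()"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against an index that has the assumed `title` field.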

[[script-score-predefined-functions]]
===== Predefined functions
You can use any of the available {painless}/painless-contexts.html[painless
@@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field.

[[decay-functions-numeric-fields]]
====== Decay functions for numeric fields
You can read more about decay functions
{ref}/query-dsl-function-score-query.html#function-decay[here].

* `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
@@ -233,7 +240,7 @@ The `script_score` query calculates the score for
every matching document, or hit. There are faster alternative query types that
can efficiently skip non-competitive hits:

* If you want to boost documents on some static fields, use the
<<query-dsl-rank-feature-query, `rank_feature`>> query.
* If you want to boost documents closer to a date or geographic point, use the
<<query-dsl-distance-feature-query, `distance_feature`>> query.
37 changes: 25 additions & 12 deletions docs/reference/reranking/learning-to-rank-model-training.asciidoc
@@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
from eland.ml.ltr import QueryFeatureExtractor
feature_extractors=[
# We want to use the score of the match query for the title field as a feature:
# We want to use the BM25 score of the match query for the title field as a feature:
QueryFeatureExtractor(
feature_name="title_bm25",
query={"match": {"title": "{{query}}"}}
),
# We want to use the number of matched terms in the title field as a feature:
QueryFeatureExtractor(
feature_name="title_matched_term_count",
query={
"script_score": {
"query": {"match": {"title": "{{query}}"}},
"script": {"source": "return _termStats.matchedTermsCount();"},
}
},
),
# We can use a script_score query to get the value
# of the field rating directly as a feature:
QueryFeatureExtractor(
@@ -54,26 +64,29 @@ feature_extractors=[
}
},
),
# We can execute a script on the value of the query
# and use the return value as a feature:
QueryFeatureExtractor(
feature_name="query_length",
# We extract the number of unique terms in the query as a feature.
QueryFeatureExtractor(
feature_name="query_term_count",
query={
"script_score": {
"query": {"match_all": {}},
"script": {
"source": "return params['query'].splitOnToken(' ').length;",
"params": {
"query": "{{query}}",
}
},
"query": {"match": {"title": "{{query}}"}},
"script": {"source": "return _termStats.uniqueTermsCount();"},
}
},
),
]
----
// NOTCONSOLE

[NOTE]
.Term statistics as features
===================================================
LTR models commonly use raw term statistics as features.
To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
===================================================
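
One possible way to derive several term-statistics features at once is sketched below. It only builds the templated `script_score` queries as plain dictionaries, so it runs without `eland`. The helper name `term_stat_feature_queries` and the feature naming scheme are hypothetical; the `_termStats` function names and the `get*` methods are the ones described in the term statistics documentation.

```python
from itertools import product

# Functions of _termStats that return aggregated statistics.
STAT_FUNCTIONS = ["docFreq", "totalTermFreq", "termFreq"]
# Aggregation methods available on the returned statistics objects.
AGGREGATES = ["getMin", "getMax", "getAverage", "getSum"]


def term_stat_feature_queries(field: str) -> dict:
    """Map a (hypothetical) feature name to a templated script_score query."""
    features = {}
    for func, agg in product(STAT_FUNCTIONS, AGGREGATES):
        # Example name: "title_docFreq_min" (naming scheme is an assumption).
        name = f"{field}_{func}_{agg[3:].lower()}"
        features[name] = {
            "script_score": {
                # "{{query}}" is the template placeholder used in the listing above.
                "query": {"match": {field: "{{query}}"}},
                "script": {"source": f"return _termStats.{func}().{agg}();"},
            }
        }
    return features


for feature_name in term_stat_feature_queries("title"):
    print(feature_name)
```

Each dictionary value has the same shape as the `query` argument passed to `QueryFeatureExtractor` in the listing above, so it could be supplied there together with the generated name as `feature_name`.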

Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:

[source,python]
@@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
====== Negative scores

Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.

[discrete]
[[learning-to-rank-rescorer-limitations-term-statistics]]
====== Term statistics as features

We do not currently support term statistics as features, however future releases will introduce this capability.

75 changes: 75 additions & 0 deletions docs/reference/scripting/fields.asciidoc
@@ -80,6 +80,81 @@ GET my-index-000001/_search
}
-------------------------------------

[discrete]
[[scripting-term-statistics]]
=== Accessing term statistics of a document within a script

Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.

In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:

[source,console]
-------------------------------------
PUT my-index-000001/_doc/1?refresh
{
  "text": "quick brown fox"
}

PUT my-index-000001/_doc/2?refresh
{
  "text": "quick fox"
}

GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query": { <1>
        "match": {
          "text": "quick brown fox"
        }
      },
      "script": {
        "source": "_termStats.termFreq().getAverage()" <2>
      }
    }
  }
}
-------------------------------------

<1> Child query used to infer the field and the terms considered in term statistics.

<2> The script calculates the average term frequency for the terms in the query using `_termStats`.

`_termStats` provides access to the following functions for working with term statistics:

- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.

[NOTE]
.Functions returning aggregated statistics
===================================================
The `docFreq`, `termFreq`, and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.

These statistics objects support the following methods:

- `getAverage()`: Returns the average value of the metric.
- `getMin()`: Returns the minimum value of the metric.
- `getMax()`: Returns the maximum value of the metric.
- `getSum()`: Returns the sum of the metric values.
- `getCount()`: Returns the count of terms included in the metric calculation.
===================================================
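
To make the semantics of these functions concrete, here is a toy Python emulation over the two example documents. It is not how Elasticsearch computes the values. It assumes a naive whitespace tokenizer, and it assumes that a query term absent from the current document contributes a frequency of 0 to `termFreq`; neither assumption is confirmed by the text above. The names `ToyStats` and `toy_term_stats` are hypothetical.

```python
class ToyStats:
    """Toy stand-in for the statistics object returned by docFreq() and friends."""

    def __init__(self, values):
        self.values = list(values)

    def get_count(self):
        return len(self.values)

    def get_sum(self):
        return sum(self.values)

    def get_min(self):
        return min(self.values)

    def get_max(self):
        return max(self.values)

    def get_average(self):
        return self.get_sum() / self.get_count()


def toy_term_stats(query: str, doc: str, corpus: list) -> dict:
    """Emulate the _termStats functions for one document (whitespace tokens only)."""
    terms = list(dict.fromkeys(query.split()))  # unique terms, order preserved
    doc_tokens = doc.split()
    corpus_tokens = [d.split() for d in corpus]
    return {
        "uniqueTermsCount": len(terms),
        "matchedTermsCount": sum(1 for t in terms if t in doc_tokens),
        # Number of documents containing each term.
        "docFreq": ToyStats(sum(1 for d in corpus_tokens if t in d) for t in terms),
        # Occurrences of each term across the whole corpus.
        "totalTermFreq": ToyStats(sum(d.count(t) for d in corpus_tokens) for t in terms),
        # Occurrences of each term in the current document (0 if absent: assumption).
        "termFreq": ToyStats(doc_tokens.count(t) for t in terms),
    }


corpus = ["quick brown fox", "quick fox"]
stats = toy_term_stats("quick brown fox", corpus[1], corpus)
print(stats["uniqueTermsCount"], stats["matchedTermsCount"])
print(stats["docFreq"].values, stats["docFreq"].get_sum())
print(stats["termFreq"].values, stats["termFreq"].get_average())
```

Under these assumptions, the second document (`quick fox`) matches two of the three unique query terms, while the `docFreq` and `totalTermFreq` values do not depend on which document is being scored.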


[NOTE]
.Painless language required
===================================================
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
===================================================

[discrete]
[[modules-scripting-doc-vals]]