Update the how-to section of the docs for 7.0: #37717

Merged
merged 7 commits into from
Mar 12, 2019
Changes from 2 commits
7 changes: 0 additions & 7 deletions docs/reference/how-to/indexing-speed.asciidoc
@@ -114,13 +114,6 @@ The default is `10%` which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.

[float]
=== Disable `_field_names`

The <<mapping-field-names-field,`_field_names` field>> introduces some
index-time overhead, so you might want to disable it if you never need to
run `exists` queries.
Member

Just to confirm, this is removed because indexing this field doesn't have the index overhead it once did?

Contributor Author

Right. @colings86 made things better in #26930 and we now use doc values or norms to run queries most of the time, and only index into the _field_names field when doc values are not enabled (either because the user disabled them or because the field doesn't support doc values).


[float]
=== Additional optimizations

6 changes: 3 additions & 3 deletions docs/reference/how-to/recipes.asciidoc
@@ -3,9 +3,9 @@

This section includes a few recipes to help with common problems:

-* <<mixing-exact-search-with-stemming>>
-* <<consistent-scoring>>
+* <<mixing-exact-search-with-stemming,Mixing exact search with stemming>>
+* <<consistent-scoring,Getting consistent scores>>
+* <<static-scoring-signals,Incorporating static relevance signals into the score>>

include::recipes/stemming.asciidoc[]
include::recipes/scoring.asciidoc[]

125 changes: 123 additions & 2 deletions docs/reference/how-to/recipes/scoring.asciidoc
@@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.

If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
-(`index.number_of_shards: 1`). Then index statistics will be the same for all
-documents and scores will be consistent.
+(`index.number_of_shards: 1`), which is the default. Then index statistics
+will be the same for all documents and scores will be consistent.

Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
@@ -78,3 +78,124 @@ queries, beware that gathering statistics alone might not be cheap since all
terms have to be looked up in the terms dictionaries in order to look up
statistics.

[[static-scoring-signals]]
=== Incorporating static relevance signals into the score

Many domains have static signals that are known to be correlated with relevance.
For instance, https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length
are two features commonly used in web search to tune the score of web pages
independently of the query.

There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:
- <<query-dsl-script-score-query,script_score query>>
- <<query-dsl-feature-query,feature query>>

For instance imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.
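
To make the formula concrete, here is a minimal Python sketch of the static term and of the combined score. The function names `saturation` and `combined_score` and the sample BM25 score of `2.5` are illustrative assumptions, not part of Elasticsearch; the sketch only mirrors the arithmetic stated above.

```python
def saturation(value: float, pivot: float) -> float:
    """Return value / (pivot + value), which grows from 0 towards 1."""
    if value < 0 or pivot <= 0:
        raise ValueError("value must be >= 0 and pivot must be > 0")
    return value / (pivot + value)


def combined_score(bm25_score: float, pagerank: float, pivot: float = 10.0) -> float:
    """Add the saturated static signal to the textual relevance score."""
    return bm25_score + saturation(pagerank, pivot)


# Hypothetical BM25 score of 2.5 combined with a few pagerank values.
for pagerank in (0, 10, 90):
    print(pagerank, combined_score(2.5, pagerank))
```

The static contribution is bounded by 1, and a `pagerank` equal to the pivot (`10` here) contributes exactly `0.5`, so very large `pagerank` values cannot drown out the textual score.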

With the <<query-dsl-script-score-query,script_score query>> the query would
look like this:

//////////////////////////

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST

//////////////////////////

[source,js]
--------------------------------------------------
GET index/_search
{
  "query" : {
    "script_score" : {
      "query" : {
        "match": { "body": "elasticsearch" }
      },
      "script" : {
        "source" : "_score + rational(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
//TEST[continued]
<1> `pagerank` must be mapped as a <<number>>

while with the <<query-dsl-feature-query,feature query>> it would look like
below:

//////////////////////////

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "feature"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST

//////////////////////////

[source,js]
--------------------------------------------------
GET _search
{
  "query" : {
    "bool" : {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
<1> `pagerank` must be mapped as a <<feature,`feature`>> field

While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,script_score>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the <<query-dsl-feature-query,`feature` query>> only
exposes a couple of ways to incorporate static signals into the score. However,
it relies on the <<feature,`feature`>> and <<feature-vector,`feature_vector`>>
fields, which index values in a special way that allows the
<<query-dsl-feature-query,`feature` query>> to skip over non-competitive
documents and get the top matches of a query faster.
23 changes: 14 additions & 9 deletions docs/reference/how-to/search-speed.asciidoc
@@ -395,15 +395,6 @@ be able to cope with `max_failures` node failures at once at most, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
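
As a quick check of the formula above, the following Python sketch evaluates it for a few cluster shapes. The function name `recommended_replicas` and the sample cluster sizes are assumptions made for illustration only.

```python
import math


def recommended_replicas(num_nodes: int, num_primaries: int, max_failures: int) -> int:
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    if num_nodes < 1 or num_primaries < 1 or max_failures < 0:
        raise ValueError("num_nodes and num_primaries must be >= 1, max_failures >= 0")
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


# Hypothetical clusters: (num_nodes, num_primaries, max_failures).
for shape in ((3, 1, 1), (5, 3, 1), (4, 8, 2)):
    print(shape, recommended_replicas(*shape))
```

The first argument of `max` covers fault tolerance, while the second makes sure there are enough shard copies for every node to hold one when there are fewer primaries than nodes.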

[float]
=== Turn on adaptive replica selection

When multiple copies of data are present, elasticsearch can use a set of
criteria called <<search-adaptive-replica,adaptive replica selection>> to select
the best copy of the data based on response time, service time, and queue size
of the node containing each copy of the shard. This can improve query throughput
and reduce latency for search-heavy applications.
Member

Also just to confirm, this is removed because adaptive replica selection is now the default?

Contributor Author

Correct.


=== Tune your queries with the Profile API

You can also analyse how expensive each component of your queries and
@@ -419,3 +410,17 @@ Some caveats to the Profile API are that:
- the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
- given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
- the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause

=== Faster phrase queries with `index_phrases`

The <<text,`text`>> field has an <<index-phrases,`index_phrases`>> option that
pre-indexes 2-shingles at index-time and is automatically leveraged by query
Member

nit: just a matter of taste I guess, but I find "pre-indexes ... at index-time" is a bit redundant (especially the pre). Maybe "indexes 2-shingles" or "already adds 2-shingles at index-time" is enough?

parsers to run phrase queries that don't have a slop. If your use-case involves
running lots of phrase queries, this can speed up queries significantly.

=== Faster prefix queries with `index_prefixes`

The <<text,`text`>> field has an <<index-prefixes,`index_prefixes`>> option that
pre-indexes prefixes of all terms at index-time and is automatically leveraged
Member

nit: same here

by query parsers to run prefix queries. If your use-case involves running lots
of prefix queries, this can speed up queries significantly.
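
For reference, a mapping that turns on both options could look like the request body built by the Python sketch below. The field name `body` and the prefix lengths are assumptions chosen for illustration; the sketch only prints the JSON and does not talk to a cluster.

```python
import json

# Hypothetical mapping for an index whose `body` field serves many phrase
# and prefix queries. The field name and prefix lengths are illustrative.
mapping = {
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                # Index 2-shingles so that slop-free phrase queries run faster.
                "index_phrases": True,
                # Index term prefixes between min_chars and max_chars long.
                "index_prefixes": {"min_chars": 2, "max_chars": 5},
            }
        }
    }
}

# JSON body that could be sent when creating the index.
request_body = json.dumps(mapping, indent=2)
print(request_body)
```

Both options trade a larger index for faster queries, so they are worth enabling only on fields that actually receive phrase or prefix queries.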