From 74309933e347012855f3e8cc770116ff9b3586b7 Mon Sep 17 00:00:00 2001 From: Wylie Conlon Date: Thu, 29 Oct 2020 14:53:46 -0400 Subject: [PATCH 1/6] Clarify field data cache behavior --- docs/plugins/mapper-annotated-text.asciidoc | 2 +- .../diversified-sampler-aggregation.asciidoc | 2 +- .../significantterms-aggregation.asciidoc | 2 +- docs/reference/cat/fielddata.asciidoc | 4 +- docs/reference/cluster/stats.asciidoc | 2 +- docs/reference/how-to/search-speed.asciidoc | 15 +- .../mapping/fields/id-field.asciidoc | 14 +- docs/reference/mapping/params.asciidoc | 2 - .../params/eager-global-ordinals.asciidoc | 11 +- .../mapping/params/fielddata.asciidoc | 134 ------------------ .../mapping/types/parent-join.asciidoc | 5 +- docs/reference/mapping/types/text.asciidoc | 109 ++++++++++++++ .../modules/indices/circuit_breaker.asciidoc | 6 +- .../modules/indices/fielddata.asciidoc | 46 ++++-- 14 files changed, 178 insertions(+), 176 deletions(-) delete mode 100644 docs/reference/mapping/params/fielddata.asciidoc diff --git a/docs/plugins/mapper-annotated-text.asciidoc b/docs/plugins/mapper-annotated-text.asciidoc index 4a30da47d62c2..a1dd0bd3dd3c9 100644 --- a/docs/plugins/mapper-annotated-text.asciidoc +++ b/docs/plugins/mapper-annotated-text.asciidoc @@ -18,7 +18,7 @@ include::install_remove.asciidoc[] [[mapper-annotated-text-usage]] ==== Using the `annotated-text` field -The `annotated-text` tokenizes text content as per the more common `text` field (see +The `annotated-text` tokenizes text content as per the more common <> (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index: diff --git a/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc b/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc index f49a12ce0eeba..87ee6b62f4b92 100644 --- a/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc +++ b/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc @@ -178,7 +178,7 @@ Each option will hold up to `shard_size` values in memory while performing de-du - hold ordinals of the field as determined by the Lucene index (`global_ordinals`) - hold hashes of the field values - with potential for hash collisions (`bytes_hash`) -The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not. +The default setting is to use <> if this information is available from the Lucene index and reverting to `map` if not. The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. diff --git a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc index 92ded6243ccca..a9dcfe1532890 100644 --- a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc +++ b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc @@ -550,7 +550,7 @@ A description of the different collection modes can be found in the There are different mechanisms by which terms aggregations can be executed: - by using field values directly in order to aggregate data per-bucket (`map`) - - by using global ordinals of the field and allocating one bucket per global ordinal (`global_ordinals`) + - by using <> of the field and allocating one bucket per global ordinal (`global_ordinals`) Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. diff --git a/docs/reference/cat/fielddata.asciidoc b/docs/reference/cat/fielddata.asciidoc index 60acf19095c90..20c6207cf399c 100644 --- a/docs/reference/cat/fielddata.asciidoc +++ b/docs/reference/cat/fielddata.asciidoc @@ -4,8 +4,8 @@ cat fielddata ++++ -Returns the amount of heap memory currently used by fielddata on every data node -in the cluster. +Returns the amount of heap memory currently used by the +<> on every data node in the cluster. [[cat-fielddata-api-request]] diff --git a/docs/reference/cluster/stats.asciidoc b/docs/reference/cluster/stats.asciidoc index b5affc837a7af..3f5790e1ba885 100644 --- a/docs/reference/cluster/stats.asciidoc +++ b/docs/reference/cluster/stats.asciidoc @@ -246,7 +246,7 @@ activities. `fielddata`:: (object) -Contains statistics about the field data cache of selected nodes. +Contains statistics about the <> of selected nodes. + .Properties of `fielddata` [%collapsible%open] diff --git a/docs/reference/how-to/search-speed.asciidoc b/docs/reference/how-to/search-speed.asciidoc index 79df665127edc..565a427531df5 100644 --- a/docs/reference/how-to/search-speed.asciidoc +++ b/docs/reference/how-to/search-speed.asciidoc @@ -303,13 +303,14 @@ may become much worse. [discrete] === Warm up global ordinals -Global ordinals are a data-structure that is used in order to run -<> aggregations on -<> fields. They are loaded lazily in memory because -Elasticsearch does not know which fields will be used in `terms` aggregations -and which fields won't. You can tell Elasticsearch to load global ordinals -eagerly when starting or refreshing a shard by configuring mappings as -described below: +<> are a data structure that is used in +order to increase aggregation speed. They are calculated lazily and stored in +the JVM heap as part of the <>. For fields +that are heavily used for bucketing aggregations, you can tell {es} to add to +the cache before requests are received. This should be done carefully because it +will increase heap usage and delay indexing until the cache is created. This can +be set dynamically on an existing mapping by setting the +<> mappping parameter: [source,console] -------------------------------------------------- diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc index 33f1e8eb7178c..8bf53798cd76a 100644 --- a/docs/reference/mapping/fields/id-field.asciidoc +++ b/docs/reference/mapping/fields/id-field.asciidoc @@ -33,12 +33,14 @@ GET my-index-000001/_search <1> Querying on the `_id` field (also see the <>) -The value of the `_id` field is also accessible in aggregations or for sorting, -but doing so is discouraged as it requires to load a lot of data in memory. In -case sorting or aggregating on the `_id` field is required, it is advised to -duplicate the content of the `_id` field in another field that has `doc_values` -enabled. - +The `_id` field is by default not available by default for use with aggregations or sorting. +To aggregate or sort by the `_id` field, it is recommended to +duplicate the `_id` field onto a `keyword` field using the <>. + +It is not recommended to enable `_id` fields to be aggregated using the <>, +but it is possible. This can be done by <> +to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id` +field will use significant memory and show deprecation warnings in the logs. [NOTE] ================================================== diff --git a/docs/reference/mapping/params.asciidoc b/docs/reference/mapping/params.asciidoc index a3ddbec095342..f0f9e9a41a7a6 100644 --- a/docs/reference/mapping/params.asciidoc +++ b/docs/reference/mapping/params.asciidoc @@ -49,8 +49,6 @@ include::params/eager-global-ordinals.asciidoc[] include::params/enabled.asciidoc[] -include::params/fielddata.asciidoc[] - include::params/format.asciidoc[] include::params/ignore-above.asciidoc[] diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc index 4b1ae5f626f71..9f771a3d66745 100644 --- a/docs/reference/mapping/params/eager-global-ordinals.asciidoc +++ b/docs/reference/mapping/params/eager-global-ordinals.asciidoc @@ -34,11 +34,12 @@ to be enabled. * Operations on parent and child documents from a `join` field, including `has_child` queries and `parent` aggregations. -NOTE: The global ordinal mapping is an on-heap data structure. When measuring -memory usage, Elasticsearch counts the memory from global ordinals as -'fielddata'. Global ordinals memory is included in the -<>, and is returned -under `fielddata` in the <> response. +NOTE: The global ordinal mapping use heap memory as part of the +<>. Aggregations that include high +cardinality values can use a significant amount of heap memory, and +could exceed the threshold of the +<>. +It is recommended to set a specific limit for the field data cache size. ==== Loading global ordinals diff --git a/docs/reference/mapping/params/fielddata.asciidoc b/docs/reference/mapping/params/fielddata.asciidoc deleted file mode 100644 index 1faa82a53f310..0000000000000 --- a/docs/reference/mapping/params/fielddata.asciidoc +++ /dev/null @@ -1,134 +0,0 @@ -[[fielddata]] -=== `fielddata` - -Most fields are <> by default, which makes them -searchable. Sorting, aggregations, and accessing field values in scripts, -however, requires a different access pattern from search. - -Search needs to answer the question _"Which documents contain this term?"_, -while sorting and aggregations need to answer a different question: _"What is -the value of this field for **this** document?"_. - -Most fields can use index-time, on-disk <> for this -data access pattern, but <> fields do not support `doc_values`. - -Instead, `text` fields use a query-time *in-memory* data structure called -`fielddata`. This data structure is built on demand the first time that a -field is used for aggregations, sorting, or in a script. It is built by -reading the entire inverted index for each segment from disk, inverting the -term ↔︎ document relationship, and storing the result in memory, in the JVM -heap. - -[[fielddata-disabled-text-fields]] -==== Fielddata is disabled on `text` fields by default - -Fielddata can consume a *lot* of heap space, especially when loading high -cardinality `text` fields. Once fielddata has been loaded into the heap, it -remains there for the lifetime of the segment. Also, loading fielddata is an -expensive process which can cause users to experience latency hits. This is -why fielddata is disabled by default. - -If you try to sort, aggregate, or access values from a script on a `text` -field, you will see this exception: - -[literal] -Fielddata is disabled on text fields by default. Set `fielddata=true` on -[`your_field_name`] in order to load fielddata in memory by uninverting the -inverted index. Note that this can however use significant memory. - -[[before-enabling-fielddata]] -==== Before enabling fielddata - -Before you enable fielddata, consider why you are using a `text` field for -aggregations, sorting, or in a script. It usually doesn't make sense to do -so. - -A text field is analyzed before indexing so that a value like -`New York` can be found by searching for `new` or for `york`. A `terms` -aggregation on this field will return a `new` bucket and a `york` bucket, when -you probably want a single bucket called `New York`. - -Instead, you should have a `text` field for full text searches, and an -unanalyzed <> field with <> -enabled for aggregations, as follows: - -[source,console] ---------------------------------- -PUT my-index-000001 -{ - "mappings": { - "properties": { - "my_field": { <1> - "type": "text", - "fields": { - "keyword": { <2> - "type": "keyword" - } - } - } - } - } -} ---------------------------------- - -<1> Use the `my_field` field for searches. -<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts. - -[[enable-fielddata-text-fields]] -==== Enabling fielddata on `text` fields - -You can enable fielddata on an existing `text` field using the -<> as follows: - -[source,console] ------------------------------------ -PUT my-index-000001/_mapping -{ - "properties": { - "my_field": { <1> - "type": "text", - "fielddata": true - } - } -} ------------------------------------ -// TEST[continued] - -<1> The mapping that you specify for `my_field` should consist of the existing - mapping for that field, plus the `fielddata` parameter. - -[[field-data-filtering]] -==== `fielddata_frequency_filter` - -Fielddata filtering can be used to reduce the number of terms loaded into -memory, and thus reduce memory usage. Terms can be filtered by _frequency_: - -The frequency filter allows you to only load terms whose document frequency falls -between a `min` and `max` value, which can be expressed an absolute -number (when the number is bigger than 1.0) or as a percentage -(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated -*per segment*. Percentages are based on the number of docs which have a -value for the field, as opposed to all docs in the segment. - -Small segments can be excluded completely by specifying the minimum -number of docs that the segment should contain with `min_segment_size`: - -[source,console] --------------------------------------------------- -PUT my-index-000001 -{ - "mappings": { - "properties": { - "tag": { - "type": "text", - "fielddata": true, - "fielddata_frequency_filter": { - "min": 0.001, - "max": 0.1, - "min_segment_size": 500 - } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/mapping/types/parent-join.asciidoc b/docs/reference/mapping/types/parent-join.asciidoc index a33ab33baadf3..6826f155cc4f7 100644 --- a/docs/reference/mapping/types/parent-join.asciidoc +++ b/docs/reference/mapping/types/parent-join.asciidoc @@ -120,11 +120,12 @@ PUT my-index-000001/_doc/4?routing=1&refresh <2> `answer` is the name of the join for this document <3> The parent id of this child document -==== Parent-join and performance. +==== Parent-join and performance The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a -significant tax to your query performance. +significant tax to your query performance. It also increases the usage of the JVM heap on the +<>. The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such case is a use case with products diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc index 9ef0399fd16c3..1d816867a637f 100644 --- a/docs/reference/mapping/types/text.asciidoc +++ b/docs/reference/mapping/types/text.asciidoc @@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields: <>:: Metadata about the field. + +[[fielddata]] +==== `fielddata` + +`text` fields are searchable by default, but by default are not available for +aggregations, sorting, or scripting. If you try to sort, aggregate, or access +values from a script on a `text` field, you will see this exception: + +[literal] +Fielddata is disabled on text fields by default. Set `fielddata=true` on +[`your_field_name`] in order to load fielddata in memory by uninverting the +inverted index. Note that this can however use significant memory. + +Field data is the only way to access the analyzed tokens from a full text field +in aggregations, sorting, or scripting. For example, a full text field like `New York` +would get analyzed as `new` and `york`. To aggregate on these tokens requires field data. + +[[before-enabling-fielddata]] +==== Before enabling fielddata + +It usually doesn't make sense to enable fielddata on text fields. Field data +is stored in the heap with the <> because it +is expensive to calculate. Calculating the field data can cause latency spikes, and +increasing heap usage is a cause of cluster performance issues. + +Most users who want to do more with text fields use <> +by having both a `text` field for full text searches, and an +unanalyzed <> field for aggregations, as follows: + +[source,console] +--------------------------------- +PUT my-index-000001 +{ + "mappings": { + "properties": { + "my_field": { <1> + "type": "text", + "fields": { + "keyword": { <2> + "type": "keyword" + } + } + } + } + } +} +--------------------------------- + +<1> Use the `my_field` field for searches. +<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts. + +[[enable-fielddata-text-fields]] +==== Enabling fielddata on `text` fields + +You can enable fielddata on an existing `text` field using the +<> as follows: + +[source,console] +----------------------------------- +PUT my-index-000001/_mapping +{ + "properties": { + "my_field": { <1> + "type": "text", + "fielddata": true + } + } +} +----------------------------------- +// TEST[continued] + +<1> The mapping that you specify for `my_field` should consist of the existing + mapping for that field, plus the `fielddata` parameter. + +[[field-data-filtering]] +==== `fielddata_frequency_filter` + +Fielddata filtering can be used to reduce the number of terms loaded into +memory, and thus reduce memory usage. Terms can be filtered by _frequency_: + +The frequency filter allows you to only load terms whose document frequency falls +between a `min` and `max` value, which can be expressed an absolute +number (when the number is bigger than 1.0) or as a percentage +(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated +*per segment*. Percentages are based on the number of docs which have a +value for the field, as opposed to all docs in the segment. + +Small segments can be excluded completely by specifying the minimum +number of docs that the segment should contain with `min_segment_size`: + +[source,console] +-------------------------------------------------- +PUT my-index-000001 +{ + "mappings": { + "properties": { + "tag": { + "type": "text", + "fielddata": true, + "fielddata_frequency_filter": { + "min": 0.001, + "max": 0.1, + "min_segment_size": 500 + } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc index d06b3f27c11c5..2f85996c0d433 100644 --- a/docs/reference/modules/indices/circuit_breaker.asciidoc +++ b/docs/reference/modules/indices/circuit_breaker.asciidoc @@ -33,9 +33,9 @@ The parent-level breaker can be configured with the following settings: [discrete] ==== Field data circuit breaker The field data circuit breaker allows Elasticsearch to estimate the amount of -memory a field will require to be loaded into memory. It can then prevent the -field data loading by raising an exception. By default the limit is configured -to 40% of the maximum JVM heap. It can be configured with the following +memory a field will require to be loaded into the <>. +It can then prevent the field data loading by raising an exception. By default the +limit is configured to 40% of the maximum JVM heap. It can be configured with the following parameters: [[fielddata-circuit-breaker-limit]] diff --git a/docs/reference/modules/indices/fielddata.asciidoc b/docs/reference/modules/indices/fielddata.asciidoc index 5a2bbac9f379d..d3fae03e3aab5 100644 --- a/docs/reference/modules/indices/fielddata.asciidoc +++ b/docs/reference/modules/indices/fielddata.asciidoc @@ -1,16 +1,41 @@ [[modules-fielddata]] === Field data cache settings -The field data cache is used mainly when sorting on or computing aggregations -on a field. It loads all the field values to memory in order to provide fast -document based access to those values. The field data cache can be -expensive to build for a field, so its recommended to have enough memory -to allocate it, and to keep it loaded. +The field data cache is an in-memory data structure, built on demand +based on the type of query that is being run. It contains both +<> and <>, +which serve similar functions for different types of queries. +The cache uses the JVM heap, so it is important to monitor its use +and not to overload your cluster. -The amount of memory used for the field -data cache can be controlled using `indices.fielddata.cache.size`. Note: -reloading the field data which does not fit into your cache will be expensive -and perform poorly. +Other than fields where the cache is built ahead of time, it is populated as needed +on request. This includes: + +* Certain bucket aggregations on `keyword`, `ip`, and `flattened` fields. This +includes `terms` aggregations, as well as `composite`, `diversified_sampler`, +and `significant_terms`. +* Bucket aggregations on `text` fields that have <> + enabled. +* Bucket aggregations on the <> when it is enabled for aggregation +* Operations on parent and child documents from a `join` field, including +`has_child` queries and `parent` aggregations. + +[discrete] +[[fielddata-sizing]] +==== Cache size + +The entries in the cache are expensive to build, so the default behavior is +to keep the cache loaded in memory + +The default cache size is unlimited, causing the cache to grow until it +reaches the limit set by the <>. +It is recommended to set a cache size limit that is smaller than the circuit breaker +value. Setting the limit will cause the cache to behave as a least-recently-updated +cache, only keeping the most recently requested field data. + +If the field data circuit breaker is reached, preventing further requests, the +best option is to manually <>. This will +allow requests to re-build the cache setting. `indices.fielddata.cache.size`:: (<>) @@ -24,5 +49,4 @@ absolute value, eg `12GB`. Defaults to unbounded. Also see You can monitor memory usage for field data as well as the field data circuit breaker using -<> - +<> or the <> From 57411deb06d4d3880b63607207062b987024a58f Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Thu, 29 Oct 2020 15:19:41 -0400 Subject: [PATCH 2/6] Update docs/plugins/mapper-annotated-text.asciidoc --- docs/plugins/mapper-annotated-text.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/plugins/mapper-annotated-text.asciidoc b/docs/plugins/mapper-annotated-text.asciidoc index a1dd0bd3dd3c9..9307b6aaefe13 100644 --- a/docs/plugins/mapper-annotated-text.asciidoc +++ b/docs/plugins/mapper-annotated-text.asciidoc @@ -18,7 +18,7 @@ include::install_remove.asciidoc[] [[mapper-annotated-text-usage]] ==== Using the `annotated-text` field -The `annotated-text` tokenizes text content as per the more common <> (see +The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index: From 391ab35414f190530d240a554831b17b48ac0803 Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Mon, 2 Nov 2020 11:15:07 -0500 Subject: [PATCH 3/6] [DOCS] Add redirect for `fielddata` --- docs/reference/mapping/types/text.asciidoc | 4 +-- docs/reference/redirects.asciidoc | 42 ++++++++++++++++++++++ 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc index 1d816867a637f..a31f7b172d645 100644 --- a/docs/reference/mapping/types/text.asciidoc +++ b/docs/reference/mapping/types/text.asciidoc @@ -143,7 +143,7 @@ The following parameters are accepted by `text` fields: Metadata about the field. [[fielddata]] -==== `fielddata` +==== `fielddata` mapping parameter `text` fields are searchable by default, but by default are not available for aggregations, sorting, or scripting. If you try to sort, aggregate, or access @@ -216,7 +216,7 @@ PUT my-index-000001/_mapping mapping for that field, plus the `fielddata` parameter. [[field-data-filtering]] -==== `fielddata_frequency_filter` +==== `fielddata_frequency_filter` mapping parameter Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by _frequency_: diff --git a/docs/reference/redirects.asciidoc b/docs/reference/redirects.asciidoc index 2c260b0ecd730..756ea70bd1368 100644 --- a/docs/reference/redirects.asciidoc +++ b/docs/reference/redirects.asciidoc @@ -1231,3 +1231,45 @@ See <>. The autoscaling decision API has been renamed to capacity, see <>. + +[role="exclude",id="caching-heavy-aggregations"] +=== Caching heavy aggregations + +See <>. + +[role="exclude",id="returning-only-agg-results"] +=== Returning only aggregation results + +See <>. + +[role="exclude",id="agg-metadata"] +=== Aggregation metadata + +See <>. + +[role="exclude",id="returning-aggregation-type"] +=== Returning the type of the aggregation + +See <>. + +[role="exclude",id="indexing-aggregation-results"] +=== Indexing aggregation results with transforms + +See <>. + +[role="exclude",id="search-aggregations-matrix"] +=== Matrix aggregations + +See <>. + +[[search-aggregations-pipeline-movavg-aggregation]] +=== Moving average aggregation + +The moving average aggregation has been removed. Use the +<> +instead. + +[[fielddata]] +=== `fielddata` mapping parameter + +See <>. From c121d75cee297a5cabf41e657e1c740c67287b92 Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Mon, 2 Nov 2020 11:24:22 -0500 Subject: [PATCH 4/6] [DOCS] Add anchor for redirect --- docs/reference/mapping/types/text.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc index a31f7b172d645..72970e582913d 100644 --- a/docs/reference/mapping/types/text.asciidoc +++ b/docs/reference/mapping/types/text.asciidoc @@ -142,7 +142,7 @@ The following parameters are accepted by `text` fields: Metadata about the field. -[[fielddata]] +[[fielddata-mapping-param]] ==== `fielddata` mapping parameter `text` fields are searchable by default, but by default are not available for From 784de588a4c5cf17b3fa060f78b68fc6c3cd993a Mon Sep 17 00:00:00 2001 From: Wylie Conlon Date: Tue, 3 Nov 2020 14:08:50 -0500 Subject: [PATCH 5/6] Review comments --- docs/reference/how-to/search-speed.asciidoc | 14 +++--- .../mapping/fields/id-field.asciidoc | 25 +++++------ .../params/eager-global-ordinals.asciidoc | 8 ++-- .../mapping/types/parent-join.asciidoc | 3 +- .../modules/indices/circuit_breaker.asciidoc | 9 ++-- .../modules/indices/fielddata.asciidoc | 45 ++++++------------- 6 files changed, 39 insertions(+), 65 deletions(-) diff --git a/docs/reference/how-to/search-speed.asciidoc b/docs/reference/how-to/search-speed.asciidoc index 565a427531df5..2503c02cc1e3b 100644 --- a/docs/reference/how-to/search-speed.asciidoc +++ b/docs/reference/how-to/search-speed.asciidoc @@ -303,14 +303,14 @@ may become much worse. [discrete] === Warm up global ordinals -<> are a data structure that is used in -order to increase aggregation speed. They are calculated lazily and stored in +<> are a data structure that is used to +optimize the performance of aggregations. They are calculated lazily and stored in the JVM heap as part of the <>. For fields -that are heavily used for bucketing aggregations, you can tell {es} to add to -the cache before requests are received. This should be done carefully because it -will increase heap usage and delay indexing until the cache is created. This can -be set dynamically on an existing mapping by setting the -<> mappping parameter: +that are heavily used for bucketing aggregations, you can tell {es} to construct +and cache the global ordinals before requests are received. This should be done +carefully because it will increase heap usage and delay indexing until the global +ordinals are constructed. This can be set dynamically on an existing mapping by +setting the <> mapping parameter: [source,console] -------------------------------------------------- diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc index 8bf53798cd76a..16ee7d8619408 100644 --- a/docs/reference/mapping/fields/id-field.asciidoc +++ b/docs/reference/mapping/fields/id-field.asciidoc @@ -3,10 +3,14 @@ Each document has an `_id` that uniquely identifies it, which is indexed so that documents can be looked up either with the <> or the -<>. +<>. The `_id` can either be assigned at +indexing time, or a unique `_id` can be generated by {es}. This field is not +configurable. -The value of the `_id` field is accessible in certain queries (`term`, -`terms`, `match`, `query_string`, `simple_query_string`). +The value of the `_id` field is only accessible in certain queries (`term`, +`terms`, `match`, `query_string`, `simple_query_string`), but is restricted +from use in aggregations, sorting, or scripting. For those use cases the +`keyword` type is recommended. [source,console] -------------------------- @@ -16,16 +20,16 @@ PUT my-index-000001/_doc/1 "text": "Document with ID 1" } -PUT my-index-000001/_doc/2?refresh=true +POST my-index-000001/_doc/?refresh=true { - "text": "Document with ID 2" + "text": "Document with generated ID" } GET my-index-000001/_search { "query": { "terms": { - "_id": [ "1", "2" ] <1> + "_id": [ "1", "AhcEj3UB1Y-S1MdSrUDG" ] <1> } } } @@ -33,15 +37,6 @@ GET my-index-000001/_search <1> Querying on the `_id` field (also see the <>) -The `_id` field is by default not available by default for use with aggregations or sorting. -To aggregate or sort by the `_id` field, it is recommended to -duplicate the `_id` field onto a `keyword` field using the <>. - -It is not recommended to enable `_id` fields to be aggregated using the <>, -but it is possible. This can be done by <> -to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id` -field will use significant memory and show deprecation warnings in the logs. - [NOTE] ================================================== `_id` is limited to 512 bytes in size and larger values will be rejected. diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc index 9f771a3d66745..c990abd5da9f4 100644 --- a/docs/reference/mapping/params/eager-global-ordinals.asciidoc +++ b/docs/reference/mapping/params/eager-global-ordinals.asciidoc @@ -35,11 +35,9 @@ to be enabled. `has_child` queries and `parent` aggregations. NOTE: The global ordinal mapping use heap memory as part of the -<>. Aggregations that include high -cardinality values can use a significant amount of heap memory, and -could exceed the threshold of the -<>. -It is recommended to set a specific limit for the field data cache size. +<>. Aggregations on high cardinality fields +can use a significant amount of heap memory, and could exceed the threshold +of the <>. ==== Loading global ordinals diff --git a/docs/reference/mapping/types/parent-join.asciidoc b/docs/reference/mapping/types/parent-join.asciidoc index 6826f155cc4f7..8a4d1e66b390d 100644 --- a/docs/reference/mapping/types/parent-join.asciidoc +++ b/docs/reference/mapping/types/parent-join.asciidoc @@ -124,8 +124,7 @@ PUT my-index-000001/_doc/4?routing=1&refresh The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a -significant tax to your query performance. It also increases the usage of the JVM heap on the -<>. +significant tax to your query performance. It can also trigger <> to be built. The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such case is a use case with products diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc index 2f85996c0d433..2fd929f85cedb 100644 --- a/docs/reference/modules/indices/circuit_breaker.asciidoc +++ b/docs/reference/modules/indices/circuit_breaker.asciidoc @@ -32,11 +32,10 @@ The parent-level breaker can be configured with the following settings: [[fielddata-circuit-breaker]] [discrete] ==== Field data circuit breaker -The field data circuit breaker allows Elasticsearch to estimate the amount of -memory a field will require to be loaded into the <>. -It can then prevent the field data loading by raising an exception. By default the -limit is configured to 40% of the maximum JVM heap. It can be configured with the following -parameters: +The field data circuit breaker estimates the heap memory required to load a +field into the <>. If loading the field would +cause the cache to exceed a predefined memory limit, the circuit breaker stops the +operation and returns an error. [[fielddata-circuit-breaker-limit]] // tag::fielddata-circuit-breaker-limit-tag[] diff --git a/docs/reference/modules/indices/fielddata.asciidoc b/docs/reference/modules/indices/fielddata.asciidoc index d3fae03e3aab5..8795f41d07144 100644 --- a/docs/reference/modules/indices/fielddata.asciidoc +++ b/docs/reference/modules/indices/fielddata.asciidoc @@ -1,47 +1,30 @@ [[modules-fielddata]] === Field data cache settings -The field data cache is an in-memory data structure, built on demand -based on the type of query that is being run. It contains both -<> and <>, -which serve similar functions for different types of queries. -The cache uses the JVM heap, so it is important to monitor its use -and not to overload your cluster. - -Other than fields where the cache is built ahead of time, it is populated as needed -on request. This includes: - -* Certain bucket aggregations on `keyword`, `ip`, and `flattened` fields. This -includes `terms` aggregations, as well as `composite`, `diversified_sampler`, -and `significant_terms`. -* Bucket aggregations on `text` fields that have <> - enabled. -* Bucket aggregations on the <> when it is enabled for aggregation -* Operations on parent and child documents from a `join` field, including -`has_child` queries and `parent` aggregations. +The field data cache contains <> and <>, +which are both used to support aggregations on certain field types. +Since these are on-heap data structures, it is important to monitor the cache's use. [discrete] [[fielddata-sizing]] ==== Cache size The entries in the cache are expensive to build, so the default behavior is -to keep the cache loaded in memory +to keep the cache loaded in memory. The default cache size is unlimited, +causing the cache to grow until it reaches the limit set by the <>. This behavior can be configured. -The default cache size is unlimited, causing the cache to grow until it -reaches the limit set by the <>. -It is recommended to set a cache size limit that is smaller than the circuit breaker -value. Setting the limit will cause the cache to behave as a least-recently-updated -cache, only keeping the most recently requested field data. +If the cache size limit is set, the cache will begin clearing the least-recently-updated +entries in the cache. This setting can automatically avoid the circuit breaker limit, +at the cost of rebuilding the cache as needed. -If the field data circuit breaker is reached, preventing further requests, the -best option is to manually <>. This will -allow requests to re-build the cache setting. +If the circuit breaker limit is reached, further requests that increase the cache +size will be prevented. In this case you shoul manually <>. `indices.fielddata.cache.size`:: (<>) -The max size of the field data cache, eg `30%` of node heap space, or an -absolute value, eg `12GB`. Defaults to unbounded. Also see -<>. +The max size of the field data cache, eg `38%` of node heap space, or an +absolute value, eg `12GB`. Defaults to unbounded. Should be set smaller than the +<>, if set. [discrete] [[fielddata-monitoring]] @@ -49,4 +32,4 @@ absolute value, eg `12GB`. Defaults to unbounded. Also see You can monitor memory usage for field data as well as the field data circuit breaker using -<> or the <> +the <> or the <>. From b83b08036fc0709d1ef2793aa63483cea7e4d6aa Mon Sep 17 00:00:00 2001 From: Julie Tibshirani Date: Fri, 20 Nov 2020 11:05:05 -0800 Subject: [PATCH 6/6] Address review comments. --- docs/reference/how-to/search-speed.asciidoc | 18 +++++++++--------- .../mapping/fields/id-field.asciidoc | 19 +++++++++++-------- .../params/eager-global-ordinals.asciidoc | 6 +++--- .../modules/indices/fielddata.asciidoc | 6 +++--- 4 files changed, 26 insertions(+), 23 deletions(-) diff --git a/docs/reference/how-to/search-speed.asciidoc b/docs/reference/how-to/search-speed.asciidoc index 2503c02cc1e3b..e51c7fa2b7821 100644 --- a/docs/reference/how-to/search-speed.asciidoc +++ b/docs/reference/how-to/search-speed.asciidoc @@ -308,8 +308,8 @@ optimize the performance of aggregations. They are calculated lazily and stored the JVM heap as part of the <>. For fields that are heavily used for bucketing aggregations, you can tell {es} to construct and cache the global ordinals before requests are received. This should be done -carefully because it will increase heap usage and delay indexing until the global -ordinals are constructed. This can be set dynamically on an existing mapping by +carefully because it will increase heap usage and can make <> +take longer. The option can be updated dynamically on an existing mapping by setting the <> mapping parameter: [source,console] @@ -393,19 +393,19 @@ right number of replicas for you is === Tune your queries with the Profile API -You can also analyse how expensive each component of your queries and -aggregations are using the {ref}/search-profile.html[Profile API]. This might -allow you to tune your queries to be less expensive, resulting in a positive -performance result and reduced load. Also note that Profile API payloads can be -easily visualised for better readability in the -{kibana-ref}/xpack-profiler.html[Search Profiler], which is a Kibana dev tools +You can also analyse how expensive each component of your queries and +aggregations are using the {ref}/search-profile.html[Profile API]. This might +allow you to tune your queries to be less expensive, resulting in a positive +performance result and reduced load. Also note that Profile API payloads can be +easily visualised for better readability in the +{kibana-ref}/xpack-profiler.html[Search Profiler], which is a Kibana dev tools UI available in all X-Pack licenses, including the free X-Pack Basic license. Some caveats to the Profile API are that: - the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output - given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences - - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause + - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause [[faster-phrase-queries]] === Faster phrase queries with `index_phrases` diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc index 16ee7d8619408..1e963dd6de7d7 100644 --- a/docs/reference/mapping/fields/id-field.asciidoc +++ b/docs/reference/mapping/fields/id-field.asciidoc @@ -5,12 +5,10 @@ Each document has an `_id` that uniquely identifies it, which is indexed so that documents can be looked up either with the <> or the <>. The `_id` can either be assigned at indexing time, or a unique `_id` can be generated by {es}. This field is not -configurable. +configurable in the mappings. -The value of the `_id` field is only accessible in certain queries (`term`, -`terms`, `match`, `query_string`, `simple_query_string`), but is restricted -from use in aggregations, sorting, or scripting. For those use cases the -`keyword` type is recommended. +The value of the `_id` field is accessible in queries such as `term`, +`terms`, `match`, and `query_string`. [source,console] -------------------------- @@ -20,16 +18,16 @@ PUT my-index-000001/_doc/1 "text": "Document with ID 1" } -POST my-index-000001/_doc/?refresh=true +PUT my-index-000001/_doc/2?refresh=true { - "text": "Document with generated ID" + "text": "Document with ID 2" } GET my-index-000001/_search { "query": { "terms": { - "_id": [ "1", "AhcEj3UB1Y-S1MdSrUDG" ] <1> + "_id": [ "1", "2" ] <1> } } } @@ -37,6 +35,11 @@ GET my-index-000001/_search <1> Querying on the `_id` field (also see the <>) +The `_id` field is restricted from use in aggregations, sorting, and scripting. +In case sorting or aggregating on the `_id` field is required, it is advised to +duplicate the content of the `_id` field into another field that has +`doc_values` enabled. + [NOTE] ================================================== `_id` is limited to 512 bytes in size and larger values will be rejected. diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc index c990abd5da9f4..76f2f41656469 100644 --- a/docs/reference/mapping/params/eager-global-ordinals.asciidoc +++ b/docs/reference/mapping/params/eager-global-ordinals.asciidoc @@ -34,10 +34,10 @@ to be enabled. * Operations on parent and child documents from a `join` field, including `has_child` queries and `parent` aggregations. -NOTE: The global ordinal mapping use heap memory as part of the +NOTE: The global ordinal mapping uses heap memory as part of the <>. Aggregations on high cardinality fields -can use a significant amount of heap memory, and could exceed the threshold -of the <>. +can use a lot of memory and trigger the <>. ==== Loading global ordinals diff --git a/docs/reference/modules/indices/fielddata.asciidoc b/docs/reference/modules/indices/fielddata.asciidoc index 8795f41d07144..1383bf74d6d4c 100644 --- a/docs/reference/modules/indices/fielddata.asciidoc +++ b/docs/reference/modules/indices/fielddata.asciidoc @@ -18,13 +18,13 @@ entries in the cache. This setting can automatically avoid the circuit breaker l at the cost of rebuilding the cache as needed. If the circuit breaker limit is reached, further requests that increase the cache -size will be prevented. In this case you shoul manually <>. +size will be prevented. In this case you should manually <>. `indices.fielddata.cache.size`:: (<>) The max size of the field data cache, eg `38%` of node heap space, or an -absolute value, eg `12GB`. Defaults to unbounded. Should be set smaller than the -<>, if set. +absolute value, eg `12GB`. Defaults to unbounded. If you choose to set it, +it should be smaller than <> limit. [discrete] [[fielddata-monitoring]]