elastic · jtibshirani · Nov 20, 2020 · Oct 29, 2020 · Oct 29, 2020 · Nov 2, 2020
diff --git a/docs/plugins/mapper-annotated-text.asciidoc b/docs/plugins/mapper-annotated-text.asciidoc
@@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
 [[mapper-annotated-text-usage]]
 ==== Using the `annotated-text` field
 
-The `annotated-text` tokenizes text content as per the more common `text` field (see 
+The `annotated-text` tokenizes text content as per the more common <<text, `text` field>> (see 
 "limitations" below) but also injects any marked-up annotation tokens directly into
 the search index:
 

diff --git a/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc b/docs/reference/aggregations/bucket/diversified-sampler-aggregation.asciidoc
@@ -178,7 +178,7 @@ Each option will hold up to `shard_size` values in memory while performing de-du
  - hold ordinals of the field as determined by the Lucene index (`global_ordinals`)
  - hold hashes of the field values - with potential for hash collisions (`bytes_hash`)
 
-The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not.
+The default setting is to use <<eager-global-ordinals,`global_ordinals`>> if this information is available from the Lucene index and to revert to `map` if not.
 The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
 Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
 

diff --git a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc
@@ -550,7 +550,7 @@ A description of the different collection modes can be found in the
 There are different mechanisms by which terms aggregations can be executed:
 
  - by using field values directly in order to aggregate data per-bucket (`map`)
- - by using global ordinals of the field and allocating one bucket per global ordinal (`global_ordinals`)
+ - by using <<eager-global-ordinals,global ordinals>> of the field and allocating one bucket per global ordinal (`global_ordinals`)
 
 Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured.
 

diff --git a/docs/reference/cat/fielddata.asciidoc b/docs/reference/cat/fielddata.asciidoc
@@ -4,8 +4,8 @@
 <titleabbrev>cat fielddata</titleabbrev>
 ++++
 
-Returns the amount of heap memory currently used by fielddata on every data node
-in the cluster.
+Returns the amount of heap memory currently used by the
+<<modules-fielddata, field data cache>> on every data node in the cluster.
 
 
 [[cat-fielddata-api-request]]

diff --git a/docs/reference/cluster/stats.asciidoc b/docs/reference/cluster/stats.asciidoc
@@ -246,7 +246,7 @@ activities.
 
 `fielddata`::
 (object)
-Contains statistics about the field data cache of selected nodes.
+Contains statistics about the <<modules-fielddata, field data cache>> of selected nodes.
 +
 .Properties of `fielddata`
 [%collapsible%open]

diff --git a/docs/reference/how-to/search-speed.asciidoc b/docs/reference/how-to/search-speed.asciidoc
@@ -303,13 +303,14 @@ may become much worse.
 [discrete]
 === Warm up global ordinals
 
-Global ordinals are a data-structure that is used in order to run
-<<search-aggregations-bucket-terms-aggregation,`terms`>> aggregations on
-<<keyword,`keyword`>> fields. They are loaded lazily in memory because
-Elasticsearch does not know which fields will be used in `terms` aggregations
-and which fields won't. You can tell Elasticsearch to load global ordinals
-eagerly when starting or refreshing a shard by configuring mappings as
-described below:
+<<eager-global-ordinals,Global ordinals>> are a data structure that is used in
+order to increase aggregation speed. They are calculated lazily and stored in
+the JVM heap as part of the <<modules-fielddata, field data cache>>. For fields
+that are heavily used for bucketing aggregations, you can tell {es} to build and
+cache global ordinals before requests are received. This should be done carefully because it
+will increase heap usage and delay indexing until the cache is created. This can
+be set dynamically on an existing mapping by setting the
+<<eager-global-ordinals, eager global ordinals>> mapping parameter:
 
 [source,console]
 --------------------------------------------------

diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc
@@ -33,12 +33,14 @@ GET my-index-000001/_search
 
 <1> Querying on the `_id` field (also see the <<query-dsl-ids-query,`ids` query>>)
 
-The value of the `_id` field is also accessible in aggregations or for sorting,
-but doing so is discouraged as it requires to load a lot of data in memory. In
-case sorting or aggregating on the `_id` field is required, it is advised to
-duplicate the content of the `_id` field in another field that has `doc_values`
-enabled.
-
+By default, the `_id` field is not available for use with aggregations or sorting.
+To aggregate or sort by the `_id` field, it is recommended to 
+duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.
+
+It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>,
+but it is possible. This can be done by <<cluster-update-settings, changing the cluster setting>>
+to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id`
+field will use significant memory and show deprecation warnings in the logs.
 
 [NOTE]
 ==================================================

diff --git a/docs/reference/mapping/params.asciidoc b/docs/reference/mapping/params.asciidoc
@@ -49,8 +49,6 @@ include::params/eager-global-ordinals.asciidoc[]
 
 include::params/enabled.asciidoc[]
 
-include::params/fielddata.asciidoc[]
-
 include::params/format.asciidoc[]
 
 include::params/ignore-above.asciidoc[]

diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc
@@ -34,11 +34,12 @@ to be enabled.
 * Operations on parent and child documents from a `join` field, including
 `has_child` queries and `parent` aggregations.
 
-NOTE: The global ordinal mapping is an on-heap data structure. When measuring
-memory usage, Elasticsearch counts the memory from global ordinals as
-'fielddata'. Global ordinals memory is included in the
-<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned
-under `fielddata` in the <<cluster-nodes-stats, node stats>> response.
+NOTE: The global ordinal mapping uses heap memory as part of the
+<<modules-fielddata, field data cache>>. Aggregations that include high
+cardinality values can use a significant amount of heap memory, and
+could exceed the threshold of the
+<<fielddata-circuit-breaker, field data circuit breaker>>.
+It is recommended to set a specific limit for the field data cache size.
 
 ==== Loading global ordinals
 

diff --git a/docs/reference/mapping/params/fielddata.asciidoc b/docs/reference/mapping/params/fielddata.asciidoc
diff --git a/docs/reference/mapping/types/parent-join.asciidoc b/docs/reference/mapping/types/parent-join.asciidoc
@@ -120,11 +120,12 @@ PUT my-index-000001/_doc/4?routing=1&refresh
 <2> `answer` is the name of the join for this document
 <3> The parent id of this child document
 
-==== Parent-join and performance.
+==== Parent-join and performance
 
 The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance
 is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a
-significant tax to your query performance.
+significant tax to your query performance. It also increases JVM heap usage through the
+<<modules-fielddata, field data cache>>.
 
 The only case where the join field makes sense is if your data contains a one-to-many relationship where
 one entity significantly outnumbers the other entity. An example of such case is a use case with products

diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc
@@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields:
 <<mapping-field-meta,`meta`>>::
 
     Metadata about the field.
+
+[[fielddata]]
+==== `fielddata`
+
+`text` fields are searchable by default, but are not available by default for
+aggregations, sorting, or scripting. If you try to sort, aggregate, or access
+values from a script on a `text` field, you will see this exception:
+
+[literal]
+Fielddata is disabled on text fields by default.  Set `fielddata=true` on
+[`your_field_name`] in order to load fielddata in memory by uninverting the
+inverted index. Note that this can however use significant memory.
+
+Field data is the only way to access the analyzed tokens from a full text field
+in aggregations, sorting, or scripting. For example, a full text field value like `New York`
+would get analyzed as `new` and `york`. Aggregating on these tokens requires field data.
+
+[[before-enabling-fielddata]]
+==== Before enabling fielddata
+
+It usually doesn't make sense to enable fielddata on text fields. Field data
+is stored in the heap with the <<modules-fielddata, field data cache>> because it
+is expensive to calculate. Calculating the field data can cause latency spikes, and
+increasing heap usage is a cause of cluster performance issues.
+
+Most users who want to do more with text fields use <<multi-fields, multi-field mappings>>
+by having both a `text` field for full text searches, and an
+unanalyzed <<keyword,`keyword`>> field for aggregations, as follows:
+
+[source,console]
+---------------------------------
+PUT my-index-000001
+{
+  "mappings": {
+    "properties": {
+      "my_field": { <1>
+        "type": "text",
+        "fields": {
+          "keyword": { <2>
+            "type": "keyword"
+          }
+        }
+      }
+    }
+  }
+}
+---------------------------------
+
+<1> Use the `my_field` field for searches.
+<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
+
+[[enable-fielddata-text-fields]]
+==== Enabling fielddata on `text` fields
+
+You can enable fielddata on an existing `text` field using the
+<<indices-put-mapping,PUT mapping API>> as follows:
+
+[source,console]
+-----------------------------------
+PUT my-index-000001/_mapping
+{
+  "properties": {
+    "my_field": { <1>
+      "type":     "text",
+      "fielddata": true
+    }
+  }
+}
+-----------------------------------
+// TEST[continued]
+
+<1> The mapping that you specify for `my_field` should consist of the existing
+    mapping for that field, plus the `fielddata` parameter.
+
+[[field-data-filtering]]
+==== `fielddata_frequency_filter`
+
+Fielddata filtering can be used to reduce the number of terms loaded into
+memory, and thus reduce memory usage. Terms can be filtered by _frequency_:
+
+The frequency filter allows you to only load terms whose document frequency falls
+between a `min` and `max` value, which can be expressed as an absolute
+number (when the number is bigger than 1.0) or as a percentage
+(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
+*per segment*. Percentages are based on the number of docs which have a
+value for the field, as opposed to all docs in the segment.
+
+Small segments can be excluded completely by specifying the minimum
+number of docs that the segment should contain with `min_segment_size`:
+
+[source,console]
+--------------------------------------------------
+PUT my-index-000001
+{
+  "mappings": {
+    "properties": {
+      "tag": {
+        "type": "text",
+        "fielddata": true,
+        "fielddata_frequency_filter": {
+          "min": 0.001,
+          "max": 0.1,
+          "min_segment_size": 500
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc
@@ -33,9 +33,9 @@ The parent-level breaker can be configured with the following settings:
 [discrete]
 ==== Field data circuit breaker
 The field data circuit breaker allows Elasticsearch to estimate the amount of
-memory a field will require to be loaded into memory. It can then prevent the
-field data loading by raising an exception. By default the limit is configured
-to 40% of the maximum JVM heap. It can be configured with the following
+memory a field will require to be loaded into the <<modules-fielddata, field data cache>>.
+It can then prevent the field data loading by raising an exception. By default the
+limit is configured to 40% of the maximum JVM heap. It can be configured with the following
 parameters:
 
 [[fielddata-circuit-breaker-limit]]