[DOCS] Clarify field data cache behavior #64375

Merged
merged 7 commits into from
Nov 20, 2020
Changes from 1 commit
2 changes: 1 addition & 1 deletion docs/plugins/mapper-annotated-text.asciidoc
@@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
[[mapper-annotated-text-usage]]
==== Using the `annotated-text` field

The `annotated-text` tokenizes text content as per the more common `text` field (see
The `annotated-text` tokenizes text content as per the more common <<text, `text` field>> (see
jrodewig marked this conversation as resolved.
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

@@ -178,7 +178,7 @@ Each option will hold up to `shard_size` values in memory while performing de-du
- hold ordinals of the field as determined by the Lucene index (`global_ordinals`)
- hold hashes of the field values - with potential for hash collisions (`bytes_hash`)

The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not.
The default setting is to use <<eager-global-ordinals,`global_ordinals`>> if this information is available from the Lucene index and reverting to `map` if not.
The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.

@@ -550,7 +550,7 @@ A description of the different collection modes can be found in the
There are different mechanisms by which terms aggregations can be executed:

- by using field values directly in order to aggregate data per-bucket (`map`)
- by using global ordinals of the field and allocating one bucket per global ordinal (`global_ordinals`)
- by using <<eager-global-ordinals,global ordinals>> of the field and allocating one bucket per global ordinal (`global_ordinals`)

Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured.

4 changes: 2 additions & 2 deletions docs/reference/cat/fielddata.asciidoc
@@ -4,8 +4,8 @@
<titleabbrev>cat fielddata</titleabbrev>
++++

Returns the amount of heap memory currently used by fielddata on every data node
in the cluster.
Returns the amount of heap memory currently used by the
<<modules-fielddata, field data cache>> on every data node in the cluster.


[[cat-fielddata-api-request]]
2 changes: 1 addition & 1 deletion docs/reference/cluster/stats.asciidoc
@@ -246,7 +246,7 @@ activities.
`fielddata`::
(object)
Contains statistics about the field data cache of selected nodes.
Contains statistics about the <<modules-fielddata, field data cache>> of selected nodes.
+
.Properties of `fielddata`
[%collapsible%open]
15 changes: 8 additions & 7 deletions docs/reference/how-to/search-speed.asciidoc
@@ -303,13 +303,14 @@ may become much worse.
[discrete]
=== Warm up global ordinals

Global ordinals are a data-structure that is used in order to run
<<search-aggregations-bucket-terms-aggregation,`terms`>> aggregations on
<<keyword,`keyword`>> fields. They are loaded lazily in memory because
Elasticsearch does not know which fields will be used in `terms` aggregations
and which fields won't. You can tell Elasticsearch to load global ordinals
eagerly when starting or refreshing a shard by configuring mappings as
described below:
<<eager-global-ordinals,Global ordinals>> are a data structure that is used in
order to increase aggregation speed. They are calculated lazily and stored in
the JVM heap as part of the <<modules-fielddata, field data cache>>. For fields
that are heavily used for bucketing aggregations, you can tell {es} to add to
the cache before requests are received. This should be done carefully because it
will increase heap usage and delay indexing until the cache is created. This can
be set dynamically on an existing mapping by setting the
<<eager-global-ordinals, eager global ordinals>> mappping parameter:

Review comment (Contributor):

A few small comments to make the language more precise:

  • Saying "in order to increase aggregation speed" isn't quite accurate because global ordinals help a lot with memory usage. Maybe we could say "Global ordinals are a data structure that are used to optimize the performance of certain aggregations?"
  • "you can tell {es} to add to the cache" -> "you can tell {es} to construct and cache the global ordinals..."
  • "until the cache is created" -> "until the global ordinals are constructed"
  • mappping -> mapping


[source,console]
--------------------------------------------------
14 changes: 8 additions & 6 deletions docs/reference/mapping/fields/id-field.asciidoc
@@ -33,12 +33,14 @@ GET my-index-000001/_search

<1> Querying on the `_id` field (also see the <<query-dsl-ids-query,`ids` query>>)

The value of the `_id` field is also accessible in aggregations or for sorting,
but doing so is discouraged as it requires to load a lot of data in memory. In
case sorting or aggregating on the `_id` field is required, it is advised to
duplicate the content of the `_id` field in another field that has `doc_values`
enabled.

The `_id` field is not available by default for use with aggregations or sorting.
To aggregate or sort by the `_id` field, it is recommended to
duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.

Review comment (Contributor):

Really small comment, the link text is usually just the parameter name: <<copy-to, `copy_to`>>

Review comment (Contributor, @jtibshirani, Nov 2, 2020):

I just realized that it's not possible to use copy_to with the _id mapper, so this only works for custom IDs where the user can manually duplicate the ID into another field.

Review comment (Author):

Okay I'll clarify that, that wasn't clear in the original text. Going to move this entire section to the top.
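
As a rough illustration of the workaround discussed in this thread, the sketch below builds an index request in which a custom ID is duplicated by hand into a separate field. The `doc_id` field name, the example ID, and the `build_index_request` helper are hypothetical; only the `PUT <index>/_doc/<id>` request shape comes from Elasticsearch.

```python
import json


def build_index_request(index, custom_id, source):
    """Return (method, path, body) for indexing a document whose custom ID
    is also stored in a regular field.

    `doc_id` is a hypothetical field name; it is assumed to be mapped as
    `keyword` so that it can be used for sorting and aggregations.
    """
    body = dict(source)          # copy so the caller's dict is not mutated
    body["doc_id"] = custom_id   # manual duplicate of the ID
    return "PUT", f"{index}/_doc/{custom_id}", body


method, path, body = build_index_request(
    "my-index-000001", "order-42", {"text": "Document with a custom ID"}
)
print(method, path)
print(json.dumps(body, indent=2))
```

This only works when the client chooses the ID, since the duplicate has to be written by whoever builds the document.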


It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>,

Review comment (Contributor):

Since we soon plan to entirely remove the ability to sort/ aggregate on _id, I think it'd be best not to mention the cluster setting. It mostly just helps with the 7.x -> 8.x upgrade.

It looks like we forgot to mention _id field data in the 8.0 breaking changes docs though. I can fix that in a follow-up.

but it is possible. This can be done by <<cluster-update-settings, changing the cluster setting>>
to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id`
field will use significant memory and show deprecation warnings in the logs.

[NOTE]
==================================================
2 changes: 0 additions & 2 deletions docs/reference/mapping/params.asciidoc
@@ -49,8 +49,6 @@ include::params/eager-global-ordinals.asciidoc[]

include::params/enabled.asciidoc[]

include::params/fielddata.asciidoc[]

include::params/format.asciidoc[]

include::params/ignore-above.asciidoc[]
11 changes: 6 additions & 5 deletions docs/reference/mapping/params/eager-global-ordinals.asciidoc
@@ -34,11 +34,12 @@ to be enabled.
* Operations on parent and child documents from a `join` field, including
`has_child` queries and `parent` aggregations.

NOTE: The global ordinal mapping is an on-heap data structure. When measuring
memory usage, Elasticsearch counts the memory from global ordinals as
'fielddata'. Global ordinals memory is included in the
<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned
under `fielddata` in the <<cluster-nodes-stats, node stats>> response.
NOTE: The global ordinal mapping uses heap memory as part of the
jtibshirani marked this conversation as resolved.
<<modules-fielddata, field data cache>>. Aggregations that include high
cardinality values can use a significant amount of heap memory, and
could exceed the threshold of the
<<fielddata-circuit-breaker, field data circuit breaker>>.
It is recommended to set a specific limit for the field data cache size.

Review comment (Contributor):

We are actually still discussing this recommendation in #59829, perhaps we could hold off on adding this sentence until we have a conclusion.

Also maybe "Aggregations that include high cardinality values" -> "Aggregations on high cardinality fields" ?


==== Loading global ordinals

134 changes: 0 additions & 134 deletions docs/reference/mapping/params/fielddata.asciidoc

This file was deleted.

5 changes: 3 additions & 2 deletions docs/reference/mapping/types/parent-join.asciidoc
@@ -120,11 +120,12 @@ PUT my-index-000001/_doc/4?routing=1&refresh
<2> `answer` is the name of the join for this document
<3> The parent id of this child document

==== Parent-join and performance.
==== Parent-join and performance

The join field shouldn't be used like joins in a relational database. In Elasticsearch the key to good performance
is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a
significant tax to your query performance.
significant tax to your query performance. It also increases the usage of the JVM heap on the

Review comment (Contributor):

I'm not actually sure it's a significant contributor to heap usage, since only one join mapping is allowed per index? But it's helpful to know that it produces field data, maybe we could just say "It can also trigger <<eager-global-ordinals, global ordinals>> to be built."

<<modules-fielddata, field data cache>>.

The only case where the join field makes sense is if your data contains a one-to-many relationship where
one entity significantly outnumbers the other entity. An example of such case is a use case with products
109 changes: 109 additions & 0 deletions docs/reference/mapping/types/text.asciidoc
@@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields:
<<mapping-field-meta,`meta`>>::

Metadata about the field.

[[fielddata]]
==== `fielddata`

Review comment (Contributor):

I like this consolidation, it makes it clear this description only applies to text.


`text` fields are searchable by default, but by default are not available for
aggregations, sorting, or scripting. If you try to sort, aggregate, or access
values from a script on a `text` field, you will see this exception:

[literal]
Fielddata is disabled on text fields by default. Set `fielddata=true` on
[`your_field_name`] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.

Field data is the only way to access the analyzed tokens from a full text field
in aggregations, sorting, or scripting. For example, a full text field like `New York`
would get analyzed as `new` and `york`. To aggregate on these tokens requires field data.
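
To make the `New York` example concrete, the sketch below mimics analysis with a deliberately simplified stand-in (lowercase, then split on whitespace) and counts documents per token, roughly the way a terms-style aggregation over field data would. The `analyze` helper and the sample values are hypothetical, and the helper is not the Elasticsearch standard analyzer.

```python
from collections import Counter


def analyze(text):
    """Simplified stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


values = ["New York", "New Jersey", "York"]

# Buckets over analyzed tokens (what field data on a `text` field exposes).
# Each value counts once per distinct token it contains.
token_counts = Counter(token for value in values for token in set(analyze(value)))

# Buckets over the original, unanalyzed values.
value_counts = Counter(values)

print(token_counts)
print(value_counts)
```

The two counters differ because the first one buckets on individual tokens such as `new` and `york`, while the second buckets on whole values such as `New York`.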

[[before-enabling-fielddata]]
==== Before enabling fielddata

It usually doesn't make sense to enable fielddata on text fields. Field data
is stored on the heap in the <<modules-fielddata, field data cache>> because it
is expensive to calculate. Calculating the field data can cause latency spikes, and
the increased heap usage can cause cluster performance issues.

Most users who want to do more with text fields use <<multi-fields, multi-field mappings>>
by having both a `text` field for full text searches, and an
unanalyzed <<keyword,`keyword`>> field for aggregations, as follows:

[source,console]
---------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": { <1>
        "type": "text",
        "fields": {
          "keyword": { <2>
            "type": "keyword"
          }
        }
      }
    }
  }
}
---------------------------------

<1> Use the `my_field` field for searches.
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
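
With the mapping above, a search request could match on the analyzed field and bucket on the `keyword` sub-field. The sketch below only assembles such a request body; the aggregation name `my_field_values` and the query text are made up for illustration.

```python
import json

search_body = {
    # Full text search runs against the analyzed `text` field.
    "query": {"match": {"my_field": "new york"}},
    "aggs": {
        # Hypothetical aggregation name.
        "my_field_values": {
            # Bucketing runs against the unanalyzed `keyword` sub-field.
            "terms": {"field": "my_field.keyword"}
        }
    },
}

# Body for `GET my-index-000001/_search`
print(json.dumps(search_body, indent=2))
```

Because the aggregation targets `my_field.keyword`, it does not require `fielddata` to be enabled on `my_field`.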

[[enable-fielddata-text-fields]]
==== Enabling fielddata on `text` fields

You can enable fielddata on an existing `text` field using the
<<indices-put-mapping,PUT mapping API>> as follows:

[source,console]
-----------------------------------
PUT my-index-000001/_mapping
{
  "properties": {
    "my_field": { <1>
      "type": "text",
      "fielddata": true
    }
  }
}
-----------------------------------
// TEST[continued]

<1> The mapping that you specify for `my_field` should consist of the existing
mapping for that field, plus the `fielddata` parameter.

[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to only load terms whose document frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 0.001,
          "max": 0.1,
          "min_segment_size": 500
        }
      }
    }
  }
}
--------------------------------------------------
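
The way `min` and `max` are read can be sketched as follows. This illustrates the rule described above and is not Elasticsearch's implementation: the helper names and the sample segment are hypothetical, treating both bounds as inclusive is an assumption, and `min_segment_size` is left out.

```python
def resolve_threshold(value, docs_with_field):
    """Turn a `min` or `max` setting into a document count.

    Values above 1.0 are absolute counts; values up to 1.0 are fractions of
    the documents in the segment that have a value for the field.
    """
    if value > 1.0:
        return value
    return value * docs_with_field


def terms_to_load(doc_freqs, min_value, max_value, docs_with_field):
    """Return the terms whose per-segment document frequency is in range.

    Inclusive bounds are an assumption made for this sketch.
    """
    low = resolve_threshold(min_value, docs_with_field)
    high = resolve_threshold(max_value, docs_with_field)
    return [term for term, freq in doc_freqs.items() if low <= freq <= high]


# One hypothetical segment in which 1,000 documents have a value for `tag`.
segment_doc_freqs = {"rare": 1, "common": 80, "everywhere": 900}
print(terms_to_load(segment_doc_freqs, 0.01, 0.5, 1000))
```

Because frequency is calculated per segment, the same term can pass the filter in one segment and be dropped in another.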
6 changes: 3 additions & 3 deletions docs/reference/modules/indices/circuit_breaker.asciidoc
@@ -33,9 +33,9 @@ The parent-level breaker can be configured with the following settings:
[discrete]
==== Field data circuit breaker
The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default the limit is configured
to 40% of the maximum JVM heap. It can be configured with the following
memory a field will require to be loaded into the <<modules-fielddata, field data cache>>.
It can then prevent the field data loading by raising an exception. By default the
limit is configured to 40% of the maximum JVM heap. It can be configured with the following
parameters:
jrodewig marked this conversation as resolved.

[[fielddata-circuit-breaker-limit]]