[DOCS] Clarify field data cache behavior #64375

wylieconlon · 2020-10-29T19:01:26Z

After talking with @jtibshirani about how the field data cache is used in practice, I have created this PR to clarify all of the small issues I saw in the docs about this:

The field data cache docs did not say that global ordinals were included in the cache size
The cache docs did not describe the need to clear the cache if the limit is reached
The cache docs did not recommend setting an explicit limit Enhancement on indices.fielddata.cache.size #59829
The LRU behavior of the cache under certain conditions was not documented
The _id field docs previously said they supported aggregation, which was changed in 7.0 Add a cluster setting to disallow loading fielddata on _id field #49166
There was a separate page for the fielddata mapping parameter, but it was easier to understand by combining with the docs on the text field.
Lack of cross-linking in general

elasticmachine · 2020-10-29T19:01:28Z

Pinging @elastic/es-docs (>docs)

elasticmachine · 2020-10-29T19:01:28Z

Pinging @elastic/es-docs (:Docs)

docs/plugins/mapper-annotated-text.asciidoc

wylieconlon · 2020-11-02T16:19:27Z

A significant portion of this doc is based on the limiting memory usage section of the definitive guide. I think I ended up writing slightly different advice than what @polyfractal wrote there, comparing this text from the guide:

By default, this setting is unbounded—Elasticsearch will never evict data from fielddata. This default was chosen deliberately... A bounded size forces the data structure to evict data. ... However, with the default settings, the fielddata from the old indices is never evicted! fielddata will just keep on growing until you trip the fielddata circuit breaker (see Circuit Breaker), which will prevent you from loading any more fielddata.

to what I wrote:

The default cache size is unlimited, causing the cache to grow until it
reaches the limit set by the field data circuit breaker.
It is recommended to set a cache size limit that is smaller than the circuit breaker
value. Setting the limit will cause the cache to behave as a least-recently-updated
cache, only keeping the most recently requested field data.

It seems like the definitive guide recommends setting a limit on time-based indices, while I was thinking we wanted to recommend it on all indices?

jrodewig · 2020-11-02T16:26:53Z

docs/reference/mapping/params/fielddata.asciidoc

-[[fielddata]]
-=== `fielddata`


It's best practice to add redirects for any removed pages. I added a redirect for this page with 391ab35 and c121d75

polyfractal · 2020-11-02T16:40:49Z

Haven't looked closely at the PR yet, will be free in a few minutes to take a closer look.

But a note/context about the Definitive Guide advice: it was written at a time when most aggregations worked via field data. Doc values were very new at the time and not widely used yet, so most aggregations were purely on-heap in field data and OOMs were common due to people setting sub-optimal field data settings (e.g. a lot of people used the time-based eviction, or tried to set limits which could be ignored in a variety of circumstances due to how things worked, etc).

So just a big caveat to the "old" advice, it's pretty dated at this point and specific to the environment at the time. So no worries if the advice we write here conflicts with what the DefGuide says. :)

jtibshirani

Thanks @wylieconlon for this contribution, it looks great overall. I mostly left small comments related to wording.

It seems like the definitive guide recommends setting a limit on time-based indices, while I was thinking we wanted to recommend it on all indices?

We're actually still finalizing the recommendation/ plans for field data cache eviction, so I think it'd be best to avoid giving a strong recommendation in this PR.

jtibshirani · 2020-10-30T18:00:57Z

docs/reference/how-to/search-speed.asciidoc

+the cache before requests are received. This should be done carefully because it
+will increase heap usage and delay indexing until the cache is created. This can
+be set dynamically on an existing mapping by setting the
+<<eager-global-ordinals, eager global ordinals>> mappping parameter:


A few small comments to make the language more precise:

Saying "in order to increase aggregation speed" isn't quite accurate because global ordinals help a lot with memory usage. Maybe we could say "Global ordinals are a data structure that are used to optimize the performance of certain aggregations?"

"you can tell {es} to add to the cache" -> "you can tell {es} to construct and cache the global ordinals..."

"until the cache is created" -> "until the global ordinals are constructed"

mappping -> mapping

jtibshirani · 2020-10-30T18:01:44Z

docs/reference/mapping/fields/id-field.asciidoc

-
+The `_id` field is by default not available by default for use with aggregations or sorting.
+To aggregate or sort by the `_id` field, it is recommended to 
+duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.


Really small comment, the link text is usually just the parameter name: <<copy-to, `copy_to`>>

I just realized that it's not possible to use copy_to with the _id mapper, so this only works for custom IDs where the user can manually duplicate the ID into another field.

Okay I'll clarify that, that wasn't clear in the original text. Going to move this entire section to the top.

jtibshirani · 2020-10-30T18:05:04Z

docs/reference/mapping/types/text.asciidoc

@@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields:
 <<mapping-field-meta,`meta`>>::

    Metadata about the field.
+
+[[fielddata]]
+==== `fielddata`


I like this consolidation, it makes it clear this description only applies to text.

jtibshirani · 2020-10-30T21:14:48Z

docs/reference/mapping/fields/id-field.asciidoc

+To aggregate or sort by the `_id` field, it is recommended to 
+duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.
+
+It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>,


Since we soon plan to entirely remove the ability to sort/ aggregate on _id, I think it'd be best not to mention the cluster setting. It mostly just helps with the 7.x -> 8.x upgrade.

It looks like we forgot to mention _id field data in the 8.0 breaking changes docs though. I can fix that in a follow-up.

jtibshirani · 2020-11-02T17:32:28Z

docs/reference/mapping/params/eager-global-ordinals.asciidoc

+cardinality values can use a significant amount of heap memory, and
+could exceed the threshold of the
+<<fielddata-circuit-breaker, field data circuit breaker>>.
+It is recommended to set a specific limit for the field data cache size.


We are actually still discussing this recommendation in #59829, perhaps we could hold off on adding this sentence until we have a conclusion.

Also maybe "Aggregations that include high cardinality values" -> "Aggregations on high cardinality fields" ?

jtibshirani · 2020-11-02T18:46:51Z

docs/reference/mapping/types/parent-join.asciidoc


 The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance
 is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a
-significant tax to your query performance.
+significant tax to your query performance. It also increases the usage of the JVM heap on the


I'm not actually sure it's a significant contributor to heap usage, since only one join mapping is allowed per index? But it's helpful to know that it produces field data, maybe we could just say "It can also trigger <<eager-global-ordinals, global ordinals>> to be built."

jtibshirani · 2020-11-02T18:50:58Z

docs/reference/modules/indices/fielddata.asciidoc

+The field data cache is an in-memory data structure, built on demand
+based on the type of query that is being run. It contains both
+<<fielddata, `fielddata`>> and <<eager-global-ordinals, global ordinals>>,
+which serve similar functions for different types of queries.


I find the new wording a little confusing: we say it's built on-demand but then below say "other than fields where the cache is built ahead of time".

Maybe we could open with "The field data cache contains field data and global ordinals, which are both used to support aggregations on certain field types. Since these are on-heap data structures, it is important to monitor the cache's use ..." This avoids a discussion of when the field data is built, for details users can refer to eager-global-ordinals.

jtibshirani · 2020-11-02T18:54:53Z

docs/reference/modules/indices/fielddata.asciidoc

-data cache can be controlled using `indices.fielddata.cache.size`. Note:
-reloading  the field data which does not fit into your cache will be expensive
-and  perform poorly.
+Other than fields where the cache is built ahead of time, it is populated as needed


This section duplicates the one from eager-global-ordinals -- I think it'd be good to avoid this duplication, otherwise the sections can easily become inconsistent. Maybe the newly-added link to eager-global-ordinals above is enough?

jtibshirani · 2020-11-02T18:56:27Z

docs/reference/modules/indices/fielddata.asciidoc

+
+The default cache size is unlimited, causing the cache to grow until it
+reaches the limit set by the <<fielddata-circuit-breaker, field data circuit breaker>>.
+It is recommended to set a cache size limit that is smaller than the circuit breaker


Same comment here about the recommendation, we are still discussing it so maybe we could hold off for now. The other information you added here is great though!

jrodewig

Ditto on @jtibshirani's comments. I left some other minor suggestions but nothing major.

Thanks for raising this @wylieconlon. I think moving the field data params under text makes a lot of sense and will help users.

docs/reference/modules/indices/fielddata.asciidoc

docs/reference/modules/indices/circuit_breaker.asciidoc

jtibshirani

This looks good to me, just some last minor comments.

docs/reference/mapping/fields/id-field.asciidoc

jtibshirani · 2020-11-03T19:44:09Z

docs/reference/mapping/fields/id-field.asciidoc

@@ -16,30 +20,23 @@ PUT my-index-000001/_doc/1
  "text": "Document with ID 1"
 }

-PUT my-index-000001/_doc/2?refresh=true
+POST my-index-000001/_doc/?refresh=true


I think we should undo this change, it feels weird to be mixing auto-generated and custom IDs ?

docs/reference/modules/indices/fielddata.asciidoc

polyfractal · 2020-11-03T20:16:37Z

Whoops, I spaced out and forgot to come back to this PR. ++ to @jtibshirani's comments, and I think it looks good overall.

I wonder if we should make a more explicit note about how to think about global ordinals? E.g. if query latency is your top priority, it makes sense to pay the global ord cost up front at indexing time so that users don't hit a query pause while global ords build. Usually folks who require absolute best query latency also have relatively static indices so this works well.

But if your index is changing rapidly, you end up redoing the same global ordinal work over and over as segments are created/merged. So this can make it potentially very expensive for indices that are changing a lot (most notably indices with a lot of updates/deletes, and to a lesser-extent append-only).

I'm not sure a good/easy way to phrase that in the docs. The current text alludes to all the above by describing how it works, but it might be useful to write it out a bit more long form? Not sure. I've seen eager-global-ords used as sort of "tweak this, expect miracles" setting when it's really not... just shifting around when/where work is done :)

jtibshirani · 2020-11-03T20:25:35Z

I wonder if we should make a more explicit note about how to think about global ordinals?

+1 to this suggestion, but maybe we should save it for a follow-up to keep this PR focused on the fielddata cache? The PR already has a large scope.

jrodewig

LGTM. I left some minor editorial suggestions, but nothing major. Thanks again!

docs/reference/modules/indices/fielddata.asciidoc

docs/reference/mapping/fields/id-field.asciidoc

docs/reference/mapping/params/eager-global-ordinals.asciidoc

polyfractal · 2020-11-03T21:59:26Z

+1 to this suggestion, but maybe we should save it for a follow-up to keep this PR focused on the fielddata cache? The PR scope already has a large scope.

Makes sense to me!

elasticmachine · 2020-11-03T22:50:26Z

Pinging @elastic/es-search (:Search/Search)

elasticmachine · 2020-11-03T22:50:26Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

docs/reference/how-to/search-speed.asciidoc

docs/reference/mapping/fields/id-field.asciidoc

docs/reference/modules/indices/fielddata.asciidoc

jtibshirani · 2020-11-20T19:08:16Z

I checked with @wylieconlon, and he was fine with me pushing a last round of fixes then merging.

* Clarify that field data cache includes global ordinals * Describe that the cache should be cleared once the limit is reached * Clarify that the `_id` field does not supported aggregations anymore * Fold the `fielddata` mapping parameter page into the `text field docs * Improve cross-linking

Clarify field data cache behavior

7430993

wylieconlon added >docs General docs changes :Docs Team:Docs Meta label for docs team labels Oct 29, 2020

wylieconlon requested review from jpountz and jtibshirani October 29, 2020 19:01

jrodewig reviewed Oct 29, 2020

View reviewed changes

docs/plugins/mapper-annotated-text.asciidoc Outdated Show resolved Hide resolved

Update docs/plugins/mapper-annotated-text.asciidoc

57411de

wylieconlon requested a review from jrodewig October 30, 2020 21:06

jrodewig changed the title ~~Clarify field data cache behavior in docs~~ [DOCS] Clarify field data cache behavior Nov 2, 2020

wylieconlon requested a review from polyfractal November 2, 2020 16:12

[DOCS] Add redirect for fielddata

391ab35

jrodewig added 2 commits November 2, 2020 11:24

[DOCS] Add anchor for redirect

c121d75

Merge branch 'master' into clarify-fielddata

87e148a

jrodewig reviewed Nov 2, 2020

View reviewed changes

jtibshirani reviewed Nov 2, 2020

View reviewed changes

jrodewig reviewed Nov 2, 2020

View reviewed changes

docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved

docs/reference/modules/indices/circuit_breaker.asciidoc Outdated Show resolved Hide resolved

Review comments

784de58

wylieconlon requested review from jrodewig and jtibshirani November 3, 2020 19:09

jtibshirani approved these changes Nov 3, 2020

View reviewed changes

jrodewig approved these changes Nov 3, 2020

View reviewed changes

jtibshirani removed the :Docs label Nov 3, 2020

jtibshirani added :Analytics/Aggregations Aggregations :Search/Search Search-related issues that do not fall into other categories labels Nov 3, 2020

elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team labels Nov 3, 2020

jrodewig added v7.10.1 v7.11.0 v7.9.4 v8.0.0 labels Nov 3, 2020

jpountz reviewed Nov 4, 2020

View reviewed changes

docs/reference/how-to/search-speed.asciidoc Outdated Show resolved Hide resolved

docs/reference/mapping/fields/id-field.asciidoc Outdated Show resolved Hide resolved

docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved

jtibshirani mentioned this pull request Nov 4, 2020

Document breaking change around _id field data. #64610

Merged

Address review comments.

b83b080

jtibshirani merged commit 10ee0f2 into elastic:master Nov 20, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DOCS] Clarify field data cache behavior #64375

[DOCS] Clarify field data cache behavior #64375

wylieconlon commented Oct 29, 2020

elasticmachine commented Oct 29, 2020

elasticmachine commented Oct 29, 2020

wylieconlon commented Nov 2, 2020

jrodewig Nov 2, 2020 •

edited

Loading

polyfractal commented Nov 2, 2020

jtibshirani left a comment

jtibshirani Oct 30, 2020

jtibshirani Oct 30, 2020

jtibshirani Nov 2, 2020 •

edited

Loading

wylieconlon Nov 3, 2020

jtibshirani Oct 30, 2020

jtibshirani Oct 30, 2020

jtibshirani Nov 2, 2020

jtibshirani Nov 2, 2020

jtibshirani Nov 2, 2020 •

edited

Loading

jtibshirani Nov 2, 2020

jtibshirani Nov 2, 2020

jrodewig left a comment •

edited

Loading

jtibshirani left a comment

jtibshirani Nov 3, 2020

polyfractal commented Nov 3, 2020

jtibshirani commented Nov 3, 2020 •

edited

Loading

jrodewig left a comment

polyfractal commented Nov 3, 2020

elasticmachine commented Nov 3, 2020

elasticmachine commented Nov 3, 2020

jtibshirani commented Nov 20, 2020

		[[fielddata]]
		=== `fielddata`

[DOCS] Clarify field data cache behavior #64375

[DOCS] Clarify field data cache behavior #64375

Conversation

wylieconlon commented Oct 29, 2020

elasticmachine commented Oct 29, 2020

elasticmachine commented Oct 29, 2020

wylieconlon commented Nov 2, 2020

jrodewig Nov 2, 2020 • edited Loading

Choose a reason for hiding this comment

polyfractal commented Nov 2, 2020

jtibshirani left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Nov 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Nov 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jrodewig left a comment • edited Loading

Choose a reason for hiding this comment

jtibshirani left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

polyfractal commented Nov 3, 2020

jtibshirani commented Nov 3, 2020 • edited Loading

jrodewig left a comment

Choose a reason for hiding this comment

polyfractal commented Nov 3, 2020

elasticmachine commented Nov 3, 2020

elasticmachine commented Nov 3, 2020

jtibshirani commented Nov 20, 2020

jrodewig Nov 2, 2020 •

edited

Loading

jtibshirani Nov 2, 2020 •

edited

Loading

jtibshirani Nov 2, 2020 •

edited

Loading

jrodewig left a comment •

edited

Loading

jtibshirani commented Nov 3, 2020 •

edited

Loading