Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS] Clarify field data cache behavior #64375

Merged
merged 7 commits into from
Nov 20, 2020

Conversation

wylieconlon
Copy link

After talking with @jtibshirani about how the field data cache is used in practice, I have created this PR to clarify all of the small issues I saw in the docs about this:

  1. The field data cache docs did not say that global ordinals were included in the cache size
  2. The cache docs did not describe the need to clear the cache if the limit is reached
  3. The cache docs did not recommend setting an explicit limit Enhancement on indices.fielddata.cache.size #59829
  4. The LRU behavior of the cache under certain conditions was not documented
  5. The _id field docs previously said they supported aggregation, which was changed in 7.0 Add a cluster setting to disallow loading fielddata on _id field #49166
  6. There was a separate page for the fielddata mapping parameter, but it was easier to understand by combining with the docs on the text field.
  7. Lack of cross-linking in general

@wylieconlon wylieconlon added >docs General docs changes :Docs Team:Docs Meta label for docs team labels Oct 29, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-docs (:Docs)

@wylieconlon wylieconlon requested a review from jrodewig October 30, 2020 21:06
@jrodewig jrodewig changed the title Clarify field data cache behavior in docs [DOCS] Clarify field data cache behavior Nov 2, 2020
@wylieconlon
Copy link
Author

A significant portion of this doc is based on the limiting memory usage section of the definitive guide. I think I ended up writing slightly different advice than what @polyfractal wrote there, comparing this text from the guide:

By default, this setting is unbounded—Elasticsearch will never evict data from fielddata. This default was chosen deliberately... A bounded size forces the data structure to evict data. ... However, with the default settings, the fielddata from the old indices is never evicted! fielddata will just keep on growing until you trip the fielddata circuit breaker (see Circuit Breaker), which will prevent you from loading any more fielddata.

to what I wrote:

The default cache size is unlimited, causing the cache to grow until it
reaches the limit set by the field data circuit breaker.
It is recommended to set a cache size limit that is smaller than the circuit breaker
value. Setting the limit will cause the cache to behave as a least-recently-updated
cache, only keeping the most recently requested field data.

It seems like the definitive guide recommends setting a limit on time-based indices, while I was thinking we wanted to recommend it on all indices?

Comment on lines -1 to -2
[[fielddata]]
=== `fielddata`
Copy link
Contributor

@jrodewig jrodewig Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best practice to add redirects for any removed pages. I added a redirect for this page with 391ab35 and c121d75

@polyfractal
Copy link
Contributor

Haven't looked closely at the PR yet, will be free in a few minutes to take a closer look.

But a note/context about the Definitive Guide advice: it was written at a time when most aggregations worked via field data. Doc values were very new at the time and not widely used yet, so most aggregations were purely on-heap in field data and OOMs were common due to people setting sub-optimal field data settings (e.g. a lot of people used the time-based eviction, or tried to set limits which could be ignored in a variety of circumstances due to how things worked, etc).

So just a big caveat to the "old" advice, it's pretty dated at this point and specific to the environment at the time. So no worries if the advice we write here conflicts with what the DefGuide says. :)

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wylieconlon for this contribution, it looks great overall. I mostly left small comments related to wording.

It seems like the definitive guide recommends setting a limit on time-based indices, while I was thinking we wanted to recommend it on all indices?

We're actually still finalizing the recommendation/ plans for field data cache eviction, so I think it'd be best to avoid giving a strong recommendation in this PR.

the cache before requests are received. This should be done carefully because it
will increase heap usage and delay indexing until the cache is created. This can
be set dynamically on an existing mapping by setting the
<<eager-global-ordinals, eager global ordinals>> mappping parameter:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments to make the language more precise:

  • Saying "in order to increase aggregation speed" isn't quite accurate because global ordinals help a lot with memory usage. Maybe we could say "Global ordinals are a data structure that are used to optimize the performance of certain aggregations?"
  • "you can tell {es} to add to the cache" -> "you can tell {es} to construct and cache the global ordinals..."
  • "until the cache is created" -> "until the global ordinals are constructed"
  • mappping -> mapping


The `_id` field is by default not available by default for use with aggregations or sorting.
To aggregate or sort by the `_id` field, it is recommended to
duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really small comment, the link text is usually just the parameter name: <<copy-to, `copy_to`>>

Copy link
Contributor

@jtibshirani jtibshirani Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that it's not possible to use copy_to with the _id mapper, so this only works for custom IDs where the user can manually duplicate the ID into another field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I'll clarify that, that wasn't clear in the original text. Going to move this entire section to the top.

@@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields:
<<mapping-field-meta,`meta`>>::

Metadata about the field.

[[fielddata]]
==== `fielddata`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this consolidation, it makes it clear this description only applies to text.

To aggregate or sort by the `_id` field, it is recommended to
duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>.

It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we soon plan to entirely remove the ability to sort/ aggregate on _id, I think it'd be best not to mention the cluster setting. It mostly just helps with the 7.x -> 8.x upgrade.

It looks like we forgot to mention _id field data in the 8.0 breaking changes docs though. I can fix that in a follow-up.

cardinality values can use a significant amount of heap memory, and
could exceed the threshold of the
<<fielddata-circuit-breaker, field data circuit breaker>>.
It is recommended to set a specific limit for the field data cache size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are actually still discussing this recommendation in #59829, perhaps we could hold off on adding this sentence until we have a conclusion.

Also maybe "Aggregations that include high cardinality values" -> "Aggregations on high cardinality fields" ?


The join field shouldn't be used like joins in a relation database. In Elasticsearch the key to good performance
is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a
significant tax to your query performance.
significant tax to your query performance. It also increases the usage of the JVM heap on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure it's a significant contributor to heap usage, since only one join mapping is allowed per index? But it's helpful to know that it produces field data, maybe we could just say "It can also trigger <<eager-global-ordinals, global ordinals>> to be built."

The field data cache is an in-memory data structure, built on demand
based on the type of query that is being run. It contains both
<<fielddata, `fielddata`>> and <<eager-global-ordinals, global ordinals>>,
which serve similar functions for different types of queries.
Copy link
Contributor

@jtibshirani jtibshirani Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the new wording a little confusing: we say it's built on-demand but then below say "other than fields where the cache is built ahead of time".

Maybe we could open with "The field data cache contains field data and global ordinals, which are both used to support aggregations on certain field types. Since these are on-heap data structures, it is important to monitor the cache's use ..." This avoids a discussion of when the field data is built, for details users can refer to eager-global-ordinals.

data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading the field data which does not fit into your cache will be expensive
and perform poorly.
Other than fields where the cache is built ahead of time, it is populated as needed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section duplicates the one from eager-global-ordinals -- I think it'd be good to avoid this duplication, otherwise the sections can easily become inconsistent. Maybe the newly-added link to eager-global-ordinals above is enough?


The default cache size is unlimited, causing the cache to grow until it
reaches the limit set by the <<fielddata-circuit-breaker, field data circuit breaker>>.
It is recommended to set a cache size limit that is smaller than the circuit breaker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about the recommendation, we are still discussing it so maybe we could hold off for now. The other information you added here is great though!

Copy link
Contributor

@jrodewig jrodewig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on @jtibshirani's comments. I left some other minor suggestions but nothing major.

Thanks for raising this @wylieconlon. I think moving the field data params under text makes a lot of sense and will help users.

docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved
docs/reference/modules/indices/circuit_breaker.asciidoc Outdated Show resolved Hide resolved
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, just some last minor comments.

docs/reference/mapping/fields/id-field.asciidoc Outdated Show resolved Hide resolved
@@ -16,30 +20,23 @@ PUT my-index-000001/_doc/1
"text": "Document with ID 1"
}

PUT my-index-000001/_doc/2?refresh=true
POST my-index-000001/_doc/?refresh=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should undo this change, it feels weird to be mixing auto-generated and custom IDs ?

docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved
@polyfractal
Copy link
Contributor

Whoops, I spaced out and forgot to come back to this PR. ++ to @jtibshirani's comments, and I think it looks good overall.

I wonder if we should make a more explicit note about how to think about global ordinals? E.g. if query latency is your top priority, it makes sense to pay the global ord cost up front at indexing time so that users don't hit a query pause while global ords build. Usually folks who require absolute best query latency also have relatively static indices so this works well.

But if your index is changing rapidly, you end up redoing the same global ordinal work over and over as segments are created/merged. So this can make it potentially very expensive for indices that are changing a lot (most notably indices with a lot of updates/deletes, and to a lesser-extent append-only).

I'm not sure a good/easy way to phrase that in the docs. The current text alludes to all the above by describing how it works, but it might be useful to write it out a bit more long form? Not sure. I've seen eager-global-ords used as sort of "tweak this, expect miracles" setting when it's really not... just shifting around when/where work is done :)

@jtibshirani
Copy link
Contributor

jtibshirani commented Nov 3, 2020

I wonder if we should make a more explicit note about how to think about global ordinals?

+1 to this suggestion, but maybe we should save it for a follow-up to keep this PR focused on the fielddata cache? The PR already has a large scope.

Copy link
Contributor

@jrodewig jrodewig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left some minor editorial suggestions, but nothing major. Thanks again!

docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved
docs/reference/mapping/fields/id-field.asciidoc Outdated Show resolved Hide resolved
@polyfractal
Copy link
Contributor

+1 to this suggestion, but maybe we should save it for a follow-up to keep this PR focused on the fielddata cache? The PR scope already has a large scope.

Makes sense to me!

@jtibshirani jtibshirani removed the :Docs label Nov 3, 2020
@jtibshirani jtibshirani added :Analytics/Aggregations Aggregations :Search/Search Search-related issues that do not fall into other categories labels Nov 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team labels Nov 3, 2020
docs/reference/how-to/search-speed.asciidoc Outdated Show resolved Hide resolved
docs/reference/mapping/fields/id-field.asciidoc Outdated Show resolved Hide resolved
docs/reference/modules/indices/fielddata.asciidoc Outdated Show resolved Hide resolved
@jtibshirani
Copy link
Contributor

I checked with @wylieconlon, and he was fine with me pushing a last round of fixes then merging.

@jtibshirani jtibshirani merged commit 10ee0f2 into elastic:master Nov 20, 2020
jtibshirani pushed a commit that referenced this pull request Nov 20, 2020
* Clarify that field data cache includes global ordinals
* Describe that the cache should be cleared once the limit is reached
* Clarify that the `_id` field does not supported aggregations anymore
* Fold the `fielddata` mapping parameter page into the `text field docs
* Improve cross-linking
jtibshirani pushed a commit that referenced this pull request Nov 20, 2020
* Clarify that field data cache includes global ordinals
* Describe that the cache should be cleared once the limit is reached
* Clarify that the `_id` field does not supported aggregations anymore
* Fold the `fielddata` mapping parameter page into the `text field docs
* Improve cross-linking
jtibshirani pushed a commit that referenced this pull request Nov 20, 2020
* Clarify that field data cache includes global ordinals
* Describe that the cache should be cleared once the limit is reached
* Clarify that the `_id` field does not supported aggregations anymore
* Fold the `fielddata` mapping parameter page into the `text field docs
* Improve cross-linking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >docs General docs changes :Search/Search Search-related issues that do not fall into other categories Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Docs Meta label for docs team Team:Search Meta label for search team v7.9.4 v7.10.1 v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants