From f1cbaa5f6aa0453f017b450905e81880e8973f16 Mon Sep 17 00:00:00 2001 From: Julie Tibshirani Date: Tue, 10 Sep 2019 11:01:45 -0700 Subject: [PATCH] Expand documentation around global ordinals. (#46517) This commit updates the eager_global_ordinals documentation to give more background on what global ordinals are and when they are used. The docs also now mention that global ordinal loading may be expensive, and describes the cases where in which loading them can be avoided. --- .../params/eager-global-ordinals.asciidoc | 137 +++++++++++------- 1 file changed, 83 insertions(+), 54 deletions(-) diff --git a/docs/reference/mapping/params/eager-global-ordinals.asciidoc b/docs/reference/mapping/params/eager-global-ordinals.asciidoc index 162049ec1323f..0224c9afbe599 100644 --- a/docs/reference/mapping/params/eager-global-ordinals.asciidoc +++ b/docs/reference/mapping/params/eager-global-ordinals.asciidoc @@ -1,37 +1,52 @@ [[eager-global-ordinals]] === `eager_global_ordinals` -Global ordinals is a data-structure on top of doc values, that maintains an -incremental numbering for each unique term in a lexicographic order. Each -term has a unique number and the number of term 'A' is lower than the -number of term 'B'. Global ordinals are only supported with -<> and <> fields. In `keyword` fields, they -are available by default but `text` fields can only use them when `fielddata`, -with all of its associated baggage, is enabled. - -Doc values (and fielddata) also have ordinals, which is a unique numbering for -all terms in a particular segment and field. Global ordinals just build on top -of this, by providing a mapping between the segment ordinals and the global -ordinals, the latter being unique across the entire shard. Given that global -ordinals for a specific field are tied to _all the segments of a shard_, they -need to be entirely rebuilt whenever a once new segment becomes visible. - -Global ordinals are used for features that use segment ordinals, such as -the <>, -to improve the execution time. A terms aggregation relies purely on global -ordinals to perform the aggregation at the shard level, then converts global -ordinals to the real term only for the final reduce phase, which combines -results from different shards. - -The loading time of global ordinals depends on the number of terms in a field, -but in general it is low, since it source field data has already been loaded. -The memory overhead of global ordinals is a small because it is very -efficiently compressed. - -By default, global ordinals are loaded at search-time, which is the right -trade-off if you are optimizing for indexing speed. However, if you are more -interested in search speed, it could be beneficial to set -`eager_global_ordinals: true` on fields that you plan to use in terms +==== What are global ordinals? + +To support aggregations and other operations that require looking up field +values on a per-document basis, Elasticsearch uses a data structure called +<>. Term-based field types such as `keyword` store +their doc values using an ordinal mapping for a more compact representation. +This mapping works by assigning each term an incremental integer or 'ordinal' +based on its lexicographic order. The field's doc values store only the +ordinals for each document instead of the original terms, with a separate +lookup structure to convert between ordinals and terms. + +When used during aggregations, ordinals can greatly improve performance. As an +example, the `terms` aggregation relies only on ordinals to collect documents +into buckets at the shard-level, then converts the ordinals back to their +original term values when combining results across shards. + +Each index segment defines its own ordinal mapping, but aggregations collect +data across an entire shard. So to be able to use ordinals for shard-level +operations like aggregations, Elasticsearch creates a unified mapping called +'global ordinals'. The global ordinal mapping is built on top of segment +ordinals, and works by maintaining a map from global ordinal to the local +ordinal for each segment. + +Global ordinals are used if a search contains any of the following components: + +* Bucket aggregations on `keyword` and `flattened` fields. This includes +`terms` aggregations as mentioned above, as well as `composite`, `sampler`, +and `significant_terms`. +* Bucket aggregations on `text` fields that require <> +to be enabled. +* Operations on parent and child documents from a `join` field, including +`has_child` queries and `parent` aggregations. + +NOTE: The global ordinal mapping is an on-heap data structure. When measuring +memory usage, Elasticsearch counts the memory from global ordinals as +'fielddata'. Global ordinals memory is included in the +<>, and is returned +under `fielddata` in the <> response. + +==== Loading global ordinals + +The global ordinal mapping must be built before ordinals can be used during a +search. By default, the mapping is loaded during search on the first time that +global ordinals are needed. This is is the right approach if you are optimizing +for indexing speed, but if search performance is a priority, it's recommended +to eagerly load global ordinals eagerly on fields that will be used in aggregations: [source,js] @@ -49,29 +64,14 @@ PUT my_index/_mapping // CONSOLE // TEST[s/^/PUT my_index\n/] -This will shift the cost of building the global ordinals from search-time to -refresh-time. Elasticsearch will make sure that global ordinals are built -before exposing to searches any changes to the content of the index. -Elasticsearch will also eagerly build global ordinals when starting a new copy -of a shard, such as when increasing the number of replicas or when relocating a -shard onto a new node. - -If a shard has been <> down to a single -segment then its global ordinals are identical to the ordinals for its unique -segment, which means there is no extra cost for using global ordinals on such a -shard. Note that for performance reasons you should only force-merge an index -to which you will never write again. - -On a <>, global ordinals are discarded after each -search and rebuilt again on the next search if needed or if -`eager_global_ordinals` is set. This means `eager_global_ordinals` should not -be used on frozen indices. Instead, force-merge an index to a single segment -before freezing it so that global ordinals need not be built separately on each -search. - -If you ever decide that you do not need to run `terms` aggregations on this -field anymore, then you can disable eager loading of global ordinals at any -time: +When `eager_global_ordinals` is enabled, global ordinals are built when a shard +is <> -- Elasticsearch always loads them before +exposing changes to the content of the index. This shifts the cost of building +global ordinals from search to index-time. Elasticsearch will also eagerly +build global ordinals when creating a new copy of a shard, as can occur when +increasing the number of replicas or relocating a shard onto a new node. + +Eager loading can be disabled at any time by updating the `eager_global_ordinals` setting: [source,js] ------------ @@ -88,3 +88,32 @@ PUT my_index/_mapping // CONSOLE // TEST[continued] +IMPORTANT: On a <>, global ordinals are discarded +after each search and rebuilt again when they're requested. This means that +`eager_global_ordinals` should not be used on frozen indices: it would +cause global ordinals to be reloaded on every search. Instead, the index should +be force-merged to a single segment before being frozen. This avoids building +global ordinals altogether (more details can be found in the next section). + +==== Avoiding global ordinal loading + +Usually, global ordinals do not present a large overhead in terms of their +loading time and memory usage. However, loading global ordinals can be +expensive on indices with large shards, or if the fields contain a large +number of unique term values. Because global ordinals provide a unified mapping +for all segments on the shard, they also need to be rebuilt entirely when a new +segment becomes visible. + +In some cases it is possible to avoid global ordinal loading altogether: + +* The `terms`, `sampler`, and `significant_terms` aggregations support a +parameter +<> +that helps control how buckets are collected. It defaults to `global_ordinals`, +but can be set to `map` to instead use the term values directly. +* If a shard has been <> down to a single +segment, then its segment ordinals are already 'global' to the shard. In this +case, Elasticsearch does not need to build a global ordinal mapping and there +is no additional overhead from using global ordinals. Note that for performance +reasons you should only force-merge an index to which you will never write to +again.