Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework docs for the size of terms agg (backport of #79205) (#80223) #80226

Merged
merged 1 commit into from
Nov 2, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
118 changes: 62 additions & 56 deletions docs/reference/aggregations/bucket/terms-aggregation.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -101,9 +101,6 @@ Response:
<2> when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response
<3> the list of the top buckets, the meaning of `top` being defined by the <<search-aggregations-bucket-terms-aggregation-order,order>>

By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can
change this default behaviour by setting the `size` parameter.

[[search-aggregations-bucket-terms-aggregation-types]]
The `field` can be <<keyword>>, <<number>>, <<ip, `ip`>>, <<boolean, `boolean`>>,
or <<binary, `binary`>>.
Expand All @@ -117,59 +114,68 @@ memory usage.
[[search-aggregations-bucket-terms-aggregation-size]]
==== Size

The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top `size` term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
This means that if the number of unique terms is greater than `size`, the returned list is slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).
By default, the `terms` aggregation returns the top ten terms with the most
documents. Use the `size` parameter to return more terms, up to the
<<search-settings-max-buckets,search.max_buckets>> limit.

If your data contains 100 or 1000 unique terms, you can increase the `size` of
the `terms` aggregation to return them all. If you have more unique terms and
you need them all, use the
<<search-aggregations-bucket-composite-aggregation,composite aggregation>>
instead.

NOTE: If you want to retrieve **all** terms or all combinations of terms in a nested `terms` aggregation
you should use the <<search-aggregations-bucket-composite-aggregation,Composite>> aggregation which
allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the
`terms` aggregation. The `terms` aggregation is meant to return the `top` terms and does not allow pagination.
Larger values of `size` use more memory to compute and, push the whole
aggregation close to the `max_buckets` limit. You'll know you've gone too large
if the request fails with a message about `max_buckets`.

[[search-aggregations-bucket-terms-aggregation-shard-size]]
==== Shard Size

The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).
To get more accurate results, the `terms` agg fetches more than
the top `size` terms from each shard. It fetches the top `shard_size` terms,
which defaults to `size * 1.5 + 10`.

This is to handle the case when one term has many documents on one shard but is
just below the `size` threshold on all other shards. If each shard only
returned `size` terms, the aggregation would return an partial doc count for
the term. So `terms` returns more terms in an attempt to catch the missing
terms. This helps, but it's still quite possible to return a partial doc
count for a term. It just takes a term with more disparate per-shard doc counts.

The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
the client.
You can increase `shard_size` to better account for these disparate doc counts
and improve the accuracy of the selection of top terms. It is much cheaper to increase
the `shard_size` than to increase the `size`. However, it still takes more
bytes over the wire and waiting in memory on the coordinating node.

IMPORTANT: This guidance only applies if you're using the `terms` aggregation's
default sort `order`. If you're sorting by anything other than document count in
descending order, see <<search-aggregations-bucket-terms-aggregation-order>>.

NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will
override it and reset it to be equal to `size`.


The default `shard_size` is `(size * 1.5 + 10)`.

[[terms-agg-doc-count-error]]
==== Document count error

`doc_count` values for a `terms` aggregation may be approximate. As a result,
any sub-aggregations on the `terms` aggregation may also be approximate.

To calculate `doc_count` values, each shard provides its own top terms and
document counts. The aggregation combines these shard-level results to calculate
its final `doc_count` values. To measure the accuracy of `doc_count` values, the
aggregation results include the following properties:

`sum_other_doc_count`::
(integer) The total document count for any terms not included in the results.

`doc_count_error_upper_bound`::
(integer) The highest possible document count for any single term not included
in the results. If `0`, `doc_count` values are accurate.
`sum_other_doc_count` is the number of documents that didn't make it into the
the top `size` terms. If this is greater than `0`, you can be sure that the
`terms` agg had to throw away some buckets, either because they didn't fit into
`size` on the coordinating node or they didn't fit into `shard_size` on the
data node.

==== Per bucket document count error

To get the `doc_count_error_upper_bound` for each term, set
`show_term_doc_count_error` to `true`:
If you set the `show_term_doc_count_error` parameter to `true`, the `terms`
aggregation will include `doc_count_error_upper_bound`, which is an upper bound
to the error on the `doc_count` returned by each shard. It's the
sum of the size of the largest bucket on each shard that didn't fit into
`shard_size`.

In more concrete terms, imagine there is one bucket that is very large on one
shard and just outside the `shard_size` on all the other shards. In that case,
the `terms` agg will return the bucket because it is large, but it'll be missing
data from many documents on the shards where the term fell below the `shard_size` threshold.
`doc_count_error_upper_bound` is the maximum number of those missing documents.

[source,console]
--------------------------------------------------
Expand All @@ -189,10 +195,6 @@ GET /_search
// TEST[s/_search/_search\?filter_path=aggregations/]


This shows an error value for each term returned by the aggregation which represents the 'worst case' error in the document count
and can be useful when deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for
the last term returned by all shards which did not return the term.

These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard
does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the
Expand All @@ -202,19 +204,17 @@ determined and is given a value of -1 to indicate this.
[[search-aggregations-bucket-terms-aggregation-order]]
==== Order

The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. It is possible to change this behaviour as documented below:
By default, the `terms` aggregation orders terms by descending document `_count`.
Use the `order` parameter to specify a different sort order.

WARNING: Sorting by ascending `_count` or by sub aggregation is discouraged as it increases the
<<terms-agg-doc-count-error,error>> on document counts.
It is fine when a single shard is queried, or when the field that is being aggregated was used
as a routing key at index time: in these cases results will be accurate since shards have disjoint
values. However otherwise, errors are unbounded. One particular case that could still be useful
is sorting by <<search-aggregations-metrics-min-aggregation,`min`>> or
<<search-aggregations-metrics-max-aggregation,`max`>> aggregation: counts will not be accurate
but at least the top buckets will be correctly picked.
WARNING: Avoid using `"order": { "_count": "asc" }`. If you need to find rare
terms, use the
<<search-aggregations-bucket-rare-terms-aggregation,`rare_terms`>> aggregation
instead. Due to the way the `terms` aggregation
<<search-aggregations-bucket-terms-aggregation-shard-size,gets terms from
shards>>, sorting by ascending doc count often produces inaccurate results.

Ordering the buckets by their doc `_count` in an ascending manner:
If you really must, this is how you sort on doc count ascending:

[source,console]
--------------------------------------------------
Expand Down Expand Up @@ -248,7 +248,13 @@ GET /_search
}
--------------------------------------------------

deprecated[6.0.0, Use `_key` instead of `_term` to order buckets by their term]
WARNING: Test any sorts on sub-aggregations before using them in production.
Sorting on a sub-aggregation may return errors or inaccurate results. For
example, due to the way the `terms` aggregation
<<search-aggregations-bucket-terms-aggregation-shard-size,gets results from
shards>>, sorting on a `max` sub-aggregation in _ascending_ order often produces
inaccurate results. However, sorting on a `max` sub-aggregation in _descending_
order is typically safe.

Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):

Expand Down