Documentation notes for Range field histograms (#46890) (#47366)
not-napoleon authored Oct 1, 2019
1 parent 5ba543f commit 5bdf253
Showing 4 changed files with 202 additions and 7 deletions.
1 change: 1 addition & 0 deletions docs/reference/aggregations/bucket.asciidoc
@@ -67,3 +67,4 @@ include::bucket/significanttext-aggregation.asciidoc[]

include::bucket/terms-aggregation.asciidoc[]

include::bucket/range-field-note.asciidoc[]
@@ -3,7 +3,7 @@

This multi-bucket aggregation is similar to the normal
<<search-aggregations-bucket-histogram-aggregation,histogram>>, but it can
only be used with date values. Because dates are represented internally in
only be used with date or date range values. Because dates are represented internally in
Elasticsearch as long values, it is possible, but not as accurate, to use the
normal `histogram` on dates as well. The main difference between the two APIs is
that here the interval can be specified using date/time expressions. Time-based
25 changes: 19 additions & 6 deletions docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
@@ -1,19 +1,24 @@
[[search-aggregations-bucket-histogram-aggregation]]
=== Histogram Aggregation

A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted
from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the
documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with
interval `5` (in case of price it may represent $5). When the aggregation executes, the price field of every document
will be evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size
is `5` then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the
key `30`.
To make this more formal, here is the rounding function that is used:

[source,java]
--------------------------------------------------
bucket_key = Math.floor((value - offset) / interval) * interval + offset
--------------------------------------------------

For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same
way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
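
As an illustrative calculation (the numbers here are hypothetical, not drawn from an example in these docs): with an
`interval` of `5` and the default `offset` of `0`, a range of `[12, 23]` would be bucketed as follows:

[source,java]
--------------------------------------------------
// first bucket, computed from the lower bound of the range
first_bucket = Math.floor((12 - 0) / 5) * 5 + 0   // = 10
// final bucket, computed from the upper bound of the range
last_bucket  = Math.floor((23 - 0) / 5) * 5 + 0   // = 20
// the range is counted in every bucket from 10 to 20: 10, 15, and 20
--------------------------------------------------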

The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`
(a decimal greater than or equal to `0` and less than `interval`)

@@ -175,6 +180,14 @@ POST /sales/_search?size=0
--------------------------------------------------
// TEST[setup:sales]

When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include
buckets outside of a query's range. For example, if your query looks for values greater than 100, a matching document
has a range field covering 50 to 150, and the aggregation interval is 50, that document will land in 3 buckets - 50,
100, and 150. In general, it's best to think of the query and aggregation steps as independent - the query selects a
set of documents, and then the aggregation buckets those documents without regard to how they were selected.
See <<search-aggregations-bucket-range-field-note,note on bucketing range
fields>> for more information and an example.

==== Order

By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled using
181 changes: 181 additions & 0 deletions docs/reference/aggregations/bucket/range-field-note.asciidoc
@@ -0,0 +1,181 @@
[[search-aggregations-bucket-range-field-note]]
=== Subtleties of bucketing range fields

==== Documents are counted for each bucket they land in

Since a range represents multiple values, running a bucket aggregation over a
range field can result in the same document landing in multiple buckets. This
can lead to surprising behavior, such as the sum of bucket counts being higher
than the number of matched documents. For example, consider the following
index:
[source, console]
--------------------------------------------------
PUT range_index
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "expected_attendees": {
        "type": "integer_range"
      },
      "time_frame": {
        "type": "date_range",
        "format": "yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

PUT range_index/_doc/1?refresh
{
  "expected_attendees" : {
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : {
    "gte" : "2019-10-28",
    "lte" : "2019-11-04"
  }
}
--------------------------------------------------
// TESTSETUP

The range is wider than the interval in the following aggregation, and thus the
document will land in multiple buckets.

[source, console]
--------------------------------------------------
POST /range_index/_search?size=0
{
  "aggs" : {
    "range_histo" : {
      "histogram" : {
        "field" : "expected_attendees",
        "interval" : 5
      }
    }
  }
}
--------------------------------------------------

Since the interval is `5` (and the offset is `0` by default), we expect buckets `10`,
`15`, and `20`. Our range document will fall in all three of these buckets.

[source, console-result]
--------------------------------------------------
{
  ...
  "aggregations" : {
    "range_histo" : {
      "buckets" : [
        {
          "key" : 10.0,
          "doc_count" : 1
        },
        {
          "key" : 15.0,
          "doc_count" : 1
        },
        {
          "key" : 20.0,
          "doc_count" : 1
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

A document cannot exist partially in a bucket; for example, the above document cannot count as one-third in each of
the above three buckets. In this example, since the document's range landed in multiple buckets, the full value of
that document is also counted in any sub-aggregations for each of those buckets.
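
To make this concrete, a sub-aggregation can be attached to the histogram above. The following is only an illustrative
sketch (the sub-aggregation name `bucket_docs` and the use of `top_hits` are not part of the examples in this note):

[source, console]
--------------------------------------------------
POST /range_index/_search?size=0
{
  "aggs" : {
    "range_histo" : {
      "histogram" : {
        "field" : "expected_attendees",
        "interval" : 5
      },
      "aggs" : {
        "bucket_docs" : {
          "top_hits" : {
            "size" : 1
          }
        }
      }
    }
  }
}
--------------------------------------------------

Each of the buckets `10`, `15`, and `20` would report the same single hit, showing that the whole document, not a
fraction of it, feeds the sub-aggregation of every bucket it lands in.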

==== Query bounds are not aggregation filters

Another unexpected behavior can arise when a query is used to filter on the field being aggregated. In this case, a
document can match the query but still have one or both of the endpoints of its range outside the query's bounds.
Consider the following aggregation on the above document:

[source, console]
--------------------------------------------------
POST /range_index/_search?size=0
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2019-11-01",
        "format": "yyyy-MM-dd"
      }
    }
  },
  "aggs" : {
    "november_data" : {
      "date_histogram" : {
        "field" : "time_frame",
        "calendar_interval" : "day"
      }
    }
  }
}
--------------------------------------------------

Even though the query only considers days in November, the aggregation
generates 8 buckets (4 in October, 4 in November) because the aggregation is
calculated over the ranges of all matching documents.

[source, console-result]
--------------------------------------------------
{
  ...
  "aggregations" : {
    "november_data" : {
      "buckets" : [
        {
          "key" : 1572220800000,
          "doc_count" : 1
        },
        {
          "key" : 1572307200000,
          "doc_count" : 1
        },
        {
          "key" : 1572393600000,
          "doc_count" : 1
        },
        {
          "key" : 1572480000000,
          "doc_count" : 1
        },
        {
          "key" : 1572566400000,
          "doc_count" : 1
        },
        {
          "key" : 1572652800000,
          "doc_count" : 1
        },
        {
          "key" : 1572739200000,
          "doc_count" : 1
        },
        {
          "key" : 1572825600000,
          "doc_count" : 1
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

Depending on the use case, a `CONTAINS` query could limit the documents to only those that fall entirely within the
queried range. In this example, the one document would not be included and the aggregation would be empty. Filtering
the buckets after the aggregation runs is also an option, for use cases where the document should be counted but the
out-of-bounds data can safely be ignored.
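
For the first option, one possible shape of the request is sketched below. It simply adds the range query's `relation`
parameter to the earlier query; treat it as an illustration rather than a tested example:

[source, console]
--------------------------------------------------
POST /range_index/_search?size=0
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2019-11-01",
        "format": "yyyy-MM-dd",
        "relation": "contains"
      }
    }
  },
  "aggs" : {
    "november_data" : {
      "date_histogram" : {
        "field" : "time_frame",
        "calendar_interval" : "day"
      }
    }
  }
}
--------------------------------------------------

With this relation the example document would no longer match, so the aggregation would return no buckets. The
`relation` parameter also accepts `within` and `intersects` (the default), which select documents by other
relationships between the indexed range and the queried range.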
