New Histogram field mapper that supports percentiles aggregations. (#48580)

This commit adds a new histogram field mapper that stores numerical data in a pre-aggregated format for use in percentiles aggregations.
iverase authored Nov 28, 2019
1 parent a354c60 commit eade4f0
Showing 32 changed files with 2,131 additions and 76 deletions.
@@ -2,9 +2,9 @@
=== Percentiles Aggregation

A `multi-value` metrics aggregation that calculates one or more percentiles
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
@@ -2,9 +2,9 @@
=== Percentile Ranks Aggregation

A `multi-value` metrics aggregation that calculates one or more percentile ranks
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.

[NOTE]
==================================================
5 changes: 5 additions & 0 deletions docs/reference/mapping/types.asciidoc
@@ -32,6 +32,7 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
<<ip>>:: `ip` for IPv4 and IPv6 addresses
<<completion-suggester,Completion datatype>>::
`completion` to provide auto-complete suggestions

<<token-count>>:: `token_count` to count the number of tokens in a string
{plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
{plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated-text` to index text containing special markup (typically used for identifying named entities)
@@ -54,6 +55,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>

<<shape>>:: `shape` for arbitrary cartesian geometries.

<<histogram>>:: `histogram` for pre-aggregated numerical values for percentiles aggregations.

[float]
[[types-array-handling]]
=== Arrays
@@ -89,6 +92,8 @@ include::types/date_nanos.asciidoc[]

include::types/dense-vector.asciidoc[]

include::types/histogram.asciidoc[]

include::types/flattened.asciidoc[]

include::types/geo-point.asciidoc[]
119 changes: 119 additions & 0 deletions docs/reference/mapping/types/histogram.asciidoc
@@ -0,0 +1,119 @@
[role="xpack"]
[testenv="basic"]
[[histogram]]
=== Histogram datatype
++++
<titleabbrev>Histogram</titleabbrev>
++++

A field to store pre-aggregated numerical data representing a histogram.
This data is defined using two paired arrays:

* A `values` array of <<number, `double`>> numbers, representing the buckets for
the histogram. These values must be provided in ascending order.
* A corresponding `counts` array of <<number, `integer`>> numbers, representing how
many values fall into each bucket. These numbers must be positive or zero.

Because the elements in the `values` array correspond to the elements in the
same position of the `counts` array, these two arrays must have the same length.

[IMPORTANT]
========
* A `histogram` field can only store a single pair of `values` and `counts` arrays
per document. Nested arrays are not supported.
* `histogram` fields do not support sorting.
========
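
The constraints above can be checked on the client side before a document is sent. The following is a minimal sketch, not the mapper's own validation code: the class `HistogramInputCheck` and its method are hypothetical names, and the sketch assumes that equal neighbouring values count as "ascending".

```java
import java.util.Objects;

/** Hypothetical helper, not part of Elasticsearch: checks the documented constraints. */
final class HistogramInputCheck {

    private HistogramInputCheck() {}

    /**
     * Returns null if the arrays satisfy the documented constraints,
     * otherwise a short description of the first violation found.
     */
    static String firstViolation(double[] values, int[] counts) {
        Objects.requireNonNull(values, "values");
        Objects.requireNonNull(counts, "counts");
        if (values.length != counts.length) {
            return "values and counts must have the same length";
        }
        for (int i = 0; i < values.length; i++) {
            // Assumption: equal neighbours are tolerated, only a decrease is an error.
            if (i > 0 && values[i] < values[i - 1]) {
                return "values must be in ascending order (index " + i + ")";
            }
            if (counts[i] < 0) {
                return "counts must be positive or zero (index " + i + ")";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        double[] values = {0.1, 0.2, 0.3, 0.4, 0.5};
        int[] counts = {3, 7, 23, 12, 6};
        String violation = firstViolation(values, counts);
        System.out.println(violation == null ? "valid" : violation);
    }
}
```

Returning a message instead of throwing keeps the check easy to reuse in bulk-indexing code that wants to collect all rejected documents.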

[[histogram-uses]]
==== Uses

`histogram` fields are primarily intended for use with aggregations. To make it
more readily accessible for aggregations, `histogram` field data is stored as
binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
`13 * numValues`, where `numValues` is the length of the provided arrays.

Because the data is not indexed, you can only use `histogram` fields for the
following aggregations and queries:

* <<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation
* <<search-aggregations-metrics-percentile-rank-aggregation,percentile ranks>> aggregation
* <<query-dsl-exists-query,exists>> query

[[mapping-types-histogram-building-histogram]]
==== Building a histogram

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the
histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the `values` array represents
the mean centroid positions and the `counts` array represents the number of values that are attributed to each
centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is
carried over in the histogram.

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the `values` array represents fixed upper
limits of each bucket interval, and the `counts` array represents the number of values that are attributed to each
interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits),
therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this
means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and
index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.
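
As a rough illustration of the interval-style layout, the sketch below pre-aggregates raw samples into paired `values` and `counts` arrays, using the upper limit of each bucket as its value. It is a simplification under stated assumptions: the buckets have a fixed width, whereas HDRHistogram derives its bucket sizes from the configured number of significant digits, and the class `FixedWidthHistogramSketch` is a hypothetical name that does not exist in Elasticsearch.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical client-side helper: fixed-width buckets keyed by their upper limit. */
final class FixedWidthHistogramSketch {

    /** Paired arrays in the shape the histogram field expects. */
    static final class Histogram {
        final double[] values;
        final int[] counts;

        Histogram(double[] values, int[] counts) {
            this.values = values;
            this.counts = counts;
        }
    }

    static Histogram build(double[] samples, double width) {
        if (!(width > 0.0)) {
            throw new IllegalArgumentException("width must be positive");
        }
        // TreeMap keeps bucket indices sorted, so the resulting values are ascending.
        TreeMap<Long, Integer> buckets = new TreeMap<>();
        for (double sample : samples) {
            // A sample falls into the bucket whose upper limit is index * width.
            long index = (long) Math.ceil(sample / width);
            buckets.merge(index, 1, Integer::sum);
        }
        double[] values = new double[buckets.size()];
        int[] counts = new int[buckets.size()];
        int i = 0;
        for (Map.Entry<Long, Integer> bucket : buckets.entrySet()) {
            values[i] = bucket.getKey() * width;
            counts[i] = bucket.getValue();
            i++;
        }
        return new Histogram(values, counts);
    }

    public static void main(String[] args) {
        Histogram h = build(new double[] {0.2, 0.4, 0.9, 1.0, 1.7}, 0.5);
        System.out.println(Arrays.toString(h.values));
        System.out.println(Arrays.toString(h.counts));
    }
}
```

Empty buckets are simply omitted, which keeps the arrays short; a count of zero would also be accepted by the field.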

[[histogram-ex]]
==== Examples

The following <<indices-create-index, create index>> API request creates a new index with two field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "my_histogram": {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}
--------------------------------------------------

The following <<docs-index_,index>> API requests store pre-aggregated data for
two histograms: `histogram_1` and `histogram_2`.

[source,console]
--------------------------------------------------
PUT my_index/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], <1>
      "counts" : [3, 7, 23, 12, 6] <2>
   }
}

PUT my_index/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], <1>
      "counts" : [8, 17, 8, 7, 6, 2] <2>
   }
}
--------------------------------------------------
<1> Values for each bucket. Values in the array are treated as doubles and must be given in
increasing order. For <<search-aggregations-metrics-percentile-aggregation-approximation, T-Digest>>
histograms, this value represents the mean value. For HDR histograms, it represents the value iterated to.
<2> Count for each bucket. Values in the array are treated as integers and must be positive or zero.
Negative values are rejected. The relation between a bucket and its count is given by the position in the arrays.
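
To see how paired arrays like these carry percentile information, the sketch below reads a percentile straight off the `values` and `counts` of `histogram_1` with the nearest-rank method. This is an illustration only: the percentiles aggregation uses T-Digest or HDRHistogram rather than nearest rank, so its results may differ, and the class `NearestRankPercentile` is a hypothetical name.

```java
/** Hypothetical illustration: nearest-rank percentile over pre-aggregated data. */
final class NearestRankPercentile {

    private NearestRankPercentile() {}

    static double percentile(double[] values, int[] counts, double percent) {
        if (values.length == 0 || values.length != counts.length) {
            throw new IllegalArgumentException("values and counts must be non-empty and of equal length");
        }
        if (percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("histogram holds no observations");
        }
        // Nearest rank: the smallest rank that covers at least percent% of the observations.
        long rank = Math.max(1L, (long) Math.ceil(percent / 100.0 * total));
        long seen = 0;
        for (int i = 0; i < values.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return values[i];
            }
        }
        // Not reached for valid input, because rank never exceeds total.
        return values[values.length - 1];
    }

    public static void main(String[] args) {
        double[] values = {0.1, 0.2, 0.3, 0.4, 0.5};
        int[] counts = {3, 7, 23, 12, 6};
        System.out.println("p50 = " + percentile(values, counts, 50));
        System.out.println("p95 = " + percentile(values, counts, 95));
    }
}
```

With 51 observations in total, the 50th percentile corresponds to rank 26, which falls in the third bucket.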



@@ -0,0 +1,34 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.fielddata;


import java.io.IOException;

/**
 * {@link AtomicFieldData} specialization for histogram data.
 */
public interface AtomicHistogramFieldData extends AtomicFieldData {

    /**
     * Return Histogram values.
     */
    HistogramValues getHistogramValues() throws IOException;

}
@@ -0,0 +1,48 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;

import java.io.IOException;

/**
 * Per-document histogram value. Every value of the histogram consists of
 * a value and a count.
 */
public abstract class HistogramValue {

    /**
     * Advance this instance to the next value of the histogram
     * @return true if there is a next value
     */
    public abstract boolean next() throws IOException;

    /**
     * The current value of the histogram
     * @return the current value of the histogram
     */
    public abstract double value();

    /**
     * The current count of the histogram
     * @return the current count of the histogram
     */
    public abstract int count();

}
@@ -0,0 +1,41 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;

import java.io.IOException;

/**
 * Per-segment histogram values.
 */
public abstract class HistogramValues {

    /**
     * Advance this instance to the given document id
     * @return true if there is a value for this document
     */
    public abstract boolean advanceExact(int doc) throws IOException;

    /**
     * Get the {@link HistogramValue} associated with the current document.
     * The returned {@link HistogramValue} might be reused across calls.
     */
    public abstract HistogramValue histogram() throws IOException;

}
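
The two abstract classes above imply a two-level iteration: position `HistogramValues` on a document with `advanceExact`, then walk the buckets of the returned `HistogramValue` with `next`. The following is a minimal sketch of that pattern, not code from this commit. To compile on its own it declares stand-in copies of the two classes with the same method signatures, and `ArrayHistogramValues` and `totalCount` are hypothetical names backed by plain arrays instead of doc values.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class HistogramIterationSketch {

    // Stand-ins mirroring the abstract classes in this commit so the sketch compiles alone.
    abstract static class HistogramValue {
        abstract boolean next() throws IOException;
        abstract double value();
        abstract int count();
    }

    abstract static class HistogramValues {
        abstract boolean advanceExact(int doc) throws IOException;
        abstract HistogramValue histogram() throws IOException;
    }

    /** Hypothetical array-backed implementation: one histogram (or null) per doc id. */
    static final class ArrayHistogramValues extends HistogramValues {
        private final double[][] values;
        private final int[][] counts;
        private int doc = -1;

        ArrayHistogramValues(double[][] values, int[][] counts) {
            this.values = values;
            this.counts = counts;
        }

        @Override
        boolean advanceExact(int doc) {
            this.doc = doc;
            return doc >= 0 && doc < values.length && values[doc] != null;
        }

        @Override
        HistogramValue histogram() {
            final double[] v = values[doc];
            final int[] c = counts[doc];
            return new HistogramValue() {
                private int i = -1; // positioned before the first bucket

                @Override boolean next() { return ++i < v.length; }
                @Override double value() { return v[i]; }
                @Override int count() { return c[i]; }
            };
        }
    }

    /** Sums the bucket counts of one document, or returns 0 if it has no histogram. */
    static long totalCount(HistogramValues values, int doc) {
        try {
            if (!values.advanceExact(doc)) {
                return 0;
            }
            HistogramValue histogram = values.histogram();
            long total = 0;
            while (histogram.next()) {
                total += histogram.count();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HistogramValues values = new ArrayHistogramValues(
            new double[][] {{0.1, 0.2, 0.3, 0.4, 0.5}, null},
            new int[][] {{3, 7, 23, 12, 6}, null});
        System.out.println(totalCount(values, 0));
        System.out.println(totalCount(values, 1));
    }
}
```

Because the Javadoc notes that the returned `HistogramValue` might be reused across calls, a consumer should finish reading one document's buckets before advancing to the next document.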
@@ -0,0 +1,34 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.fielddata;


import org.elasticsearch.index.Index;
import org.elasticsearch.index.fielddata.plain.DocValuesIndexFieldData;

/**
 * Specialization of {@link IndexFieldData} for histograms.
 */
public abstract class IndexHistogramFieldData extends DocValuesIndexFieldData implements IndexFieldData<AtomicHistogramFieldData> {

    public IndexHistogramFieldData(Index index, String fieldName) {
        super(index, fieldName);
    }
}