New Histogram field mapper that supports percentiles aggregations. (#…

…48580) This commit adds a new histogram field mapper that stores numerical data in a pre-aggregated format for use in percentiles aggregations.
elastic · Nov 28, 2019 · eade4f0
1 parent a354c60
commit eade4f0
Showing 32 changed files with 2,131 additions and 76 deletions.
diff --git a/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc b/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc
@@ -2,9 +2,9 @@
 === Percentiles Aggregation
 
 A `multi-value` metrics aggregation that calculates one or more percentiles
-over numeric values extracted from the aggregated documents.  These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.
 
 Percentiles show the point at which a certain percentage of observed values
 occur.  For example, the 95th percentile is the value which is greater than 95%

diff --git a/docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc b/docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc
@@ -2,9 +2,9 @@
 === Percentile Ranks Aggregation
 
 A `multi-value` metrics aggregation that calculates one or more percentile ranks
-over numeric values extracted from the aggregated documents.  These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.
 
 [NOTE]
 ==================================================

diff --git a/docs/reference/mapping/types.asciidoc b/docs/reference/mapping/types.asciidoc
@@ -32,6 +32,7 @@ string::         <<text,`text`>> and <<keyword,`keyword`>>
 <<ip>>::            `ip` for IPv4 and IPv6 addresses
 <<completion-suggester,Completion datatype>>::
                     `completion` to provide auto-complete suggestions
+
 <<token-count>>::   `token_count` to count the number of tokens in a string
 {plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
 {plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated-text` to index text containing special markup (typically used for identifying named entities)
@@ -54,6 +55,8 @@ string::         <<text,`text`>> and <<keyword,`keyword`>>
 
 <<shape>>:: `shape` for arbitrary cartesian geometries.
 
+<<histogram>>:: `histogram` for pre-aggregated numerical values for percentiles aggregations.
+
 [float]
 [[types-array-handling]]
 === Arrays
@@ -89,6 +92,8 @@ include::types/date_nanos.asciidoc[]
 
 include::types/dense-vector.asciidoc[]
 
+include::types/histogram.asciidoc[]
+
 include::types/flattened.asciidoc[]
 
 include::types/geo-point.asciidoc[]

diff --git a/docs/reference/mapping/types/histogram.asciidoc b/docs/reference/mapping/types/histogram.asciidoc
@@ -0,0 +1,119 @@
+[role="xpack"]
+[testenv="basic"]
+[[histogram]]
+=== Histogram datatype
+++++
+<titleabbrev>Histogram</titleabbrev>
+++++
+
+A field to store pre-aggregated numerical data representing a histogram.
+This data is defined using two paired arrays:
+
+* A `values` array of <<number, `double`>> numbers, representing the buckets for
+the histogram. These values must be provided in ascending order.
+* A corresponding `counts` array of <<number, `integer`>> numbers, representing how
+many values fall into each bucket. These numbers must be positive or zero.
+
+Because the elements in the `values` array correspond to the elements in the
+same position of the `counts` array, these two arrays must have the same length.
+
+[IMPORTANT]
+========
+* A `histogram` field can only store a single pair of `values` and `counts` arrays
+per document. Nested arrays are not supported.
+* `histogram` fields do not support sorting.
+========
+
+[[histogram-uses]]
+==== Uses
+
+`histogram` fields are primarily intended for use with aggregations. To make the
+data more readily accessible for aggregations, `histogram` field data is stored as
+binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
+`13 * numValues`, where `numValues` is the length of the provided arrays.
+
+Because the data is not indexed, you can only use `histogram` fields for the
+following aggregations and queries:
+
+* <<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation
+* <<search-aggregations-metrics-percentile-rank-aggregation,percentile ranks>> aggregation
+* <<query-dsl-exists-query,exists>> query
+
+[[mapping-types-histogram-building-histogram]]
+==== Building a histogram
+
+When using a histogram as part of an aggregation, the accuracy of the results will depend on how the
+histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
+to build it. Some possibilities include:
+
+- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the `values` array represents
+the mean centroid positions and the `counts` array represents the number of values that are attributed to each
+centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is
+carried over in the histogram.
+
+- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the `values` array represents fixed upper
+limits of each bucket interval, and the `counts` array represents the number of values that are attributed to each
+interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits),
+so the value used when generating the histogram is the maximum accuracy you can achieve at aggregation time.
+
+The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this
+means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and
+index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.
+
+[[histogram-ex]]
+==== Examples
+
+The following <<indices-create-index, create index>> API request creates a new index with two field mappings:
+
+* `my_histogram`, a `histogram` field used to store percentile data
+* `my_text`, a `keyword` field used to store a title for the histogram
+
+
+[source,console]
+--------------------------------------------------
+PUT my_index
+{
+  "mappings": {
+    "properties": {
+      "my_histogram": {
+        "type" : "histogram"
+      },
+      "my_text" : {
+        "type" : "keyword"
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+The following <<docs-index_,index>> API requests store pre-aggregated data for
+two histograms: `histogram_1` and `histogram_2`.
+
+[source,console]
+--------------------------------------------------
+PUT my_index/_doc/1
+{
+  "my_text" : "histogram_1",
+  "my_histogram" : {
+      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], <1>
+      "counts" : [3, 7, 23, 12, 6] <2>
+   }
+}
+
+PUT my_index/_doc/2
+{
+  "my_text" : "histogram_2",
+  "my_histogram" : {
+      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], <1>
+      "counts" : [8, 17, 8, 7, 6, 2] <2>
+   }
+}
+--------------------------------------------------
+<1> Values for each bucket. Values in the array are treated as doubles and must be given in
+increasing order. For <<search-aggregations-metrics-percentile-aggregation-approximation, T-Digest>>
+histograms, this value represents the mean value. For HDR histograms, it represents the value iterated to.
+<2> Count for each bucket. Values in the array are treated as integers and must be positive or zero.
+Negative values will be rejected. The relation between a bucket and a count is given by the position in the array.
+
+
+
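The constraints stated in `histogram.asciidoc` above (equal-length arrays, `values` in ascending order, non-negative `counts`, and a stored size of at most `13 * numValues` bytes) can be sketched as a small client-side check. This is only an illustration and not the mapper's own validation code: the class name `HistogramInputCheck` and both method names are hypothetical, and treating equal neighbouring values as acceptable is an assumption, since the documentation only says "ascending order".

```java
import java.util.Objects;

/** Hypothetical helper, not part of this commit: pre-checks arrays before indexing. */
public final class HistogramInputCheck {

    private HistogramInputCheck() {}

    /**
     * Returns null when the arrays satisfy the documented constraints,
     * otherwise a short description of the first problem found.
     */
    public static String firstProblem(double[] values, int[] counts) {
        Objects.requireNonNull(values, "values");
        Objects.requireNonNull(counts, "counts");
        if (values.length != counts.length) {
            return "values and counts must have the same length";
        }
        for (int i = 0; i < values.length; i++) {
            if (counts[i] < 0) {
                return "counts[" + i + "] is negative";
            }
            // Assumption: equal neighbours are tolerated; only a decrease is flagged.
            if (i > 0 && values[i] < values[i - 1]) {
                return "values[" + i + "] is smaller than values[" + (i - 1) + "]";
            }
        }
        return null;
    }

    /** Upper bound on the stored size quoted in the documentation: 13 bytes per bucket. */
    public static long maxStoredBytes(int numValues) {
        return 13L * numValues;
    }
}
```

Applied to `histogram_1` from the example, `firstProblem` returns `null`, and `maxStoredBytes(5)` gives the documented bound for its five buckets.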
diff --git a/server/src/main/java/org/elasticsearch/index/fielddata/AtomicHistogramFieldData.java b/server/src/main/java/org/elasticsearch/index/fielddata/AtomicHistogramFieldData.java
@@ -0,0 +1,34 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.fielddata;
+
+
+import java.io.IOException;
+
+/**
+ * {@link AtomicFieldData} specialization for histogram data.
+ */
+public interface AtomicHistogramFieldData extends AtomicFieldData {
+
+    /**
+     * Return Histogram values.
+     */
+    HistogramValues getHistogramValues() throws IOException;
+
+}
diff --git a/server/src/main/java/org/elasticsearch/index/fielddata/HistogramValue.java b/server/src/main/java/org/elasticsearch/index/fielddata/HistogramValue.java
@@ -0,0 +1,48 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+import java.io.IOException;
+
+/**
+ * Per-document histogram value. Each entry of the histogram consists of
+ * a value and a count.
+ */
+public abstract class HistogramValue {
+
+    /**
+     * Advance this instance to the next value of the histogram
+     * @return true if there is a next value
+     */
+    public abstract boolean next() throws IOException;
+
+    /**
+     * The current value of the histogram
+     * @return the current value of the histogram
+     */
+    public abstract double value();
+
+    /**
+     * The current count of the histogram
+     * @return the current count of the histogram
+     */
+    public abstract int count();
+
+}
diff --git a/server/src/main/java/org/elasticsearch/index/fielddata/HistogramValues.java b/server/src/main/java/org/elasticsearch/index/fielddata/HistogramValues.java
@@ -0,0 +1,41 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+import java.io.IOException;
+
+/**
+ * Per-segment histogram values.
+ */
+public abstract class HistogramValues {
+
+    /**
+     * Advance this instance to the given document id
+     * @return true if there is a value for this document
+     */
+    public abstract boolean advanceExact(int doc) throws IOException;
+
+    /**
+     * Get the {@link HistogramValue} associated with the current document.
+     * The returned {@link HistogramValue} might be reused across calls.
+     */
+    public abstract HistogramValue histogram() throws IOException;
+
+}
diff --git a/server/src/main/java/org/elasticsearch/index/fielddata/IndexHistogramFieldData.java b/server/src/main/java/org/elasticsearch/index/fielddata/IndexHistogramFieldData.java
@@ -0,0 +1,34 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.fielddata.plain.DocValuesIndexFieldData;
+
+/**
+ * Specialization of {@link IndexFieldData} for histograms.
+ */
+public abstract class IndexHistogramFieldData extends DocValuesIndexFieldData implements IndexFieldData<AtomicHistogramFieldData> {
+
+    public IndexHistogramFieldData(Index index, String fieldName) {
+        super(index, fieldName);
+    }
+}
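The iteration contract implied by `HistogramValues` and `HistogramValue` above (position on a document with `advanceExact`, fetch its `histogram()`, then loop on `next()` while reading `value()` and `count()`) can be sketched as follows. This is a minimal illustration under stated assumptions rather than code from the commit: the nested abstract classes are local copies so the block compiles on its own, and `HistogramIterationSketch`, `inMemory`, and `totalCount` are hypothetical names.

```java
import java.io.IOException;

/** Hypothetical, self-contained sketch of the consumer-side iteration pattern. */
public final class HistogramIterationSketch {

    private HistogramIterationSketch() {}

    // Local copies of the abstract classes from the diff, so the sketch compiles alone.
    public abstract static class HistogramValue {
        public abstract boolean next() throws IOException;
        public abstract double value();
        public abstract int count();
    }

    public abstract static class HistogramValues {
        public abstract boolean advanceExact(int doc) throws IOException;
        public abstract HistogramValue histogram() throws IOException;
    }

    /** Array-backed stand-in: row i holds document i, a null row means "no value". */
    public static HistogramValues inMemory(double[][] values, int[][] counts) {
        return new HistogramValues() {
            private int doc = -1;

            @Override
            public boolean advanceExact(int target) {
                doc = target;
                return target >= 0 && target < values.length && values[target] != null;
            }

            @Override
            public HistogramValue histogram() {
                final double[] v = values[doc];
                final int[] c = counts[doc];
                return new HistogramValue() {
                    private int pos = -1; // positioned before the first bucket

                    @Override public boolean next() { return ++pos < v.length; }
                    @Override public double value() { return v[pos]; }
                    @Override public int count() { return c[pos]; }
                };
            }
        };
    }

    /** Sums the counts of every bucket across documents 0..maxDoc-1. */
    public static long totalCount(HistogramValues histograms, int maxDoc) throws IOException {
        long total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (histograms.advanceExact(doc)) {       // skip documents without a histogram
                HistogramValue h = histograms.histogram();
                while (h.next()) {                    // must call next() before value()/count()
                    total += h.count();
                }
            }
        }
        return total;
    }
}
```

Because the Javadoc notes that the returned `HistogramValue` might be reused across calls, a consumer following this pattern should finish reading one document's buckets before advancing to the next document.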