Consider query when optimizing date rounding #63403
Conversation
Before this change we inspected the index when optimizing `date_histogram` aggregations, precalculating the divisions for the buckets for the entire range of dates on the index so long as there aren't a ton of these buckets. This works very well when you query all of the dates in the index, which is quite common - after all, folks frequently want to query a week of data and have daily indices. But it doesn't work as well when the index is much larger than the query. This is quite common when dumping data into ES just to investigate it, but less common in the traditional time series use case. But even there it still happens, it is just less impactful. Consider the default query produced by Kibana's Discover app: a range of 15 minutes and an interval of 30 seconds. This optimization saves something like 3 to 12 nanoseconds per document, so those 15 minutes would have to contain hundreds of millions of documents for it to be impactful.

Anyway, this commit takes the query into account when precalculating the buckets. Mostly this is good when you have "dirty data". Imagine loading 80 billion docs into an index to investigate them. Most of them have dates around 2015 and 2016, but some have dates in 1970 and others have dates in 2030. These outlier dates are "dirty" garbage. Well, without this change a `date_histogram` across many of these docs is significantly slowed down because we don't precalculate the range due to the outliers. That's just rude! So this change takes the query into account.

The bulk of the code change here is plumbing the query into place. It turns out that it's a *ton* of plumbing, so instead of just adding a `Query` member to hundreds of argument lists, this replaces `QueryShardContext` with a new `AggregationContext` which does two things:
1. Has the top level `Query`.
2. Exposes just the parts of `QueryShardContext` that we actually need to run aggregations.

This lets us simplify a few tests now and will let us simplify many, many tests later.
Here are some performance results:
"no optimization" is without #63245. Without this change and with #63245 we perform worse on dirty data because we take the time to precalculate and give up. Without the optimizations my performance test machine couldn't barely not hit the target interval of 13 operations per second. Even the 50% percentile service time was above 13 seconds. Barely. With this optimization it is barely under twelve. So with dirty data this saves about a second or about 8%. Not bad but we can do better! And we will. Eventually. This unblocks that "doing better" on dirty data. |
All the BWC failures look like standard branch cut day fun.
import static org.hamcrest.Matchers.containsString;

public class ParentJoinFieldMapperTests extends ESSingleNodeTestCase {
I simplified this when I bumped into it working on solving this issue. It's not strictly related, but some changes were indeed required to be compatible with the rest of the changes. The key simplification is that we don't stand up a whole node any more - just set up the mapper parsing infrastructure and some lucene indices.
public AggregationUsageService getUsageService() {
    return valuesSourceRegistry.getUsageService();
}
We don't need this at all any more - the caller now gets it from the ValuesSourceRegistry.
try {
    AggregatorFactories factories = source.aggregations().build(queryShardContext, null);
This is the start of the actual plumbing.
@@ -70,7 +69,6 @@ static AutoDateHistogramAggregator build(
    AggregatorFactories factories,
    int targetBuckets,
    RoundingInfo[] roundingInfos,
    Function<Rounding, Rounding.Prepared> roundingPreparer,
There was a TODO around moving this to the ctor which I bumped into while I was fixing the calls next to it.
@@ -441,14 +441,14 @@ protected ValuesSourceAggregatorFactory innerBuild(QueryShardContext queryShardC
    LongBounds roundedBounds = null;
    if (this.extendedBounds != null) {
        // parse any string bounds to longs and round
        roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds", queryShardContext, config.format())
        roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds", context::nowInMillis, config.format())
I switched the parsing so we don't need to pass the whole query shard context in; now it's easier to test, too!
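The pattern here - passing a narrow dependency (just `context::nowInMillis`) instead of the whole context - is what makes the parsing easy to test. A rough Python sketch of the idea, with hypothetical names that are not the real API:

```python
# Hypothetical sketch of the testability win from passing a narrow
# `now_in_millis` callable instead of a whole query shard context.

def parse_and_validate(raw_min, raw_max, now_in_millis):
    """Parse bounds where the only context needed is the current time."""
    def resolve(raw):
        return now_in_millis() if raw == "now" else int(raw)
    lo, hi = resolve(raw_min), resolve(raw_max)
    if lo > hi:
        raise ValueError(f"min bound [{lo}] is greater than max bound [{hi}]")
    return lo, hi

# A test just passes a fixed clock - no heavyweight context to stand up.
bounds = parse_and_validate("0", "now", lambda: 1_000)
```

The test supplies a lambda for the clock, where previously it would have needed to construct (or mock) an entire context object.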
 * search index (the first test) and the resolution which is
 * on the DateFieldType.
 */
if (fieldContext.fieldType() instanceof DateFieldType == false) {
This means that runtime fields can't use the optimization. It makes me think that we're doing something wrong, but I think that is something to solve in a follow up.
Agreed. The `instanceof` check definitely smells wrong here, but I don't know what the right answer is.
import java.util.Map;
import java.util.function.Function;

import static org.hamcrest.Matchers.equalTo;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class AggregatorBaseTests extends ESSingleNodeTestCase {
This one was so short I thought I could clean up lots of these `ESSingleNodeTestCase`s in this PR. It turns out that no, no, I can't.
// TODO: This whole set of tests needs to be rethought.
public class ValuesSourceConfigTests extends MapperServiceTestCase {
No more entire node!
    return source("1", build, null);
}

protected final SourceToParse source(String id, CheckedConsumer<XContentBuilder, IOException> build, @Nullable String routing)
parent/child tests wanted this one.
@@ -222,7 +251,111 @@ protected final XContentBuilder fieldMapping(CheckedConsumer<XContentBuilder, IO
    });
}

QueryShardContext createQueryShardContext(MapperService mapperService) {
private AggregationContext aggregationContext(MapperService mapperService, IndexSearcher searcher, Query query) {
I decided not to go with Mockito here, partially because I wanted to suffer every time I added a new method to `AggregationContext`.
I can't tell if you're joking or not.
I'm really not. Suffering makes you think "should I really add this method? this class is already big. maybe there is a cleaner way."
    ScriptedMetricAggContexts.CombineScript.CONTEXT);
Map<String, Object> combineScriptParams = combineScript.getParams();

return new ScriptedMetricAggregatorFactory(name, compiledMapScript, mapScriptParams, compiledInitScript,
    initScriptParams, compiledCombineScript, combineScriptParams, reduceScript,
    params, queryShardContext.lookup(), queryShardContext, parent, subfactoriesBuilder, metadata);
    params, context, parent, subfactoriesBuilder, metadata);
I like this. It annoys me when we pass in both an object and something derived from that object like the old version had.
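The smell being fixed here - passing both an object and a value derived from that object - can be illustrated with a short hypothetical sketch (names invented for illustration, not the real classes):

```python
# Hypothetical sketch of the redundant-argument smell. When a factory
# takes both `context.lookup()` and `context`, the two arguments can
# drift apart; deriving the lookup inside the factory removes that risk.

class Context:
    def __init__(self, lookup):
        self._lookup = lookup

    def lookup(self):
        return self._lookup

# Before: factory(name, context.lookup(), context) - redundant arguments.
# After: pass only the context and derive what you need from it.
def factory(name, context):
    return {"name": name, "lookup": context.lookup()}

built = factory("scripted_metric", Context("search-lookup"))
```

Since the lookup is always derivable from the context, there is now exactly one source of truth at the call site.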
import java.util.List;
import java.util.Optional;

public abstract class AggregationContext {
We should have some class level javadoc for this.
 * top level {@link Query}.
 */
public static AggregationContext from(QueryShardContext context, Query query) {
    return new AggregationContext() {
What are we gaining by making `AggregationContext` abstract and building it via this anonymous closure thing? Seems to me, we could just store a reference to a `QueryShardContext` in a concrete class and serve these same methods up directly. I think that would be more readable, but maybe there's another consideration I haven't thought of?
@not-napoleon I've pushed patches for all of your notes. I also explained my reasoning around making the class abstract.
This looks good. I think clearly documenting what the production path looks like solves my concerns around making `AggregationContext` abstract. Thank you for addressing the nits too!
@elasticmachine, retest this please
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)