Consider query when optimizing date rounding #63403
Conversation
Before this change we inspected the index when optimizing `date_histogram` aggregations, precalculating the divisions for the buckets for the entire range of dates on the index so long as there aren't a ton of these buckets. This works very well when you query all of the dates in the index, which is quite common - after all, folks frequently want to query a week of data and have daily indices. But it doesn't work as well when the index is much larger than the query. This is quite common when dumping data into ES just to investigate it, but less common in the traditional time series use case. But even there it still happens, it is just less impactful. Consider the default query produced by Kibana's Discover app: a range of 15 minutes and an interval of 30 seconds. This optimization saves something like 3 to 12 nanoseconds per document, so those 15 minutes would have to contain hundreds of millions of documents for it to be impactful.

Anyway, this commit takes the query into account when precalculating the buckets. Mostly this is good when you have "dirty data". Imagine loading 80 billion docs into an index to investigate them. Most of them have dates around 2015 and 2016, but some have dates in 1970 and others have dates in 2030. These outlier dates are "dirty" garbage. Well, without this change a `date_histogram` across many of these docs is significantly slowed down because we don't precalculate the range due to the outliers. That's just rude! So this change takes the query into account.

The bulk of the code change here is plumbing the query into place. It turns out that it's a *ton* of plumbing, so instead of just adding a `Query` member to hundreds of argument lists, this replaces `QueryShardContext` with a new `AggregationContext` which does two things:
1. Has the top level `Query`.
2. Exposes just the parts of `QueryShardContext` that we actually need to run aggregations.

This lets us simplify a few tests now and will let us simplify many, many tests later.
Here are some performance results:
"no optimization" is without #63245. Without this change and with #63245 we perform worse on dirty data because we take the time to precalculate and give up. Without the optimizations my performance test machine couldn't barely not hit the target interval of 13 operations per second. Even the 50% percentile service time was above 13 seconds. Barely. With this optimization it is barely under twelve. So with dirty data this saves about a second or about 8%. Not bad but we can do better! And we will. Eventually. This unblocks that "doing better" on dirty data. |
All the BWC failures look like standard branch cut day fun.
import static org.hamcrest.Matchers.containsString;

public class ParentJoinFieldMapperTests extends ESSingleNodeTestCase {
I simplified this when I bumped into it working on solving this issue. It's not strictly related, but some changes were indeed required to be compatible with the rest of the changes. The key simplification is that we don't stand up a whole node any more - just set up the mapper parsing infrastructure and some lucene indices.
public AggregationUsageService getUsageService() {
    return valuesSourceRegistry.getUsageService();
}
We don't need this at all any more - the caller now gets it from the ValuesSourceRegistry.
try {
    AggregatorFactories factories = source.aggregations().build(queryShardContext, null);
This is the start of the actual plumbing.
@@ -70,7 +69,6 @@ static AutoDateHistogramAggregator build(
    AggregatorFactories factories,
    int targetBuckets,
    RoundingInfo[] roundingInfos,
    Function<Rounding, Rounding.Prepared> roundingPreparer,
There was a TODO around moving this to the ctor which I bumped into while I was fixing the calls next to it.
@@ -441,14 +441,14 @@ protected ValuesSourceAggregatorFactory innerBuild(QueryShardContext queryShardC
    LongBounds roundedBounds = null;
    if (this.extendedBounds != null) {
        // parse any string bounds to longs and round
        roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds", queryShardContext, config.format())
        roundedBounds = this.extendedBounds.parseAndValidate(name, "extended_bounds", context::nowInMillis, config.format())
I switched the parsing so we don't need to pass the whole query shard context in; now it's easier to test, too!
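The pattern here - passing a narrow dependency (just `context::nowInMillis`) instead of the whole context - is what makes the parsing easy to test. A rough Python sketch of the idea, with hypothetical names that are not the real API:

```python
# Hypothetical sketch of the testability win from passing a narrow
# `now_in_millis` callable instead of a whole query shard context.

def parse_and_validate(raw_min, raw_max, now_in_millis):
    """Parse bounds where the only context needed is the current time."""
    def resolve(raw):
        return now_in_millis() if raw == "now" else int(raw)
    lo, hi = resolve(raw_min), resolve(raw_max)
    if lo > hi:
        raise ValueError(f"min bound [{lo}] is greater than max bound [{hi}]")
    return lo, hi

# A test just passes a fixed clock - no heavyweight context to stand up.
bounds = parse_and_validate("0", "now", lambda: 1_000)
```

The test supplies a lambda for the clock, where previously it would have needed to construct (or mock) an entire context object.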
 * search index (the first test) and the resolution which is
 * on the DateFieldType.
 */
if (fieldContext.fieldType() instanceof DateFieldType == false) {
This means that runtime fields can't use the optimization. It makes me think that we're doing something wrong, but I think that is something to solve in a follow up.
Agreed. The `instanceof` check definitely smells wrong here, but I don't know what the right answer is.
import java.util.Map;
import java.util.function.Function;

import static org.hamcrest.Matchers.equalTo;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class AggregatorBaseTests extends ESSingleNodeTestCase {
This one was so short I thought I could clean up lots of these `ESSingleNodeTestCase`s in this PR. It turns out that no, no, I can't.
// TODO: This whole set of tests needs to be rethought.
public class ValuesSourceConfigTests extends MapperServiceTestCase {
No more entire node!
    return source("1", build, null);
}

protected final SourceToParse source(String id, CheckedConsumer<XContentBuilder, IOException> build, @Nullable String routing)
parent/child tests wanted this one.
@@ -222,7 +251,111 @@ protected final XContentBuilder fieldMapping(CheckedConsumer<XContentBuilder, IO
    });
}

QueryShardContext createQueryShardContext(MapperService mapperService) {
private AggregationContext aggregationContext(MapperService mapperService, IndexSearcher searcher, Query query) {
I decided not to go with Mockito here, partially because I wanted to suffer every time I added a new method to `AggregationContext`.
I can't tell if you're joking or not.
I'm really not. Suffering makes you think "should I really add this method? this class is already big. maybe there is a cleaner way."
    ScriptedMetricAggContexts.CombineScript.CONTEXT);
Map<String, Object> combineScriptParams = combineScript.getParams();

return new ScriptedMetricAggregatorFactory(name, compiledMapScript, mapScriptParams, compiledInitScript,
    initScriptParams, compiledCombineScript, combineScriptParams, reduceScript,
    params, queryShardContext.lookup(), queryShardContext, parent, subfactoriesBuilder, metadata);
    params, context, parent, subfactoriesBuilder, metadata);
I like this. It annoys me when we pass in both an object and something derived from that object like the old version had.
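The smell being fixed here - passing both an object and a value derived from that object - can be illustrated with a short hypothetical sketch (names invented for illustration, not the real classes):

```python
# Hypothetical sketch of the redundant-argument smell. When a factory
# takes both `context.lookup()` and `context`, the two arguments can
# drift apart; deriving the lookup inside the factory removes that risk.

class Context:
    def __init__(self, lookup):
        self._lookup = lookup

    def lookup(self):
        return self._lookup

# Before: factory(name, context.lookup(), context) - redundant arguments.
# After: pass only the context and derive what you need from it.
def factory(name, context):
    return {"name": name, "lookup": context.lookup()}

built = factory("scripted_metric", Context("search-lookup"))
```

Since the lookup is always derivable from the context, there is now exactly one source of truth at the call site.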
import java.util.List;
import java.util.Optional;

public abstract class AggregationContext {
We should have some class level javadoc for this.
 * top level {@link Query}.
 */
public static AggregationContext from(QueryShardContext context, Query query) {
    return new AggregationContext() {
What are we gaining by making `AggregationContext` abstract and building it via this anonymous closure thing? Seems to me, we could just store a reference to a `QueryShardContext` in a concrete class and serve these same methods up directly. I think that would be more readable, but maybe there's another consideration I haven't thought of?
@not-napoleon I've pushed patches for all of your notes. I also explained my reasoning around making the class abstract.
This looks good. I think clearly documenting what the production path looks like solves my concerns around making `AggregationContext` abstract. Thank you for addressing the nits too!
@elasticmachine, retest this please
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)