-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup time_series agg by caching current tsid ordinal, parent bucket ordinal and buck ordinal #91784
Speedup time_series agg by caching current tsid ordinal, parent bucket ordinal and buck ordinal #91784
Conversation
…t ordinal and buck ordinal. This avoids needlessly adding the same parent bucket ordinal or TSIDs to `BytesKeyedBucketOrds`. Relates to elastic#74660
Pinging @elastic/es-analytics-geo (Team:Analytics) |
@Override | ||
public void collect(int doc, long bucket) throws IOException { | ||
if (currentBucket == bucket && currentTsidOrd == aggCtx.getTsidOrd()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this aggregation is always called with bucket == 0
so maybe remove this currentBucket == bucket
condition here and replace it with an assertion that bucket
is zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is only the case if time_series
aggregation is a top level aggregation.
I don't think this is always the case and for example a date_histogram
aggregation is a likely parent aggregation in the case of time_series
aggregation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes sense. Then maybe the currentBucket == bucket
condition is a bit fragile as it would prevent the optimization from kicking in if this collector is called with different bucket ordinals even though it the TSID doesn't change. What about changing currentTsidOrd
and currentBucketOrdinal
to arrays that are keyed by bucket
, ie. tracking separately the current TSID and bucket ordinal per bucket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the case we know there isn't a parent aggregation then we can return a leaf bucket collector that only does the currentTsidOrd == aggCtx.getTsidOrd()
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing currentTsidOrd and currentBucketOrdinal to arrays that are keyed by bucket, ie. tracking separately the current TSID and bucket ordinal per bucket?
We'd need a BigArray thing because we can have many buckets above us and we'd want to track the memory. It could be useful though. If you ever "come back" to the same bucket. But I don't think, say, date_histogram
will. I think once it's finished collecting some date for the time series it'll never come back.
I think the case we know there isn't a parent aggregation then we can return a leaf bucket collector that only does the currentTsidOrd == aggCtx.getTsidOrd() check.
That's what the CardinalityUpperBound bucketCardinality
argument to the ctor is for. If it's exactly one you know you don't have a parent agg or it always makes a single bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually things should be fine with a terms
aggregation as a parent too, because almost all the time this terms
aggregation would run on a field that is a dimension or a tag. So bucket
would only change when tsidOrd
changes too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is true if the field of the terms
agg is a dimension field. But maybe not in the case if the field is a non dimension field (label field)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. My intuition is that we don't have to care much about this case of labels that could take different values for the same TSID (as long as the aggregation output is correct, just less optimized), do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intuition is that we don't have to care much about this case of labels that could take different values for the same TSID (as long as the aggregation output is correct, just less optimized), do we?
This is what I think as well. This change favours parent aggs based on dimension fields, @timestamp
or no parent aggregation. For the other cases, this change doesn't help, but I think doesn't hurt either. We are always correct, since we then fall back to checking whether we have seen tsid & bucket combination in BytesKeyedBucketOrds
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment: f23de49
@Override | ||
public void collect(int doc, long bucket) throws IOException { | ||
if (currentBucket == bucket && currentTsidOrd == aggCtx.getTsidOrd()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing currentTsidOrd and currentBucketOrdinal to arrays that are keyed by bucket, ie. tracking separately the current TSID and bucket ordinal per bucket?
We'd need a BigArray thing because we can have many buckets above us and we'd want to track the memory. It could be useful though. If you ever "come back" to the same bucket. But I don't think, say, date_histogram
will. I think once it's finished collecting some date for the time series it'll never come back.
I think the case we know there isn't a parent aggregation then we can return a leaf bucket collector that only does the currentTsidOrd == aggCtx.getTsidOrd() check.
That's what the CardinalityUpperBound bucketCardinality
argument to the ctor is for. If it's exactly one you know you don't have a parent agg or it always makes a single bucket.
collectExistingBucket(sub, doc, currentBucketOrdinal); | ||
return; | ||
} | ||
|
||
long bucketOrdinal = bucketOrds.add(bucket, aggCtx.getTsid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a tsidOrd can we use that as a key instead of the bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we need the tsid as bytes in the buildAggregations()
method and I don't think we can use this tsidord that TimeSeriesIndexSearcher
generates as a lookup to get the tsid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this tsidord is neither a segment ordinal nor a global ordinal but an ordinal for TSIDs that intersect with the query on this shard. So it can't be used as a key to retrieve TSIDs from the terms dictionary. (Separately, maybe we should rename this method to reduce chances that someone mistakenly uses it as an ordinal for the _tsid
terms dict? I worry a bit that it would look like it works for common cases like the query only consists of a range query on the @timestamp
and this time range happens to match every unique TSID, so tests might not even catch the problem if someone started doing this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename the method in a follow up change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Benchmarking this change to time series aggregation by using the query added in elastic/rally-tracks#348 results in a good improvement:
|
* upstream/main: (209 commits) Remove unused methods and classes from HLRC (elastic#92030) Clean up on exception while chunking XContent (elastic#92024) Add profiling plugin (elastic#91640) Remove unused methods and classes from HLRC (elastic#92012) Remove IndexerState from HLRC (elastic#92023) Ensure cached time elapses in ClusterServiceIT (elastic#91986) Chunked encoding for RestGetIndicesAction (elastic#92016) Simplify shardsWithState (elastic#91991) [DOCS] Updates ML decider docs by mentioning CPU as scaling criterion (elastic#92018) Add chunking to ClusterState.Custom impls (elastic#91963) Speedup time_series agg by caching current tsid ordinal, parent bucket ordinal and buck ordinal (elastic#91784) Drop the ingest listener call count tracking (elastic#92003) [DOCS] fixes issue number 91889 - missing [discrete] header (elastic#91976) Fix PersistentTasksClusterServiceTests (elastic#92002) [docs] Update search-settings documentation to reflect the fact that the indices.query.bool.max_clause_count setting has been deprecated (elastic#91811) Clarify writability in Netty4HttpPipeliningHandler (elastic#91982) Load stable plugins as synthetic modules (elastic#91869) Handle any exception thrown while generating source for an IngestDocument (elastic#91981) fixing Apache HttpHost url on java-rest doc (elastic#91945) Implement repair functionality for aliases colliding with indices bug (elastic#91887) ...
This avoids needlessly adding the same parent bucket ordinal or TSIDs to
BytesKeyedBucketOrds
.Relates to #74660