Speedup time_series agg by caching current tsid ordinal, parent bucket ordinal and buck ordinal #91784
Changes from all commits
7157974
5cfd652
f23de49
f3edf86
d395a01
@@ -92,15 +92,34 @@ protected void doClose() {
    protected LeafBucketCollector getLeafCollector(AggregationExecutionContext aggCtx, LeafBucketCollector sub) throws IOException {
        return new LeafBucketCollectorBase(sub, null) {

            // Keeping track of these fields helps to reduce time spent attempting to add bucket + tsid combos that were already added.
            long currentTsidOrd = -1;
            long currentBucket = -1;
            long currentBucketOrdinal;

            @Override
            public void collect(int doc, long bucket) throws IOException {
                // Naively comparing bucket against currentBucket and tsid ord against currentTsidOrd can work really well.
                // TimeSeriesIndexSearcher ensures that docs are emitted in tsid and timestamp order, so once the tsid ordinal
                // differs from what is stored in currentTsidOrd, the stored ordinal will never occur again. The same applies
                // to currentBucket if there is no parent aggregation or the immediate parent aggregation creates buckets
                // based on the @timestamp field or dimension fields (fields that make up the tsid).
                if (currentBucket == bucket && currentTsidOrd == aggCtx.getTsidOrd()) {
                    collectExistingBucket(sub, doc, currentBucketOrdinal);
                    return;
                }

                long bucketOrdinal = bucketOrds.add(bucket, aggCtx.getTsid());

If we have a tsidOrd can we use that as a key instead of the bytes?

So we need the tsid as bytes in the

Indeed this tsidord is neither a segment ordinal nor a global ordinal but an ordinal for TSIDs that intersect with the query on this shard. So it can't be used as a key to retrieve TSIDs from the terms dictionary. (Separately, maybe we should rename this method to reduce chances that someone mistakenly uses it as an ordinal for the

Let's rename the method in a follow up change?


                if (bucketOrdinal < 0) { // already seen
                    bucketOrdinal = -1 - bucketOrdinal;
                    collectExistingBucket(sub, doc, bucketOrdinal);
                } else {
                    collectBucket(sub, doc, bucketOrdinal);
                }

                currentBucketOrdinal = bucketOrdinal;
                currentTsidOrd = aggCtx.getTsidOrd();
                currentBucket = bucket;
            }
        };
    }

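The caching idea in the diff can be tried in isolation. The following is a minimal sketch, not the Elasticsearch implementation: `CachedOrdsSketch` and its members are hypothetical names, and a plain `HashMap` stands in for the `bucketOrds` structure, assuming only the convention visible in the diff that `add` returns `-1 - ordinal` when the key already exists.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names are Elasticsearch APIs.
final class CachedOrdsSketch {
    // Stand-in for bucketOrds: maps a (bucket, tsid) pair to a dense ordinal.
    private final Map<String, Long> ords = new HashMap<>();
    long lookups = 0;      // how often the slow path (map lookup) ran
    long fastPathHits = 0; // how often the cached ordinal was reused

    private long currentTsidOrd = -1;
    private long currentBucket = -1;
    private long currentBucketOrdinal;

    // Returns a new ordinal, or -1 - ordinal if the key was already present.
    private long add(long bucket, String tsid) {
        lookups++;
        String key = bucket + "|" + tsid;
        Long existing = ords.get(key);
        if (existing != null) {
            return -1 - existing;
        }
        long ord = ords.size();
        ords.put(key, ord);
        return ord;
    }

    // Resolves the ordinal for one document, skipping the lookup when
    // neither the bucket nor the tsid ordinal changed since the last call.
    long collect(long bucket, long tsidOrd, String tsid) {
        if (currentBucket == bucket && currentTsidOrd == tsidOrd) {
            fastPathHits++;
            return currentBucketOrdinal;
        }
        long ord = add(bucket, tsid);
        if (ord < 0) { // already seen
            ord = -1 - ord;
        }
        currentBucketOrdinal = ord;
        currentTsidOrd = tsidOrd;
        currentBucket = bucket;
        return ord;
    }

    public static void main(String[] args) {
        CachedOrdsSketch c = new CachedOrdsSketch();
        // Docs arrive in tsid order: three for tsid "a", then two for tsid "b".
        String[] tsids = {"a", "a", "a", "b", "b"};
        long[] tsidOrds = {0, 0, 0, 1, 1};
        for (int i = 0; i < tsids.length; i++) {
            System.out.println(c.collect(0, tsidOrds[i], tsids[i]));
        }
        System.out.println("lookups=" + c.lookups + " fastPathHits=" + c.fastPathHits);
    }
}
```

With documents sorted by tsid, the map is consulted once per time series rather than once per document, which is where the speedup would come from.
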
It looks like this aggregation is always called with `bucket == 0`, so maybe remove this `currentBucket == bucket` condition here and replace it with an assertion that `bucket` is zero?

I think that is only the case if the `time_series` aggregation is a top-level aggregation. I don't think this is always the case; for example, a `date_histogram` aggregation is a likely parent aggregation of a `time_series` aggregation.

Ah, that makes sense. Then maybe the `currentBucket == bucket` condition is a bit fragile, as it would prevent the optimization from kicking in if this collector is called with different bucket ordinals even though the TSID doesn't change. What about changing `currentTsidOrd` and `currentBucketOrdinal` to arrays that are keyed by `bucket`, i.e. tracking the current TSID and bucket ordinal separately per bucket?

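The per-bucket suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: `PerBucketCacheSketch` and its members are hypothetical, a `HashMap` stands in for the real ordinals structure, plain `long[]` arrays are used with no memory accounting, and `bucket` is narrowed to `int` so it can index an array.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names are Elasticsearch APIs.
final class PerBucketCacheSketch {
    private final Map<String, Long> ords = new HashMap<>();
    long lookups = 0; // how often the slow path (map lookup) ran

    // Last tsid ordinal and resolved ordinal seen for each parent bucket.
    private long[] lastTsidOrd = new long[0];
    private long[] lastBucketOrdinal = new long[0];

    private void ensureCapacity(int bucket) {
        if (bucket >= lastTsidOrd.length) {
            int oldLen = lastTsidOrd.length;
            int newLen = Math.max(bucket + 1, oldLen * 2);
            lastTsidOrd = Arrays.copyOf(lastTsidOrd, newLen);
            lastBucketOrdinal = Arrays.copyOf(lastBucketOrdinal, newLen);
            Arrays.fill(lastTsidOrd, oldLen, newLen, -1L); // -1 means nothing cached yet
        }
    }

    long collect(int bucket, long tsidOrd, String tsid) {
        ensureCapacity(bucket);
        if (lastTsidOrd[bucket] == tsidOrd) {
            return lastBucketOrdinal[bucket]; // cache hit for this bucket
        }
        lookups++;
        String key = bucket + "|" + tsid;
        Long existing = ords.get(key);
        long ord;
        if (existing != null) {
            ord = existing;
        } else {
            ord = ords.size();
            ords.put(key, ord);
        }
        lastTsidOrd[bucket] = tsidOrd;
        lastBucketOrdinal[bucket] = ord;
        return ord;
    }

    public static void main(String[] args) {
        PerBucketCacheSketch c = new PerBucketCacheSketch();
        // Same tsid, parent bucket alternating between 0 and 1.
        int[] buckets = {0, 1, 0, 1};
        for (int b : buckets) {
            System.out.println(c.collect(b, 0, "a"));
        }
        System.out.println("lookups=" + c.lookups);
    }
}
```

In this sketch, alternating buckets for the same TSID still hit the cache after the first visit to each bucket, at the cost of two arrays that grow with the number of parent buckets.
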
I think that in the case where we know there isn't a parent aggregation, we can return a leaf bucket collector that only does the `currentTsidOrd == aggCtx.getTsidOrd()` check.

We'd need a BigArray thing because we can have many buckets above us and we'd want to track the memory. It could be useful though, if you ever "come back" to the same bucket. But I don't think, say, `date_histogram` will. I think once it's finished collecting some date for the time series it'll never come back.

That's what the `CardinalityUpperBound bucketCardinality` argument to the ctor is for. If it's exactly one you know you don't have a parent agg or it always makes a single bucket.

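Picking the collector from the bucket cardinality might be sketched as follows. Everything here is hypothetical: the `Cardinality` enum is a simplified stand-in for `CardinalityUpperBound` and does not mirror its real API, and a `HashMap` again replaces the real ordinals structure.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; every name here is hypothetical.
final class CollectorChoiceSketch {
    // Simplified stand-in for the bucket cardinality passed to the constructor.
    enum Cardinality { ONE, MANY }

    interface Collector {
        long collect(long bucket, long tsidOrd, String tsid);
    }

    private final Map<String, Long> ords = new HashMap<>();
    long lookups = 0; // how often the slow path (map lookup) ran

    private long addOrGet(long bucket, String tsid) {
        lookups++;
        String key = bucket + "|" + tsid;
        Long existing = ords.get(key);
        if (existing != null) {
            return existing;
        }
        long ord = ords.size();
        ords.put(key, ord);
        return ord;
    }

    Collector newCollector(Cardinality cardinality) {
        if (cardinality == Cardinality.ONE) {
            // At most one parent bucket: only the tsid ordinal needs checking.
            return new Collector() {
                private long currentTsidOrd = -1;
                private long currentOrdinal;

                @Override
                public long collect(long bucket, long tsidOrd, String tsid) {
                    if (currentTsidOrd != tsidOrd) {
                        currentOrdinal = addOrGet(bucket, tsid);
                        currentTsidOrd = tsidOrd;
                    }
                    return currentOrdinal;
                }
            };
        }
        // Possibly many parent buckets: check both bucket and tsid ordinal.
        return new Collector() {
            private long currentTsidOrd = -1;
            private long currentBucket = -1;
            private long currentOrdinal;

            @Override
            public long collect(long bucket, long tsidOrd, String tsid) {
                if (currentBucket != bucket || currentTsidOrd != tsidOrd) {
                    currentOrdinal = addOrGet(bucket, tsid);
                    currentTsidOrd = tsidOrd;
                    currentBucket = bucket;
                }
                return currentOrdinal;
            }
        };
    }
}
```

The single-bucket variant saves one comparison per document; whether that is measurable in practice is not something this sketch can show.
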
Actually things should be fine with a `terms` aggregation as a parent too, because almost all the time this `terms` aggregation would run on a field that is a dimension or a tag. So `bucket` would only change when `tsidOrd` changes too?

Yes, I think that is true if the field of the `terms` agg is a dimension field. But maybe not if the field is a non-dimension field (label field)?

Right. My intuition is that we don't have to care much about this case of labels that could take different values for the same TSID (as long as the aggregation output is correct, just less optimized), do we?

This is what I think as well. This change favours parent aggs based on dimension fields, `@timestamp`, or no parent aggregation. For the other cases, this change doesn't help, but I think it doesn't hurt either. We are always correct, since we then fall back to checking whether we have seen the tsid & bucket combination in `BytesKeyedBucketOrds`.

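The "doesn't help, but doesn't hurt" case can be illustrated with a small sketch. It assumes a parent aggregation on a label field that alternates buckets within one TSID; `FallbackSketch` and its members are hypothetical, and a `HashMap` stands in for `BytesKeyedBucketOrds`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names are Elasticsearch APIs.
final class FallbackSketch {
    private final Map<String, Long> ords = new HashMap<>();
    private final boolean useCache;
    long fastPathHits = 0; // how often the cached ordinal was reused

    private long currentTsidOrd = -1;
    private long currentBucket = -1;
    private long currentOrdinal;

    FallbackSketch(boolean useCache) {
        this.useCache = useCache;
    }

    long collect(long bucket, long tsidOrd, String tsid) {
        if (useCache && currentBucket == bucket && currentTsidOrd == tsidOrd) {
            fastPathHits++;
            return currentOrdinal;
        }
        // Fallback: the map decides whether this combination was seen before.
        String key = bucket + "|" + tsid;
        Long existing = ords.get(key);
        long ord;
        if (existing != null) {
            ord = existing;
        } else {
            ord = ords.size();
            ords.put(key, ord);
        }
        currentOrdinal = ord;
        currentTsidOrd = tsidOrd;
        currentBucket = bucket;
        return ord;
    }

    // Collects every bucket for a single time series and returns the ordinals.
    long[] collectAll(long[] buckets, long tsidOrd, String tsid) {
        long[] out = new long[buckets.length];
        for (int i = 0; i < buckets.length; i++) {
            out[i] = collect(buckets[i], tsidOrd, tsid);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] buckets = {0, 1, 0, 1}; // label value flips within one TSID
        FallbackSketch cached = new FallbackSketch(true);
        FallbackSketch plain = new FallbackSketch(false);
        System.out.println(java.util.Arrays.toString(cached.collectAll(buckets, 0, "a")));
        System.out.println(java.util.Arrays.toString(plain.collectAll(buckets, 0, "a")));
        System.out.println("fastPathHits=" + cached.fastPathHits);
    }
}
```

Both variants resolve the same ordinals; the cached one simply never takes the fast path for this input.
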
I added a comment: f23de49