Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup time_series agg by caching current tsid ordinal, parent bucket ordinal and buck ordinal #91784

Merged
merged 5 commits into from
Nov 30, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -92,15 +92,34 @@ protected void doClose() {
protected LeafBucketCollector getLeafCollector(AggregationExecutionContext aggCtx, LeafBucketCollector sub) throws IOException {
return new LeafBucketCollectorBase(sub, null) {

// Keeping track of these fields helps to reduce time spent attempting to add bucket + tsid combos that already were added.
long currentTsidOrd = -1;
long currentBucket = -1;
long currentBucketOrdinal;

@Override
public void collect(int doc, long bucket) throws IOException {
// Naively comparing bucket against currentBucket and tsid ord to currentBucket can work really well.
// TimeSeriesIndexSearcher ensures that docs are emitted in tsid and timestamp order, so if tsid ordinal
// changes to what is stored in currentTsidOrd then that ordinal well never occur again. Same applies
// currentBucket if there is no parent aggregation or the immediate parent aggregation creates buckets
// based on @timestamp field or dimension fields (fields that make up the tsid).
if (currentBucket == bucket && currentTsidOrd == aggCtx.getTsidOrd()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this aggregation is always called with bucket == 0 so maybe remove this currentBucket == bucket condition here and replace it with an assertion that bucket is zero?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is only the case if time_series aggregation is a top level aggregation.
I don't think this is always the case and for example a date_histogram aggregation is a likely parent aggregation in the case of time_series aggregation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense. Then maybe the currentBucket == bucket condition is a bit fragile as it would prevent the optimization from kicking in if this collector is called with different bucket ordinals even though it the TSID doesn't change. What about changing currentTsidOrd and currentBucketOrdinal to arrays that are keyed by bucket, ie. tracking separately the current TSID and bucket ordinal per bucket?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the case we know there isn't a parent aggregation then we can return a leaf bucket collector that only does the currentTsidOrd == aggCtx.getTsidOrd() check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about changing currentTsidOrd and currentBucketOrdinal to arrays that are keyed by bucket, ie. tracking separately the current TSID and bucket ordinal per bucket?

We'd need a BigArray thing because we can have many buckets above us and we'd want to track the memory. It could be useful though. If you ever "come back" to the same bucket. But I don't think, say, date_histogram will. I think once it's finished collecting some date for the time series it'll never come back.

I think the case we know there isn't a parent aggregation then we can return a leaf bucket collector that only does the currentTsidOrd == aggCtx.getTsidOrd() check.

That's what the CardinalityUpperBound bucketCardinality argument to the ctor is for. If it's exactly one you know you don't have a parent agg or it always makes a single bucket.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually things should be fine with a terms aggregation as a parent too, because almost all the time this terms aggregation would run on a field that is a dimension or a tag. So bucket would only change when tsidOrd changes too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that is true if the field of the terms agg is a dimension field. But maybe not in the case if the field is a non dimension field (label field)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. My intuition is that we don't have to care much about this case of labels that could take different values for the same TSID (as long as the aggregation output is correct, just less optimized), do we?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is that we don't have to care much about this case of labels that could take different values for the same TSID (as long as the aggregation output is correct, just less optimized), do we?

This is what I think as well. This change favours parent aggs based on dimension fields, @timestamp or no parent aggregation. For the other cases, this change doesn't help, but I think doesn't hurt either. We are always correct, since we then fall back to checking whether we have seen tsid & bucket combination in BytesKeyedBucketOrds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment: f23de49

collectExistingBucket(sub, doc, currentBucketOrdinal);
return;
}

long bucketOrdinal = bucketOrds.add(bucket, aggCtx.getTsid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a tsidOrd can we use that as a key instead of the bytes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need the tsid as bytes in the buildAggregations() method and I don't think we can use this tsidord that TimeSeriesIndexSearcher generates as a lookup to get the tsid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this tsidord is neither a segment ordinal nor a global ordinal but an ordinal for TSIDs that intersect with the query on this shard. So it can't be used as a key to retrieve TSIDs from the terms dictionary. (Separately, maybe we should rename this method to reduce chances that someone mistakenly uses it as an ordinal for the _tsid terms dict? I worry a bit that it would look like it works for common cases like the query only consists of a range query on the @timestamp and this time range happens to match every unique TSID, so tests might not even catch the problem if someone started doing this?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename the method in a follow up change?

if (bucketOrdinal < 0) { // already seen
bucketOrdinal = -1 - bucketOrdinal;
collectExistingBucket(sub, doc, bucketOrdinal);
} else {
collectBucket(sub, doc, bucketOrdinal);
}

currentBucketOrdinal = bucketOrdinal;
currentTsidOrd = aggCtx.getTsidOrd();
currentBucket = bucket;
}
};
}
Expand Down