ESQL: Compute support for filtering grouping aggs #112476

nik9000 · 2024-09-03T18:11:36Z

Adds support to the compute engine for filtering which positions are processed by grouping aggs. This should allow syntax like

| STATS
       success = COUNT(*) WHERE 200 <= response_code AND response_code < 300,
      redirect = COUNT(*) WHERE 300 <= response_code AND response_code < 400,
    client_err = COUNT(*) WHERE 400 <= response_code AND response_code < 500,
    server_err = COUNT(*) WHERE 500 <= response_code AND response_code < 600,
   total_count = COUNT(*)
  BY hostname

We could translate the WHERE expression into an ExpressionEvaluator and run it, then plug it into the filtering support added in this PR.

The actual filtering is done by a FilteredGroupingAggregatorFunction, which wraps a regular GroupingAggregatorFunction. It first executes the filter against the incoming Page, then nulls any positions in the group that don't match, and finally passes the resulting groups into the real aggregator. When the real grouping aggregator implementation sees a null value for a group, it skips collecting that position.
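
A minimal sketch of the wrapping described above, using plain Java arrays in place of the compute engine's Page and Block types. Every name in it (FilteredCountSketch, GroupingCount, Filtered, countSuccess) is hypothetical and not taken from the PR; a null Integer stands in for a nulled group position.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

class FilteredCountSketch {
    /** Stand-in for a grouping aggregator: counts rows per group id and skips null groups. */
    static class GroupingCount {
        final Map<Integer, Long> counts = new HashMap<>();

        void add(Integer[] groups) {
            for (Integer g : groups) {
                if (g == null) {
                    continue; // null group: the position was filtered out, so don't collect it
                }
                counts.merge(g, 1L, Long::sum);
            }
        }
    }

    /** Wrapper: evaluates the filter per position, nulls non-matching groups, then delegates. */
    static class Filtered {
        final IntPredicate filter;
        final GroupingCount next;

        Filtered(IntPredicate filter, GroupingCount next) {
            this.filter = filter;
            this.next = next;
        }

        void add(Integer[] groups, int[] values) {
            Integer[] masked = new Integer[groups.length];
            for (int p = 0; p < groups.length; p++) {
                masked[p] = filter.test(values[p]) ? groups[p] : null;
            }
            next.add(masked);
        }
    }

    /** Roughly COUNT(*) WHERE 200 <= response_code AND response_code < 300, per group. */
    static Map<Integer, Long> countSuccess(Integer[] groups, int[] responseCodes) {
        GroupingCount count = new GroupingCount();
        new Filtered(code -> 200 <= code && code < 300, count).add(groups, responseCodes);
        return count.counts;
    }
}
```

The wrapped aggregator never learns about the filter. It only sees null groups, which it already had to handle.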

We had to make two changes to every agg for this to work:

1. Add a method to force local group tracking mode on any aggregator. Previously this was only required if the agg encountered null values, but when we're filtering aggs we can no longer trust the seen parameter we get when building the result. This local group tracking mode lets us track what we've actually seen locally.
2. Add Releasable to the AddInput thing we use to handle chunked pages in grouping aggs. This is required because the results of the filter must be closed on completion.

Both of these are fairly trivial changes, but require touching every aggregation.
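
To illustrate the second change, here is a hedged sketch of why the chunk callback needs a close hook. It is not the PR's code: java.lang.AutoCloseable stands in for Releasable, and Mask, filtered, run, and the -1 marker for a filtered position are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReleasableAddInputSketch {
    /** Stand-in for the chunked-page callback; AutoCloseable plays the role of Releasable. */
    interface AddInput extends AutoCloseable {
        void add(int positionOffset, int[] groupIds);

        @Override
        void close(); // narrowed so callers don't have to handle a checked exception
    }

    /** Stand-in for the filter result, which holds memory that must be released. */
    static class Mask implements AutoCloseable {
        final boolean[] keep;
        final List<String> log;

        Mask(boolean[] keep, List<String> log) {
            this.keep = keep;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("mask closed");
        }
    }

    /** Wraps the delegate's AddInput and owns the mask until the whole page is processed. */
    static AddInput filtered(Mask mask, AddInput next) {
        return new AddInput() {
            @Override
            public void add(int positionOffset, int[] groupIds) {
                int[] masked = new int[groupIds.length];
                for (int i = 0; i < groupIds.length; i++) {
                    // -1 marks a filtered position here; the real code nulls it instead
                    masked[i] = mask.keep[positionOffset + i] ? groupIds[i] : -1;
                }
                next.add(positionOffset, masked);
            }

            @Override
            public void close() {
                try {
                    next.close();
                } finally {
                    mask.close(); // released only after every chunk has been added
                }
            }
        };
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        AddInput sink = new AddInput() {
            @Override
            public void add(int positionOffset, int[] groupIds) {
                log.add("chunk@" + positionOffset + "=" + Arrays.toString(groupIds));
            }

            @Override
            public void close() {
                log.add("sink closed");
            }
        };
        Mask mask = new Mask(new boolean[] { true, false, true, true }, log);
        try (AddInput add = filtered(mask, sink)) {
            add.add(0, new int[] { 0, 1 });
            add.add(2, new int[] { 1, 0 });
        }
        return log;
    }
}
```

The mask has to outlive every chunk of the page, so the natural owner is the object that receives the chunks. That is why the close hook sits on the callback rather than on the aggregator.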


elasticsearchmachine · 2024-09-03T18:12:06Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2024-09-03T18:12:01Z

.../src/main/java/org/elasticsearch/compute/aggregation/FilteredAggregatorFunctionSupplier.java

+
+    @Override
+    public AggregatorFunction aggregator(DriverContext driverContext) {
+        throw new UnsupportedOperationException("TODO");


Tracked #111439

nik9000 · 2024-09-03T18:12:09Z

.../src/main/java/org/elasticsearch/compute/aggregation/FilteredGroupingAggregatorFunction.java

+    public AddInput prepareProcessPage(SeenGroupIds seenGroupIds, Page page) {
+        try (BooleanBlock filterResult = ((BooleanBlock) filter.eval(page))) {
+            ToMask mask = filterResult.toMask();
+            // TODO warn on mv fields


Tracked #111439

nik9000 · 2024-09-03T18:13:09Z

...n/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java

@@ -150,18 +150,23 @@ private void end() {
                    hashStart = System.nanoTime();
                    aggregationNanos += hashStart - aggStart;
                }
+


It's just implementing close and calling it properly.

nik9000 · 2024-09-03T18:13:20Z

...test/java/org/elasticsearch/compute/aggregation/FilteredGroupingAggregatorFunctionTests.java

+public class FilteredGroupingAggregatorFunctionTests extends GroupingAggregatorFunctionTestCase {
+    private final List<Exception> unclosed = Collections.synchronizedList(new ArrayList<>());
+
+    // TODO some version of this test that applies across all aggs


Tracked #111439

costin · 2024-09-03T23:25:14Z

Not sure if this is the right time, but I'm wondering what the performance impact of these changes is on the aggs, i.e. what do the microbenchmarks show?

Separately, to confirm: each filter will create another filtering mask while reusing the underlying grouping data, so the number of filters is not going to increase the amount of transient data, correct?
Keeping the filters separate is useful when different aggregates use the same filter: the planner can then detect the duplicate filter evaluation and do it only once, regardless of how many aggs want to use it.

ivancea

Intimidating PR at first; quite simple in the end. I guess we have to rethink removing the autogenerated files :hehe:

LGTM, looking forward to the next steps!

ivancea · 2024-09-06T15:27:52Z

.../compute/src/main/java/org/elasticsearch/compute/aggregation/GroupingAggregatorFunction.java

+     * track which group ids have been seen, even if that increases the
+     * overhead.
+     */
+    void selectedMayContainUnseenGroups(SeenGroupIds seenGroupIds);


What's the idea behind the SeenGroupIds parameter? Is it just an "In case you weren't tracking them, here you have all the seen groups until now"?

Yeah, it's precisely that. Lots of aggs, like MAX, don't track what they've seen unless they have to. This is, sadly, another time when they have to track it.
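
A rough sketch of that lazy tracking, under the assumption that a MAX-style agg keeps one slot per group and only allocates a seen set when told to. MaxByGroup, enableGroupTracking, and demo are hypothetical names; enableGroupTracking only loosely mirrors selectedMayContainUnseenGroups(SeenGroupIds).

```java
import java.util.Arrays;
import java.util.BitSet;

class LazySeenTrackingSketch {
    static class MaxByGroup {
        private long[] max = new long[0];
        private BitSet seen; // stays null until tracking is forced

        /** Start tracking locally, seeded with whatever has been seen so far. */
        void enableGroupTracking(BitSet seenSoFar) {
            if (seen == null) {
                seen = (BitSet) seenSoFar.clone();
            }
        }

        void add(int group, long value) {
            if (group >= max.length) {
                int old = max.length;
                max = Arrays.copyOf(max, group + 1);
                Arrays.fill(max, old, max.length, Long.MIN_VALUE);
            }
            max[group] = Math.max(max[group], value);
            if (seen != null) {
                seen.set(group);
            }
        }

        /** Returns null for a group this agg never collected a value for. */
        Long result(int group) {
            if (seen != null && seen.get(group) == false) {
                return null;
            }
            // Without tracking, the caller is trusted to ask only about groups that have values.
            return group < max.length ? max[group] : Long.MIN_VALUE;
        }
    }

    static Long[] demo() {
        MaxByGroup agg = new MaxByGroup();
        agg.enableGroupTracking(new BitSet());
        agg.add(0, 7);
        agg.add(2, 3); // group 1 exists in the hash, but all of its rows were filtered out
        return new Long[] { agg.result(0), agg.result(1), agg.result(2) };
    }
}
```

With filtering in place, a group can exist in the hash while this particular agg never saw a row for it. The slot for group 1 holds only the fill value, so the agg needs its own record of what it actually collected.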

ivancea · 2024-09-06T15:47:12Z

.../src/main/java/org/elasticsearch/compute/aggregation/FilteredAggregatorFunctionSupplier.java

+
+    @Override
+    public GroupingAggregatorFunction groupingAggregator(DriverContext driverContext) {
+        GroupingAggregatorFunction next = this.next.groupingAggregator(driverContext);


nit: Should this assignment be moved inside the try, like with the filter?
I guess this shouldn't fail, but it looks odd being the only piece not in the try. And it's very similar to filter.get(...)

I could assign it to null and overwrite it. I don't believe it's needed, but it doesn't hurt. If this throws then the next will just be null and it's the responsibility of the method call itself to clean anything.
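
The "assign it to null and overwrite it" variant could look roughly like the sketch below. This is an illustration only, not the code from the PR: Resource, Pair, and build are hypothetical, and AutoCloseable stands in for whatever release mechanism the real supplier uses.

```java
import java.util.function.Supplier;

class SupplierCleanupSketch {
    static class Resource implements AutoCloseable {
        final String name;
        boolean closed;

        Resource(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Owns both parts once construction has succeeded. */
    static class Pair implements AutoCloseable {
        final Resource next;
        final Resource filter;

        Pair(Resource next, Resource filter) {
            this.next = next;
            this.filter = filter;
        }

        @Override
        public void close() {
            try {
                next.close();
            } finally {
                filter.close();
            }
        }
    }

    static Pair build(Supplier<Resource> nextSupplier, Supplier<Resource> filterSupplier) {
        Resource next = null;
        Resource filter = null;
        boolean success = false;
        try {
            // Both acquisitions sit inside the try, so a failure in either one is cleaned up.
            next = nextSupplier.get();
            filter = filterSupplier.get();
            Pair result = new Pair(next, filter);
            success = true;
            return result;
        } finally {
            if (success == false) {
                // If nextSupplier itself threw, next is still null and there is nothing to close.
                if (next != null) {
                    next.close();
                }
                if (filter != null) {
                    filter.close();
                }
            }
        }
    }
}
```

If the first supplier throws, nothing has been assigned yet, which matches the point that the failing call is responsible for cleaning up after itself.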

ivancea · 2024-09-06T16:13:23Z

...test/java/org/elasticsearch/compute/aggregation/FilteredGroupingAggregatorFunctionTests.java

+            new EvalOperator.ExpressionEvaluator.Factory() {
+                @Override
+                public EvalOperator.ExpressionEvaluator get(DriverContext context) {
+                    Exception tracker = new Exception(Integer.toString(unclosed.size()));
+                    unclosed.add(tracker);
+                    return new EvalOperator.ExpressionEvaluator() {
+                        @Override
+                        public Block eval(Page page) {
+                            IntBlock ints = page.getBlock(inputChannels.get(0));
+                            try (
+                                BooleanVector.FixedBuilder result = context.blockFactory()
+                                    .newBooleanVectorFixedBuilder(ints.getPositionCount())
+                            ) {
+                                position: for (int p = 0; p < ints.getPositionCount(); p++) {
+                                    int start = ints.getFirstValueIndex(p);
+                                    int end = start + ints.getValueCount(p);
+                                    for (int i = start; i < end; i++) {
+                                        if (ints.getInt(i) > 0) {
+                                            result.appendBoolean(p, true);
+                                            continue position;
+                                        }
+                                    }
+                                    result.appendBoolean(p, false);
+                                }
+                                return result.build().asBlock();
+                            }
+                        }
+
+                        @Override
+                        public void close() {
+                            if (unclosed.remove(tracker) == false) {
+                                throw new IllegalStateException("close failure!");
+                            }
+                        }
+
+                        @Override
+                        public String toString() {
+                            return "any > 0";
+                        }
+                    };
+                }


I need to confirm this: We have this big chunk of code instead of an evaluator(new GreaterThan(...)) just because of the "unclosed" tracking?

If that's it, I wonder if it would be worth it to move this to a named, nested non-static class at the end. To simplify reading this, and document what this class is about

This can't share code with the ESQL GreaterThan because:

1. It's an any greater than. That makes the test more interesting.
2. It can't see that code.
3. The "unclosed" thing

But I can totally move this to a static class at the end of the file. Or a top level class. But a static class in the file feels a little better.
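
For readers unfamiliar with the "any greater than" semantics over multivalued positions, here is a small sketch with plain arrays. The layout, where each position owns the slice of values starting at firstValueIndex[p] with length valueCount[p], mirrors what the test evaluator reads from the IntBlock. The class and method names are hypothetical.

```java
class AnyGreaterThanSketch {
    /**
     * For each position, returns true if any of its values is greater than zero.
     * A position with zero values never matches.
     */
    static boolean[] anyPositive(int[] values, int[] firstValueIndex, int[] valueCount) {
        boolean[] result = new boolean[firstValueIndex.length];
        for (int p = 0; p < result.length; p++) {
            int start = firstValueIndex[p];
            int end = start + valueCount[p];
            for (int i = start; i < end; i++) {
                if (values[i] > 0) {
                    result[p] = true;
                    break; // one match is enough for this position
                }
            }
        }
        return result;
    }
}
```

A plain single-valued greater-than would not exercise the multivalue path, which is part of what makes the test evaluator more interesting.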

ivancea · 2024-09-06T16:14:32Z

.../src/test/java/org/elasticsearch/compute/aggregation/GroupingAggregatorFunctionTestCase.java

 public abstract class GroupingAggregatorFunctionTestCase extends ForkingOperatorTestCase {
    protected abstract AggregatorFunctionSupplier aggregatorFunction(List<Integer> inputChannels);

    protected final int aggregatorIntermediateBlockCount() {
-        try (var agg = aggregatorFunction(List.of()).aggregator(driverContext())) {
+        try (var agg = aggregatorFunction(List.of()).groupingAggregator(driverContext())) {


I suppose this worked because all our aggregators have the same amount of intermediate blocks (?)

And because they actually implemented aggregator. I've left that as a TODO for this PR. I didn't mind flipping this. And, yeah, they do have the same intermediate block layout. It'd be funky for them not to.

nik9000 added >non-issue :Analytics/ES|QL AKA ESQL v8.16.0 labels Sep 3, 2024

nik9000 requested review from ivancea and alex-spies September 3, 2024 18:11

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 3, 2024

nik9000 commented Sep 3, 2024


ivancea approved these changes Sep 6, 2024


nik9000 added 2 commits September 9, 2024 12:10

Merge branch 'main' into filtered_aggs_1

7c4c2f1

Flip

fd35874

nik9000 requested a review from a team as a code owner September 9, 2024 16:46

nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024

Do it this way apparentely

c7094c5

elasticsearchmachine merged commit 72248e3 into elastic:main Sep 9, 2024
15 checks passed

nik9000 deleted the filtered_aggs_1 branch September 9, 2024 18:00

nik9000 mentioned this pull request Sep 11, 2024

ESQL: Add pre and post filter for grouping operator #111439

Open
