Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Introduce per agg filter #113735
ESQL: Introduce per agg filter #113735
Changes from 3 commits
839dbbf
586dabe
a9f4358
2e37927
dd13dc3
4d9039a
8b75cea
db9c98e
99fba42
3d1fe24
d02c654
91d91e6
bda7edd
1686ef3
fc426c4
5a1257a
8ccad9e
23c7c54
217335d
9bccde4
c6cadca
c04be1d
0938d91
21696a6
b10b7d4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: I think it's going to be important to teach users with good examples that
is fundamentally different from
I expect users will try stuff like
which is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't currently verified (which is what the comment in Verifier is about). But wondering if this type of aggregation (with no group) should be invalid at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand correctly; but
is invalid because the
WHERE
inside the aggregation refers to the result of the aggregation. To be correct, the second part of the predicate needs to be moved into a separateWHERE
command:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant aggregations with WHERE are not tested and currently incorrect, at least for those that rely on
COUNT(*)
, likeSUM
.Reproducer:
The query should return 1 as only 1 row satisfies
a > 3
, but it returns4
. That's because we don't propagate the filter into theCOUNT(*)
inSum.surrogate
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be more to it, since
from test | stats count(*) where a > 3
also fails (and samefrom test | stats count(a) where a > 3
), butfrom test | stats c_gt3 = count(*) where a > 3, c_lt3 = count(*) where a < 3
works (and that's how the results for the last added test are actually correct).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because there's no true folding of the aggregation but instead we try to rewrite them into MV_functions + case.
Which fails short when having to evaluate the filter (which might be foldable or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, when the filters are foldable, they evaluate to either true (essentially discarded) or false meaning the agg won't run and can be folded to its initial value, 0 for count and
null
for the other aggs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, let's address the const case later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #115522 to call this out explicitly as it's actually producing incorrect results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More things to test:
BUCKET
in theWHERE
clause. Seems to work for this very simple query:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch however I believe this is a bug. Grouping functions should only be allowed in the BY clause - here it could work it's the same as a grouping however I tried it and we allow a bucket with different field and argument.
/cc @bpintea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantic equality for
BUCKET
s is currently run over the aggs inSTATS
against theGroupingFunction
s in itsBY
clause. But sincecheckInvalidNamedExpressionUsage()
isn't run on allAggregateFunction
children and since the filter-on-agg is now added within theAggregateFunction
itself, the condition isn't caught.So this the check needs to be extended. But IMO it should either be a follow-up of this PR, or part of it, I don't think it's a general issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small improvements to the grammar that don't change the parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uniform serialization across all aggregate functions - we should had that since the beginning to avoid having each subclass add its own repetitive serialization.
/cc @nik9000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's easier to make the filter wrap the aggregation rather than inside each one. I think also that serializing a list in case aggs use it is probably not a great choice either. It's fine, but feels like something we'd do when we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The side effect is all aggs had to be modified to take into account the parent filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we wrap them with something like we do at runtime? It feels like that might work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. My initial approach was to wrap the agg and filter in separate class (FilteredAggregate with two children - aggregate and filter/expression ) however this messed up the optimizer since the rules picked the aggregate functions but had no idea there was a filter associated with it.
This is problematic for scenario where the same agg is defined with different filter:
STATS c = count(). c_f = count() WHERE a > 1
since the where filter is disjoined from the count, the rules replaced the second count with the first one but that is incorrect since they have different execution paths.
Furthermore at compute time the filter and actual agg gets fused anyway so the two are a pair anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do get fused anyway, but it's just a lot more convenient if every agg doesn't have to worry about being filtered. If
filter
is a member here we have to think about it all over the place. It's enough that the fused filter at compute time has to add all the tests for it, I'd love to avoid that bubbling all the way out. OTOH, if that's how it has to be, I'm not deep enough to object.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this means each aggregation evaluates its own filter? If so, from performance perspective, one of the advantages of separating filter from aggregation is that, if the same filter applies to different aggregations, it makes it easier to recognize that it is the same filter and can be evaluated once for all aggregations that share it, in stead of evaluating each filter within each aggregation separately. However I couldn’t figure out how the current infrastructure supports it yet. At first look at this, it reminds me of the (common)table expressions and union all subqueries in SQL, but we don’t have this infrastructure in ES|QL yet, and it could be a much broader scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach doesn't prevent optimizing the filter evaluation.
Each aggregation ends up having its own filter because
count(*)
is different thancount(*) where a > 10
. Initially they were decoupled however the rules considered them the same since the filter was sitting outside the aggregation.When dealing with the same filter, we can assemble a different pipeline so the mask or masked block is reused across multiple aggregations.