Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL: fix stats by constant expresson with alias #117551

Merged
merged 4 commits into from
Nov 26, 2024

Conversation

luigidellaquila
Copy link
Contributor

Fix planning of queries with STATS and grouping aliases

FROM test
| EVAL y = \"a\"
| STATS COUNT() BY x=y
| SORT x

x is now excluded from the references list (it's a generated value), so that the index resolution happens correctly.

Fixes: #114714

@luigidellaquila luigidellaquila added >bug auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v9.0.0 v8.17.0 v8.16.2 v8.18.0 labels Nov 26, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 26, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@@ -511,7 +511,7 @@ static Set<String> fieldNames(LogicalPlan parsed, Set<String> enrichPolicyMatchF
// remove any already discovered UnresolvedAttributes that are in fact aliases defined later down in the tree
// for example "from test | eval x = salary | stats max = max(x) by gender"
// remove the UnresolvedAttribute "x", since that is an Alias defined in "eval"
AttributeSet planRefs = Expressions.references(p.expressions());
AttributeSet planRefs = p.references();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plans have logic to correctly calculate their references, no need to reinvent the wheel here.

In general, IMHO this method is trying to do too much.
The logic should be simple: each plan should declare needed inputs and produced outputs; this method should just start from the top of the plan, visit it down and for each plan node

  • remove the outputs
  • add the required inputs.

All the instanceof above should be encapsulated in the single logical plan classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, yes. I was about to suggest a refactor of fieldNames to make things simpler and more local, similarly to what you described.

However, since we're at a point here where references are not yet resolved, we cannot use the existing query plan methods as we would wish. Esp. the plan's output and outputSet methods do not work here because these require resolved references, and also need to know about what fields are even present in relations. This, and the fact that we need to match references by name rather than by id. This prevents us from just using the simple approach from ProjectAwayColumns.

However, using the references method should be okay at this point because that's something that can be determined locally for a plan node.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix. It turns out, it's the same problem with accidentally accounting for groupings in the Aggregate's references, but in a different place because we use a manual way to determine a plan's output in fieldNames.

@@ -511,7 +511,7 @@ static Set<String> fieldNames(LogicalPlan parsed, Set<String> enrichPolicyMatchF
// remove any already discovered UnresolvedAttributes that are in fact aliases defined later down in the tree
// for example "from test | eval x = salary | stats max = max(x) by gender"
// remove the UnresolvedAttribute "x", since that is an Alias defined in "eval"
AttributeSet planRefs = Expressions.references(p.expressions());
AttributeSet planRefs = p.references();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, yes. I was about to suggest a refactor of fieldNames to make things simpler and more local, similarly to what you described.

However, since we're at a point here where references are not yet resolved, we cannot use the existing query plan methods as we would wish. Esp. the plan's output and outputSet methods do not work here because these require resolved references, and also need to know about what fields are even present in relations. This, and the fact that we need to match references by name rather than by id. This prevents us from just using the simple approach from ProjectAwayColumns.

However, using the references method should be okay at this point because that's something that can be determined locally for a plan node.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having a second look at this issue @luigidellaquila.
The big advantage of fieldNames is that it can be unit tested. Please, add some tests (be creative with queries that can break the logic in fieldNames) in IndexResolverFieldNamesTests.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Thank you for adding the unit tests, @luigidellaquila
LGTM

@luigidellaquila
Copy link
Contributor Author

Thanks for the reviews @astefan @alex-spies !

@luigidellaquila luigidellaquila enabled auto-merge (squash) November 26, 2024 15:57
@luigidellaquila luigidellaquila merged commit b22d185 into elastic:main Nov 26, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.17 Commit could not be cherrypicked due to conflicts
8.16 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117551

@luigidellaquila
Copy link
Contributor Author

Backporting manually

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.2 v8.17.0 v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: STATS BY expression not working with expression from EVAL
4 participants