Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Validate unique plan attribute names #110488

Merged

Conversation

alex-spies
Copy link
Contributor

@alex-spies alex-spies commented Jul 4, 2024

Fix #110541

There's an unwritten invariant in ES|QL that's worth spelling out: a command's/logical plan's output attributes need to have unique names. In contrast to SQL, there's no way to disambiguate in cases like

ROW a = 1, a = 2 | eval b = 2*a

because we do not have qualifiers, like some_table.a. Instead, ES|QL performs variable shadowing, like in EVAL, where the rightmost assignment wins:

| EVAL x = 2*some_field, y = -some_other_field, x = x + y

EVAL's output will have the attributes y and x - the rightmost one.

Let's enforce this invariant in our dependency checker; there's only one deviation from it, currently, which is ROW - I think this is a bug which this PR also fixes.

This is only enforced for logical plans; the dependency checker for physical plans is yet to be enabled as part of #105436.

Depends on #110793

@elasticsearchmachine elasticsearchmachine added v8.16.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Jul 4, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @alex-spies, I've created a changelog YAML for you.

@alex-spies
Copy link
Contributor Author

Blocked by #110490

+ refs
)
);
throw new IllegalStateException("Reference [" + ua.qualifiedName() + "] is ambiguous; " + "matches any of " + refs);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ambiguities were possible with row earlier; the message suggesting disambiguation does not make sense - that's not possible. This was carried over from ql and made sense for SQL. If we have ambiguities in ESQL, that's a bug IMO.

@@ -69,13 +69,6 @@ a:integer | b:integer | c:null | z:integer
1 | 2 | null | null
;

evalRowWithNull2
row a = 1, null, b = 2, c = null, null | eval z = a+b;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one had ambiguous attribute names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider this a bug fix or a breaking change? (IMHO it's a bug, but I see it's questionable)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, this is a bug, since there's no way to refer to the attributes named null in following commands.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding row yes it seems there is an inconsistency between it and the rest of the commands. But first, let's make sure we cover all the commands from the point of view of attribute names uniqueness and if all stands then we can change row:

  • rename a as foo, b as foo keeps the last foo
  • look into enrich, grok, dissect, mv_expand regarding names uniqueness
  • see if any wildcarded names (where they are allowed - keep, drop maybe some other places) don't trip the unique names verification
  • combine this change with whatever needs to change in union types so that the whole story makes sense project-wide
  • look into fields and sub-fields where the name have dots - rootfield.subfield.subsubfield - we usually don't test these and that's a pity

@elasticsearchmachine
Copy link
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

@alex-spies alex-spies marked this pull request as draft July 5, 2024 16:47
Set<String> outputAttributeNames = new HashSet<>();
Set<NameId> outputAttributeIds = new HashSet<>();
for (Attribute outputAttr : p.output()) {
if (outputAttributeNames.add(outputAttr.name()) == false || outputAttributeIds.add(outputAttr.id()) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.name() -> .qualifiedName()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I changed my mind on this:

It is the actual name that needs to be unique; otherwise, bugs could slip in because we somehow end up using qualifiers on accident; qualifiers are not respected by our optimization rules, e.g. mergeOutputAttributes; this PR demonstrates that qualifiers are entirely unused, and the validation for the current state should reflect current assumptions.

If we end up using qualifiers after all (I think that's really for the future and we should really remove them until then), we can easily update the validation.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try also to push mv_expand to its edges.

}
});

return plan.transformUp(LogicalPlan.class, p -> p.resolved() || p.childrenResolved() == false ? p : doRule(p));
Copy link
Contributor

@craigtaverner craigtaverner Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why p should be resolved, while p.children should be unresolved. Should it not be p -> p.resolved() == false || p.childrenResolved() == false ? p : doRule(p)

Copy link
Contributor Author

@alex-spies alex-spies Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was carried over from AnalyzerRule and BaseAnalyzerRule while collapsing their execution logic into this simple Rule.

Currently it says "I'll just return the current plan without changes (i.e. skip) if it's either already resolved or has yet-to-be resolved children". I think this is correct.

FROM employees
| SORT emp_no ASC
| KEEP emp_no, first_name, last_name
| RENAME emp_no AS last_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting edge case.
Maybe I am reading the docs update wrong (thank you for updating the docs as well):

If it conflicts with an existing column name,
the existing column is replaced by the renamed column. If multiple columns
are renamed to the same name, all but the rightmost column with the same new
name are dropped.

but there is a contradiction between what we say in docs and what actually happens. According to docs, in the results below, last_name should contain info from emp_no but it should be the second column from left to right, not the first one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch! The column is not replaced, it's dropped. I'll update the doc.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@alex-spies
Copy link
Contributor Author

Thanks for your reviews, @leemthompo , @astefan , @craigtaverner and @luigidellaquila !

I updated keep.asciidoc one more time (I noticed the info I added would be duplicated), then it's time for a merge!

@alex-spies alex-spies merged commit da53921 into elastic:main Jul 17, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.15 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 110488

@alex-spies
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.15

Questions ?

Please refer to the Backport tool documentation

alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Jul 17, 2024
* Enforce an invariant in our dependency checker so that logical plans never have duplicate output attribute names or ids.
* Fix ROW to not produce columns with duplicate names.
* Fix ResolveUnionTypes to not create multiple synthetic field attributes for the same union type.
* Add tests for commands using the same column name more than once.
* Update docs w.r.t. how commands behave if they are used with duplicate column names.

(cherry picked from commit da53921)

# Conflicts:
#	x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Rename.java
@alex-spies alex-spies deleted the validate-unique-plan-attribute-names branch July 17, 2024 10:51
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Jul 17, 2024
* Enforce an invariant in our dependency checker so that logical plans never have duplicate output attribute names or ids.
* Fix ROW to not produce columns with duplicate names.
* Fix ResolveUnionTypes to not create multiple synthetic field attributes for the same union type.
* Add tests for commands using the same column name more than once.
* Update docs w.r.t. how commands behave if they are used with duplicate column names.
elasticsearchmachine pushed a commit that referenced this pull request Jul 17, 2024
* ESQL: Validate unique plan attribute names (#110488)

* Enforce an invariant in our dependency checker so that logical plans never have duplicate output attribute names or ids.
* Fix ROW to not produce columns with duplicate names.
* Fix ResolveUnionTypes to not create multiple synthetic field attributes for the same union type.
* Add tests for commands using the same column name more than once.
* Update docs w.r.t. how commands behave if they are used with duplicate column names.

(cherry picked from commit da53921)

# Conflicts:
#	x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Rename.java

* Remove unrelated csv tests

These slipped in via merge conflicts.
alex-spies added a commit that referenced this pull request Jul 17, 2024
Calling Rename.output() previously returned wrong results.

Since #110488, instead it throws an IllegalStateException. That leads to test failures in the EsqlNodeSubclassTests because e.g. MvExpandExec and FieldExtractExec eagerly calls .output() on its child when it's being constructed, and the child can be a fragment containing a Rename.
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Jul 17, 2024
Calling Rename.output() previously returned wrong results.

Since elastic#110488, instead it throws an IllegalStateException. That leads to test failures in the EsqlNodeSubclassTests because e.g. MvExpandExec and FieldExtractExec eagerly calls .output() on its child when it's being constructed, and the child can be a fragment containing a Rename.

(cherry picked from commit 7df1b06)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Rename.java
elasticsearchmachine pushed a commit that referenced this pull request Jul 17, 2024
Calling Rename.output() previously returned wrong results.

Since #110488, instead it throws an IllegalStateException. That leads to test failures in the EsqlNodeSubclassTests because e.g. MvExpandExec and FieldExtractExec eagerly calls .output() on its child when it's being constructed, and the child can be a fragment containing a Rename.

(cherry picked from commit 7df1b06)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Rename.java
ioanatia pushed a commit to ioanatia/elasticsearch that referenced this pull request Jul 22, 2024
Calling Rename.output() previously returned wrong results.

Since elastic#110488, instead it throws an IllegalStateException. That leads to test failures in the EsqlNodeSubclassTests because e.g. MvExpandExec and FieldExtractExec eagerly calls .output() on its child when it's being constructed, and the child can be a fragment containing a Rename.
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Jul 23, 2024
Calling Rename.output() previously returned wrong results.

Since elastic#110488, instead it throws an IllegalStateException. That leads to test failures in the EsqlNodeSubclassTests because e.g. MvExpandExec and FieldExtractExec eagerly calls .output() on its child when it's being constructed, and the child can be a fragment containing a Rename.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.15.0 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Enforce uniquely named attributes in query plans
6 participants