ESQL: Improve grammar to allow identifiers with . #100740

costin · 2023-10-12T00:49:03Z

Extend the unquoted identifier to allow ., not just numbers. Without it,
the lexer reads the characters as a decimal literal, which leads to errors.

Fix #100312

elasticsearchmachine · 2023-10-12T00:49:41Z

Hi @costin, I've created a changelog YAML for you.

elasticsearchmachine · 2023-10-12T00:49:41Z

Pinging @elastic/es-ql (Team:QL)

elasticsearchmachine · 2023-10-12T00:49:42Z

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

costin · 2023-10-12T03:48:29Z

The issue with unquoted field names that contain . is that we already use the dot for qualifying names:
a.b means field b inside a, similar to the JSON object notation a.b.c.
Under this criterion a.1m.4321 gets parsed as field 1m under a; however, we do not allow unquoted field names to start with digits because of decimal notation, e.g. is 1e9 a field name or a number?
The improvement in this PR adds . to the unquoted identifiers so that a.1m becomes just one field name, with no qualifiers.
However, the test now fails for names such as a.@b (e.g. source.@timestamp) because @ appears inside the field name...

To make matters worse, we have different lexer rules for different commands, which is why the same unquoted identifier works in one but fails in another: in one it's just a string, in the other it's an expression.
Before going down this path, we need to figure out the handling of qualified names.
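
The ambiguity described above can be reproduced with a toy longest-match lexer. This is a minimal sketch in Python, not the actual EsqlBaseLexer.g4: the token names, the regular expressions and the `lex` helper are all assumptions made for illustration. It shows how a.1m splits into three tokens under the old identifier rule and becomes a single token once . is allowed in the identifier body.

```python
import re

# Simplified, hypothetical token rules; NOT the real EsqlBaseLexer.g4 definitions.
DECIMAL = r"(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+"
INTEGER = r"\d+"
OLD_ID = r"[A-Za-z][A-Za-z0-9_]*"    # LETTER (LETTER | DIGIT | '_')*
NEW_ID = r"[A-Za-z][A-Za-z0-9_.]*"   # identifier body additionally accepts '.'


def lex(text, id_pattern):
    """Tokenize text with longest-match semantics; ties go to the earlier rule."""
    rules = [
        ("DECIMAL", re.compile(DECIMAL)),
        ("INTEGER", re.compile(INTEGER)),
        ("ID", re.compile(id_pattern)),
        ("DOT", re.compile(r"\.")),
    ]
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (rule name, end offset) of the longest match so far
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


# Old rule: the '.1' is picked up as a decimal literal.
print(lex("a.1m", OLD_ID))
# Rule with '.' in the body: one identifier, no qualifier.
print(lex("a.1m", NEW_ID))
```

Under these assumed rules, 1e9 is always lexed as a number, which is the reason an unquoted identifier cannot start with a digit.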

astefan · 2023-10-12T06:42:54Z

x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4


 UNQUOTED_IDENTIFIER
-    : LETTER (LETTER | DIGIT | '_')*
+    : LETTER UNQUOTED_ID_BODY*


This will allow digits inside "subfields", but not on the root field. For example 123elasticsearch.node.stats.os.cpu.load_avg.1m

Do we want to allow unquoted identifiers that start with digits, though? If so, we'd at least have to disallow identifiers being all digits, no?

> we'd at least have to disallow identifiers being all digits, no?

ES accepts both fields "123" and "123.456".

We accept those as long as the fields are quoted, e.g. "123"; likewise the field above, "a.1m.4321", is accepted when quoted. The problem is handling fields that are NOT quoted.

alex-spies

LGTM

alex-spies · 2023-10-12T13:32:58Z

docs/changelog/100740.yaml

@@ -0,0 +1,6 @@
+pr: 100740
+summary: "ESQL: Improve grammar to allow identifiers with"


Changelog didn't like the period.

alex-spies · 2023-10-12T13:48:09Z

x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4


 UNQUOTED_IDENTIFIER
-    : LETTER (LETTER | DIGIT | '_')*
+    : LETTER UNQUOTED_ID_BODY*


Do we want to allow unquoted identifiers that start with digits, though? If so, we'd at least have to disallow identifiers being all digits, no?

alex-spies · 2023-10-12T14:38:59Z

I wonder if this change affects lexing/parsing of multi-segment identifiers that contain quoted segments, e.g.:

asdf.`1234`.fdsa

Currently this should be allowed and fine, but this PR could break cases like this, so we should double check if this is covered by tests. (Not familiar enough with our parsing/lexing to say.)

An alternative would be to use a (negative) lookahead in the lexer via a semantic predicate (like in this SO post), to allow an identifier to start with a digit only if preceded by a DOT fragment. However, that might be problematic given that we generate both Java and JavaScript code from the same antlr4 definition :/

costin · 2023-10-19T23:49:58Z

I've spent some time on the issue and we essentially have to pick whether we align all commands to have the same identifier definition (and restrictions) or keep things a bit loose, which results in problems like the one in the issue above.
While I like the flexibility, and I see that we took advantage of it in our tests, I'm opting for the former since it is less confusing.

To restate the problem: right now the from, drop, keep, rename and enrich commands, due to their simpler definition, accept a much larger set of identifiers to be declared without quotes.
The issue we're debating here is how to deal with unquoted identifiers, which, I would argue, should be a discouraged form.

1. fields with dots

from index
| where `duration.1m` > 1    // name needs quoting because of .
| keep duration.1m              // name can be unquoted

2. field with parentheses

from index
| stats count(*), max(field)
| where `count(*)` > 1          // aliased function name needs to be quoted
| keep count(*)                    // same alias can be used unquoted

3. fields with special chars such as @ or +

from index
| where `@domain+1` > 1   // needs quoting
| keep @domain+1            // used unquoted

1. I find case 1 the most concerning since . is used for qualified names such as a.b. We currently don't have qualifiers plugged in, but we will, especially when dealing with different sources. This means supporting qualifiers in both unquoted and quoted fields (and having to deal with escaping, among other things).
2. Case 2 is inconvenient when trying to work with things right after a stats; however, considering there's an alias version, I don't think that's much of a problem.
3. To me, case 3 is an example of why we'd want the identifiers to be the same everywhere. The expression @domain+1 has a different meaning depending on the command: in where it's an expression, while in keep it's a field as a whole.

As a user, I find 3 quite trappy for no real advantage.

elasticsearchmachine · 2023-10-19T23:57:55Z

Hi @costin, I've updated the changelog YAML for you.

alex-spies

This looks like we still need to make the CI happy (resp. make it compile), but the proposed refactoring LGTM.

alex-spies · 2023-10-27T09:33:00Z

x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4

 fragment UNDERSCORE
    : '_'
    ;

+fragment UNQUOTED_ID_BODY
+    : (LETTER | DIGIT | UNDERSCORE)


++ to removing the dot again from the unquoted identifier. I agree that dots will likely need to retain special meaning to navigate nested objects. If I understand correctly, this should also bring back the ability to quote just part of an identifier, e.g.

foo.bar.`1234asdf`
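
To make the idea of quoting just part of an identifier concrete, here is a small sketch in Python that splits a qualified name into its segments while treating dots inside backticks as part of the segment. The function name `split_qualified` and the exact rules (including a doubled backtick standing for a literal backtick) are assumptions for illustration; the real parser works on ANTLR tokens rather than on raw strings.

```python
def split_qualified(name):
    """Split a dotted name into segments; dots inside backticks do not split."""
    segments, current = [], []
    quoted = False
    i = 0
    while i < len(name):
        ch = name[i]
        if quoted:
            if ch == "`":
                if name[i + 1:i + 2] == "`":
                    # Doubled backtick inside quotes is a literal backtick.
                    current.append("`")
                    i += 2
                    continue
                quoted = False  # closing backtick
            else:
                current.append(ch)
        elif ch == "`":
            quoted = True  # opening backtick
        elif ch == ".":
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quoted:
        raise ValueError("unterminated backtick quote")
    segments.append("".join(current))
    return segments


print(split_qualified("foo.bar.`1234asdf`"))
print(split_qualified("`a.1m`.x"))
```

With the dot kept out of the unquoted identifier body, each segment can be quoted or unquoted independently, which is what the comment above hopes to regain.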

alex-spies · 2023-10-27T09:54:42Z

I think identifiers should be treated consistently across all clauses; I think for users it will be confusing that they need to do less quoting in KEEP than in WHERE. If I understand correctly, this could also simplify our lexing, considering that it requires lots of different modes at the moment.

dej611 · 2023-11-29T11:41:15Z

I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
The test suite found a regression with the new grammar, related in particular to - within identifiers:

from a | enrich my-policy

That used to be valid, while now it throws the following syntax errors:

SyntaxError: token recognition error at: '-'
SyntaxError: extraneous input 'policy' expecting <EOF>

Other than that, syntax errors are in general much noisier:

show info something went from a single syntax error (SyntaxError: extraneous input 'something' expecting <EOF>) to 9 now (one for each char)
from index | keep 4.5 went from 2 to 4 syntax errors
from a | rename fn() as a went from 0 to 2 syntax errors
etc... (more can be found in the linked PR)

alex-spies · 2023-11-29T12:25:21Z

> I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
> The test suite found a regression with the new grammar, related in particular to - within identifiers:
>
> from a | enrich my-policy

Just checked and with this PR's code, this regression is in fact inconsistent with from; the following works:

from my-idx | keep whatever

There's another "inconsistency" with how field names are handled:

from my-idx | eval x = my-field

This parses as my SUB field; users need backticks around my-field to make this work.
But field names are not index/policy names, so maybe this inconsistency is not so bad.
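
The difference between the unquoted and the backticked form can be illustrated with a toy expression tokenizer. This is a hedged sketch in Python: apart from SUB, which is the token name mentioned above, the token names, the regular expression and the `expression_tokens` helper are assumptions and do not mirror the real grammar.

```python
import re

# Hypothetical token set for expressions; only SUB is a name taken from the thread.
TOKEN = re.compile(
    r"(?P<QUOTED_ID>`(?:[^`]|``)*`)"      # backtick-quoted identifier
    r"|(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"    # unquoted identifier
    r"|(?P<INT>\d+)"
    r"|(?P<SUB>-)"
    r"|(?P<ASSIGN>=)"
)


def expression_tokens(text):
    """Return (token type, token text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# Unquoted: the dash is an operator between two identifiers.
print(expression_tokens("x = my-field"))
# Quoted: the whole name is a single identifier token.
print(expression_tokens("x = `my-field`"))
```

In an expression context the dash has to stay an operator, so a field named my-field can only be referenced with backticks there.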

astefan

LGTM
Documentation will need an update, as well. CC @bpintea @abdonpijpelink

luigidellaquila

LGTM

luigidellaquila · 2023-11-29T15:14:42Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/date.csv-spec

-from employees | eval y = date_trunc(1 year, hire_date) | stats count(emp_no) by y | sort y | keep y, count(emp_no) | limit 5;
+from employees | eval y = date_trunc(1 year, hire_date) | stats c = count(emp_no) by y | sort y | keep y, c | limit 5;

-y:date                        | count(emp_no):long
+y:date                        | c:long


I think this is a positive side effect; being able to use an unquoted function-like alias seems inconsistent

costin · 2023-11-29T15:24:07Z

> I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
> The test suite found a regression with the new grammar, related in particular to - within identifiers:

> Just checked and with this PR's code, this regression is in fact inconsistent with from; the following works:

I think for the enrich policy name we can improve the grammar to have the same lexing as in from, that is, allow - and other special characters in its name so there's no need for quoting.

> Other than that, syntax errors are in general much noisier:
>
> show info something went from a single syntax error (SyntaxError: extraneous input 'something' expecting <EOF>) to 9 now (one for each char)
> from index | keep 4.5 went from 2 to 4 syntax errors
> from a | rename fn() as a went from 0 to 2 syntax errors
> etc... (more can be found in the linked PR)

That's something we could look into improving; however, it's a fragile approach, since the error messages depend on the grammar, and a change like the one above makes ANTLR generate a different internal representation.

rename fn() doesn't work anymore since fn() needs to be a field identifier, meaning () cannot be used unquoted. The fix is to add backticks: rename `fn()`

astefan · 2023-11-29T22:39:51Z

Leaving this one here, since I don't see a solution to it:

row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`

costin · 2023-12-02T01:31:25Z

Updated the grammar to allow richer policy names than identifiers. Due to lexing, that ended up more complicated than expected: I had to add a submode for enrich, since otherwise the lexing of from identifiers and field identifiers tripped each other up.

> Leaving this one here, since I don't see a solution to it: row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`

added a test for it: stats count(`my-field`) | keep `count(my-field)`

astefan · 2023-12-04T08:41:45Z

x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java

+
+    public void testQuotedName() {
+        // row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`
+        LogicalPlan plan = processingCommand("stats count(`my-field`) |  keep `count(my-field)`");


In a real-life test where the code reaches the verifier, this query results in

{ "error": { "root_cause": [ { "type": "verification_exception", "reason": "Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?", "stack_trace": "org.elasticsearch.xpack.esql.analysis.VerificationException: Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?

Good catch - thanks for this.
This was incorrect - the correct syntax should be

keep `count(``my-field``)` ; the backquotes are escaped

since we want to differentiate between

stats count(a - b) ; which we want to support at some point
keep `count(a-b)`
stats count(`a-b`) ; this is not an expression but a field name hence the ` need to be kept around meaning
| keep `count(``a``-``b``)` ; this means any backquote from the user is kept as is since we're using the query verbatim
stats count(`a`-`b`) ; need to preserve the quotes
| keep `count(``a``-``b``)` ;

I've added a CSV test and caught a bug in our identifier handling.
I've tried to play with removing redundant quotes however as a user I found the behavior surprising.

Thanks @costin. This really LGTM.

elasticsearchmachine · 2023-12-06T23:39:03Z

Hi @costin, I've updated the changelog YAML for you.

costin · 2023-12-12T03:14:10Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec

@@ -783,7 +783,7 @@ FROM sample_data
 median_duration:double | client_ip:ip
 ;

-fieldEscaping
+fieldEscaping#[skip:-8.12.99, reason:Fixed bug in 8.13 of removing the leading/trailing backquotes of an identifier]
 FROM sample_data
 | stats count(`event_duration`) |  keep `count(``event_duration``)`


@abdonpijpelink This PR introduces a (subtle) breaking change due to a previous bug:

the leading/trailing backquotes from an identifier are removed which means that if folks used them (rare case) they would have to update their query.

stats count(`event_duration`) produces the alias count(`event_duration`) - same text

Since 8.13 (this PR) keep needs to use the same rules for quoting as the rest of commands meaning that field names with special characters need to be quoted and the backquote itself be escaped through repetition.
So

count(`event_duration`) becomes `count(``event_duration``)`

However, that is not the case in 8.12: the grammar rules are slightly different, so the text is interpreted verbatim. The doubled backticks are taken as is, which is why the field is not found:

Unknown column [count(``event_duration``)], did you mean [count(`event_duration`)]?
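
The quoting rule described above (wrap the name in backticks and escape each inner backtick by repeating it) can be sketched in a few lines of Python. The helper names `quote_identifier` and `unquote_identifier` are hypothetical and only illustrate the rule; they are not functions from the Elasticsearch code base.

```python
def quote_identifier(name):
    """Wrap a field name in backticks, doubling any backtick it contains."""
    return "`" + name.replace("`", "``") + "`"


def unquote_identifier(quoted):
    """Inverse of quote_identifier: strip the outer backticks, undouble inner ones."""
    if len(quoted) < 2 or quoted[0] != "`" or quoted[-1] != "`":
        raise ValueError("identifier is not backtick-quoted")
    # Note: this sketch does not reject stray single backticks in the body.
    return quoted[1:-1].replace("``", "`")


alias = "count(`event_duration`)"
print(quote_identifier(alias))
print(unquote_identifier(quote_identifier(alias)) == alias)
```

Under this rule, the alias produced by stats is referenced in keep by quoting the whole alias text, which matches the fieldEscaping test shown above.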

We need to put this information somewhere (not sure where) and in a smaller format.

abdonpijpelink · 2023-12-12T13:24:24Z

I propose we change the changelog yaml summary on this PR into something like:

Referencing expressions that contain backticks requires <<esql-identifiers,escaping those backticks>>.

And in parallel, we update the Identifiers section on the Syntax page for 8.13 to include the new behavior.

## Summary

This PR aligns the (new) Kibana grammar to the newer ES grammar changes proposed in elastic/elasticsearch#100740. `EXPAND` and `INLINESTATS` have been reinstated here (even though not used) to exactly match the ES grammar.

Most of the changes are due to `TOKEN` renaming, plus a few other changes in how identifiers are now parsed. Revisiting the validation logic also helped to find a couple of bugs on our validation side, but they were minimal and limited.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: kibanamachine
Co-authored-by: Stratoula Kalafateli
Co-authored-by: Abdon Pijpelink

ESQL: Improve grammar to allow identifiers with . and numbers

d500538

costin added >bug :Analytics/ES|QL AKA ESQL v8.11.1 labels Oct 12, 2023

costin requested review from luigidellaquila, astefan, bpintea and alex-spies October 12, 2023 00:49

costin changed the title ~~ESQL: Improve grammar to allow identifiers with . and numbers~~ ESQL: Improve grammar to allow identifiers with . Oct 12, 2023

elasticsearchmachine added v8.12.0 Team:QL (Deprecated) Meta label for query languages team labels Oct 12, 2023

Update docs/changelog/100740.yaml

6d7eaa7

costin added the auto-backport-and-merge label Oct 12, 2023

astefan reviewed Oct 12, 2023

alex-spies approved these changes Oct 12, 2023

Update ANTLR files

0e4ecb1

costin added >enhancement and removed auto-backport-and-merge v8.11.1 >bug labels Oct 19, 2023

alex-spies approved these changes Oct 27, 2023

costin mentioned this pull request Nov 16, 2023

ESQL: Meta-ticket for 8.12 #102272

Closed

10 tasks

Fix yaml test

64f191b

astefan approved these changes Nov 29, 2023

luigidellaquila approved these changes Nov 29, 2023

Update parsing of enrich policy name

9a51c58

astefan reviewed Dec 4, 2023

costin modified the milestones: 8.12, 8.13 Dec 4, 2023

brianseeders added v8.13.0 and removed v8.12.0 labels Dec 6, 2023

Update docs/changelog/100740.yaml

96d1675

This was referenced Dec 7, 2023

[ES|QL] Migration to new ES|QL grammar elastic/kibana#172148

Closed

[ES|QL] Update grammar based on ES changes elastic/kibana#172789

Merged

wchaparro assigned costin Dec 8, 2023

costin added 3 commits December 8, 2023 21:24

Merge branch 'main' into fix/100312

830b436

Address feedback on field name quoting

5865a2c

Skip test due to breaking change

649f769

costin commented Dec 12, 2023

Update summary

a047074

costin merged commit a8a956f into elastic:main Dec 12, 2023
2 of 15 checks passed

costin deleted the fix/100312 branch December 12, 2023 22:04

abdonpijpelink mentioned this pull request Dec 13, 2023

[DOCS] Improve ES|QL backticks docs #103386

Closed

abdonpijpelink mentioned this pull request Jan 5, 2024

[DOCS] ES|QL backtick changes in 8.13 #103958

Merged
