Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Improve grammar to allow identifiers with . #100740

Merged
merged 14 commits into from
Dec 12, 2023
Merged

Conversation

costin
Copy link
Member

@costin costin commented Oct 12, 2023

Extend the unquoted identifier to contain . not just numbers. Without it
the lexer picks the characters as decimal literal which leads to errors

Fix #100312

Extend the unquoted identifier to contain . not just numbers. Without it
 the lexer picks the characters as decimal literal which leads to errors

Fix elastic#100312
@costin costin changed the title ESQL: Improve grammar to allow identifiers with . and numbers ESQL: Improve grammar to allow identifiers with . Oct 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @costin, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine elasticsearchmachine added v8.12.0 Team:QL (Deprecated) Meta label for query languages team labels Oct 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@costin
Copy link
Member Author

costin commented Oct 12, 2023

The issue with unquoted field names that contain . is we are already using that for qualifying names:
a.b means field b inside a. Similar to the JSON object notation a.b.c.
Under this criterion a.1m.4321 gets parsed as field 1m under a however we do not allow unquoted field names to start with digits because of decimal notation, e.g. 1e9 is a field name or a number?
The improvement in this PR adds . to the unquoted identifiers so that a.1m becomes just one field name - there are no qualifiers.
However the test now fails for names such as a.@b (e.g source.@timestamp) because @ appears inside the field name...

To make matters worse we have different lexer rules between different commands hence why the same unquoted identifier works in one but fails in the other - in once is just a string, in the other it's an expression.
Before going down this path, we need to figure out the handling of qualified names.


UNQUOTED_IDENTIFIER
: LETTER (LETTER | DIGIT | '_')*
: LETTER UNQUOTED_ID_BODY*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will allow digits inside "subfields", but not on the root field. For example 123elasticsearch.node.stats.os.cpu.load_avg.1m

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow unquoted identifiers that start with digits, though? If so, we'd at least have to disallow identifiers being all digits, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd at least have to disallow identifiers being all digits, no?

ES accepts both fields "123" and "123.456".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which we accept as long as the fields are quoted "123" - same with field above "a.1m.4321" is accepted; the problem is handling fields which are NOT quoted.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -0,0 +1,6 @@
pr: 100740
summary: "ESQL: Improve grammar to allow identifiers with"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog didn't like the period.


UNQUOTED_IDENTIFIER
: LETTER (LETTER | DIGIT | '_')*
: LETTER UNQUOTED_ID_BODY*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow unquoted identifiers that start with digits, though? If so, we'd at least have to disallow identifiers being all digits, no?

@alex-spies
Copy link
Contributor

I wonder if this change affects lexing/parsing of multi-segment identifiers that contain quoted segments, e.g.:

asdf.`1234`.fdsa

Currently this should be allowed and fine, but this PR could break cases like this, so we should double check if this is covered by tests. (Not familiar enough with our parsing/lexing to say.)

An alternative would be to use a (negative) lookahead in the lexer via a semantic predicate (like in this SO post), to allow an identifier to start with a digit only if preceded by a DOT fragment. However, that might be problematic given that we generate both Java and JavaScript code from the same antlr4 definition :/

@costin
Copy link
Member Author

costin commented Oct 19, 2023

I've spent some time on the issue and we essentially have to pick whether we align all commands to have the same identifier definition (and restrictions) or keep things a bit loose which results in problems like in the issue above.
While I like the flexibility and I see that we took advantage of it in our tests, I'm opting for the latter since it less confusing.

To restate the problem, right now from, drop, keep, rename and enrich commands due to their simpler definition, accept a much larger set of identifiers to be declared without quotes.
To restate, the issue we're debating here is dealing with unquoted identifiers which, I would argue, should be discouraged form.

  1. fields with docs
from index
| where `duration.1m` > 1    // name needs quoting because of .
| keep duration.1m              // name can be unquoted 
  1. field with parenthesis
from index
| stats count(*), max(field)   
| where `count(*)` > 1          // aliased function name need to be quoted
| keep count(*)                    // same alias can be used unquoted
  1. fields with special chars such as @ or +
from index
| where `@domain+1` > 1   // needs quoting
| keep @domain+1            // used unquoted

I find case 1 the most concerning since . is used for qualified names a. b. We currently don't have plugged in but we will especially when dealing with different sources. This means supporting qualifiers both in unquoted and quoted fields (and having to deal with escaping among other things).

  1. is inconvenient when trying to work with things right after a stats however considering there's an alias version I don't think that's much of a problem.

  2. to me, this is an example why we'd want the identifiers to be the same everywhere else. The expression @domain+1 has different meaning depending on the command - in where it's an expression while in keep is a field as whole.

As a user, I find 3 quite trappy for no real advantage.

@elasticsearchmachine
Copy link
Collaborator

Hi @costin, I've updated the changelog YAML for you.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like we still need to make the CI happy (resp. make it compile), but the proposed refactoring LGTM.

fragment UNDERSCORE
: '_'
;

fragment UNQUOTED_ID_BODY
: (LETTER | DIGIT | UNDERSCORE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to removing the dot again from the unquoted identifier. I agree that dots will likely need to retain special meaning to navigate nested objects. If I understand correctly, this should also bring back the ability to quote just part of an identifier, e.g.

foo.bar.`1234asdf`

@alex-spies
Copy link
Contributor

I've spent some time on the issue and we essentially have to pick whether we align all commands to have the same identifier definition (and restrictions) or keep things a bit loose which results in problems like in the issue above. While I like the flexibility and I see that we took advantage of it in our tests, I'm opting for the latter since it less confusing.

To restate the problem, right now from, drop, keep, rename and enrich commands due to their simpler definition, accept a much larger set of identifiers to be declared without quotes. To restate, the issue we're debating here is dealing with unquoted identifiers which, I would argue, should be discouraged form.

  1. fields with docs
from index
| where `duration.1m` > 1    // name needs quoting because of .
| keep duration.1m              // name can be unquoted 
  1. field with parenthesis
from index
| stats count(*), max(field)   
| where `count(*)` > 1          // aliased function name need to be quoted
| keep count(*)                    // same alias can be used unquoted
  1. fields with special chars such as @ or +
from index
| where `@domain+1` > 1   // needs quoting
| keep @domain+1            // used unquoted

I find case 1 the most concerning since . is used for qualified names a. b. We currently don't have plugged in but we will especially when dealing with different sources. This means supporting qualifiers both in unquoted and quoted fields (and having to deal with escaping among other things).

  1. is inconvenient when trying to work with things right after a stats however considering there's an alias version I don't think that's much of a problem.
  2. to me, this is an example why we'd want the identifiers to be the same everywhere else. The expression @domain+1 has different meaning depending on the command - in where it's an expression while in keep is a field as whole.

As a user, I find 3 quite trappy for no real advantage.

I think identifiers should be treated consistently across all clauses; I think for users it will be confusing that they need to do less quoting in KEEP than in WHERE. If I understand correctly, this could also simplify our lexing, considering that it requires lots of different modes at the moment.

@costin costin mentioned this pull request Nov 16, 2023
10 tasks
@dej611
Copy link
Contributor

dej611 commented Nov 29, 2023

I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
The test suite found out a regression with the new grammar, in particular related to the - within identifiers:

from a | enrich my-policy

That used to be valid before, while now it throws the following syntax errors:

SyntaxError: token recognition error at: '-'
SyntaxError: extraneous input 'policy' expecting <EOF>

Other than that, in general syntax errors are way more noisy:

  • show info something passed from a single syntax error (SyntaxError: extraneous input 'something' expecting <EOF>) to 9 now (one for each char)
  • from index | keep 4.5 passed from 2 to 4 syntax errors
  • from a | rename fn() as a passed from 0 to 2 syntax errors
  • etc... (can find more in the linked PR)

@alex-spies
Copy link
Contributor

I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148 The test suite found out a regression with the new grammar, in particular related to the - within identifiers:

from a | enrich my-policy

Just checked and with this PR's code, this regression is in fact inconsistent with from; the following works:

from my-idx | keep whatever

There's another "inconsistency" with how field names are handled:

from my-idx | eval x = my-field

This parses as my SUB field; users need backticks around my-field to make this work.
But field names are not index/policy names, so maybe this inconsistency is not so bad.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Documentation will need an update, as well. CC @bpintea @abdonpijpelink

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines -142 to +144
from employees | eval y = date_trunc(1 year, hire_date) | stats count(emp_no) by y | sort y | keep y, count(emp_no) | limit 5;
from employees | eval y = date_trunc(1 year, hire_date) | stats c = count(emp_no) by y | sort y | keep y, c | limit 5;

y:date | count(emp_no):long
y:date | c:long
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a positive side effect; being able to use an unquoted function-like alias seems inconsistent

@costin
Copy link
Member Author

costin commented Nov 29, 2023

I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
The test suite found out a regression with the new grammar, in particular related to the - within identifiers:

Just checked and with this PR's code, this regression is in fact inconsistent with from; the following works:

I think for enrich policy name, we can improve the grammar to have the same lexing as in from - that is allow - and other special characters in its name so there's no need for quoting.

Other than that, in general syntax errors are way more noisy:

show info something passed from a single syntax error (SyntaxError: extraneous input 'something' expecting ) to 9 now (one for each char)
from index | keep 4.5 passed from 2 to 4 syntax errors
from a | rename fn() as a passed from 0 to 2 syntax errors
etc... (can find more in the linked PR)

That's something that we could look into improving however it's a fragile approach since the error messages depend on the grammar which generates different ANTLR internal representation. Which is triggered by a change like the above.

rename fn() doesn't work anymore since fn() needs to be a field identifier meaning () cannot be used unquoted. The fix is to add backticks: rename `fn()`

@astefan
Copy link
Contributor

astefan commented Nov 29, 2023

Leaving this one here, since I don't see a solution to it:row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`

@costin
Copy link
Member Author

costin commented Dec 2, 2023

Updated the grammar to allow richer policies than identifiers - due to lexing that ended up more complicated than expected having to add a submode for enrich since otherwise the lexing between from identifiers and field identifiers tripped each other.

Leaving this one here, since I don't see a solution to it:row my-field=123 | stats count(my-field) | eval x = count(my-field)

added a test for it: stats count(`my-field`) | keep `count(my-field)`


public void testQuotedName() {
// row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`
LogicalPlan plan = processingCommand("stats count(`my-field`) | keep `count(my-field)`");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a real-life test where the code reaches the verifier, this query results in

{
    "error": {
        "root_cause": [
            {
                "type": "verification_exception",
                "reason": "Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?",
                "stack_trace": "org.elasticsearch.xpack.esql.analysis.VerificationException: Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - thanks for this.
This was incorrect - the correct syntax should be

keep `count(``my-field``)`  ; the backquotes are escaped

since we want to differentiate between

stats count(a - b)  ; which we want to support at some point
keep `count(a-b)`  

stats count(`a-b`)   ; this is not an expression but a field name hence the ` need to be kept around meaning
| keep `count(``a``-``b``)`

; this means any backquote from the user is kept as is since we're using the query verbatim

stats count(`a`-`b`)         ; need to preserve the quotes
| keep `count(``a``-``b``)` ; 

I've added a CSV test and caught a bug in our identifier handling.
I've tried to play with removing redundant quotes however as a user I found the behavior surprising.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @costin. This really LGTM.

@costin costin modified the milestones: 8.12, 8.13 Dec 4, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @costin, I've updated the changelog YAML for you.

@@ -783,7 +783,7 @@ FROM sample_data
median_duration:double | client_ip:ip
;

fieldEscaping
fieldEscaping#[skip:-8.12.99, reason:Fixed bug in 8.13 of removing the leading/trailing backquotes of an identifier]
FROM sample_data
| stats count(`event_duration`) | keep `count(``event_duration``)`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abdonpijpelink This PR introduces a (subtle) breaking change due to a previous bug:

  • the leading/trailing backquotes from an identifier are removed which means that if folks used them (rare case) they would have to update their query.
stats count(`event_duration`) produces the alias count(``event_duration`) - same text

Since 8.13 (this PR) keep needs to use the same rules for quoting as the rest of commands meaning that field names with special characters need to be quoted and the backquote itself be escaped through repetition.
So

count(`event_duration`)  becomes `count(``event_duration``)`

However that is not the case in 8.12 - the grammar rules are slightly different so the text is interpreted verbatim meaning the escaping of backticks is taken as is hence why the field is not found:

 Unknown column [count(``event_duration``)], did you mean [count(`event_duration`)]?

We need to put this information somewhere (not sure where) and in a smaller format.

@abdonpijpelink
Copy link
Contributor

I propose we change the changelog yaml summary on this PR into something like:

Referencing expressions that contain backticks requires <<esql-identifiers,escaping those backticks>>.

And in parallel, we update the Identifiers section on the Syntax page for 8.13 to include the new behavior.

@costin costin merged commit a8a956f into elastic:main Dec 12, 2023
2 of 15 checks passed
@costin costin deleted the fix/100312 branch December 12, 2023 22:04
dej611 added a commit to elastic/kibana that referenced this pull request Dec 20, 2023
## Summary

This PR aligns the (new) Kibana grammar to the newer ES grammar changes
proposed in elastic/elasticsearch#100740 .

`EXPAND` and `INLINESTATS` has been reinstated here (even when not used)
to exactly match the ES grammar.

Most of the changes are due to `TOKEN` renaming plus few other changes
on how identifiers are now parsed.

Revisit the validation logic helped also to find a couple of bugs on our
validation side, but they were very minimal and limited.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Stratoula Kalafateli <[email protected]>
Co-authored-by: Abdon Pijpelink <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:QL (Deprecated) Meta label for query languages team v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ES|QL does not consistently handle field names starting with a number
10 participants