ESQL: Improve grammar to allow identifiers with . #100740
Extend the unquoted identifier to contain . (dots), not just numbers. Without it, the lexer reads those characters as a decimal literal, which leads to errors. Fixes elastic#100312
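To illustrate the lexing problem described in the PR description, here is a minimal Python sketch of a longest-match tokenizer. The token names echo the discussion, but the regular expressions and the `tokenize` helper are assumptions made for illustration; they are not the actual ES|QL lexer rules.

```python
import re

# Toy token rules (hypothetical, loosely modelled on the discussion).
TOKEN_RULES = [
    ("DECIMAL_LITERAL", re.compile(r"\d+\.\d*|\.\d+")),
    ("INTEGER_LITERAL", re.compile(r"\d+")),
    ("UNQUOTED_IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("DOT", re.compile(r"\.")),
]


def tokenize(text):
    """Split text into (token_name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (token_name, end_position) of the longest match so far
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


# A dot followed by a letter is a plain separator ...
print(tokenize("cpu.load_avg"))
# ... but a dot followed by a digit is swallowed by the decimal literal.
print(tokenize("load_avg.1m"))
```

In this sketch the segment `.1` of `load_avg.1m` is consumed as a decimal literal, so the parser never sees a dot followed by an identifier. That matches the kind of failure the description attributes to the lexer.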
Hi @costin, I've created a changelog YAML for you.
Pinging @elastic/es-ql (Team:QL)
Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)
The issue with unquoted field names that contain
To make matters worse, we have different lexer rules between different commands, which is why the same unquoted identifier works in one but fails in the other: in one it is just a string, in the other it is an expression.
 UNQUOTED_IDENTIFIER
-    : LETTER (LETTER | DIGIT | '_')*
+    : LETTER UNQUOTED_ID_BODY*
This will allow digits inside "subfields", but not on the root field. For example 123elasticsearch.node.stats.os.cpu.load_avg.1m
Do we want to allow unquoted identifiers that start with digits, though? If so, we'd at least have to disallow identifiers being all digits, no?
we'd at least have to disallow identifiers being all digits, no?
ES accepts both fields "123" and "123.456".
Which we accept as long as the fields are quoted, e.g. "123". Likewise, the field above, "a.1m.4321", is accepted when quoted; the problem is handling fields which are NOT quoted.
LGTM
docs/changelog/100740.yaml
Outdated
@@ -0,0 +1,6 @@
+pr: 100740
+summary: "ESQL: Improve grammar to allow identifiers with"
Changelog didn't like the period.
I wonder if this change affects lexing/parsing of multi-segment identifiers that contain quoted segments, e.g.:
Currently this should be allowed and fine, but this PR could break cases like this, so we should double check if this is covered by tests. (Not familiar enough with our parsing/lexing to say.) An alternative would be to use a (negative) lookahead in the lexer via a semantic predicate (like in this SO post), to allow an identifier to start with a digit only if preceded by a |
I've spent some time on the issue, and we essentially have to pick whether we align all commands to have the same identifier definition (and restrictions) or keep things a bit loose, which results in problems like the one in the issue above. To restate the problem: right now the from, drop, keep, rename and enrich commands, due to their simpler definition, accept a much larger set of identifiers without quotes.
I find case 1 the most concerning since
As a user, I find 3 quite trappy for no real advantage. |
Hi @costin, I've updated the changelog YAML for you.
This looks like we still need to make the CI happy (or rather, make it compile), but the proposed refactoring LGTM.
fragment UNDERSCORE
    : '_'
    ;

fragment UNQUOTED_ID_BODY
    : (LETTER | DIGIT | UNDERSCORE)
++ to removing the dot again from the unquoted identifier. I agree that dots will likely need to retain special meaning to navigate nested objects. If I understand correctly, this should also bring back the ability to quote just part of an identifier, e.g.
foo.bar.`1234asdf`
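As a rough illustration of partially quoted identifiers, the Python sketch below splits a qualified name into dot-separated segments, where each segment is either unquoted (a letter followed by letters, digits or underscores, as in the first alternative of the grammar fragment above) or backquoted. The helper name `split_segments` and both regular expressions are assumptions for illustration and cover only what is shown in this thread, not the full ES|QL grammar.

```python
import re

# Unquoted segment: LETTER UNQUOTED_ID_BODY* (ASCII letters assumed here).
UNQUOTED_SEGMENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Quoted segment: backquotes around any text, with `` as an escaped backquote.
QUOTED_SEGMENT = re.compile(r"`(?:[^`]|``)*`")


def split_segments(identifier):
    """Return the list of segments, or None if the identifier is not valid."""
    segments = []
    pos = 0
    while True:
        match = (QUOTED_SEGMENT.match(identifier, pos)
                 or UNQUOTED_SEGMENT.match(identifier, pos))
        if match is None:
            return None  # segment is neither quoted nor a valid unquoted name
        segments.append(match.group())
        pos = match.end()
        if pos == len(identifier):
            return segments
        if identifier[pos] != ".":
            return None  # segments must be separated by a single dot
        pos += 1


print(split_segments("foo.bar.`1234asdf`"))  # quoted last segment is accepted
print(split_segments("foo.bar.1234asdf"))    # unquoted digit-leading segment is rejected
```

Under these assumptions only the segment that starts with a digit needs quoting, which is the behavior the comment above expects to get back.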
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/EsqlBaseParser.java
Outdated
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java
I think identifiers should be treated consistently across all clauses; I think for users it will be confusing that they need to do less quoting in |
I've migrated our grammar to this new one and fixed most of the token renames: elastic/kibana#172148
That used to be valid before, while now it throws the following syntax errors:
Other than that, in general syntax errors are way more noisy:
|
Just checked and with this PR's code, this regression is in fact inconsistent with
There's another "inconsistency" with how field names are handled:
This parses as |
LGTM
Documentation will need an update, as well. CC @bpintea @abdonpijpelink
LGTM
-from employees | eval y = date_trunc(1 year, hire_date) | stats count(emp_no) by y | sort y | keep y, count(emp_no) | limit 5;
+from employees | eval y = date_trunc(1 year, hire_date) | stats c = count(emp_no) by y | sort y | keep y, c | limit 5;

-y:date | count(emp_no):long
+y:date | c:long
I think this is a positive side effect; being able to use an unquoted function-like alias seems inconsistent
I think for enrich policy name, we can improve the grammar to have the same lexing as in
That's something we could look into improving; however, it's a fragile approach, since the error messages depend on the grammar, which generates a different ANTLR internal representation whenever a change like the above is made.
|
Leaving this one here, since I don't see a solution to it: |
Updated the grammar to allow richer policies than identifiers. Due to lexing, that ended up more complicated than expected: I had to add a submode for enrich, since otherwise the lexing of from identifiers and field identifiers tripped each other up.
Added a test for it:
public void testQuotedName() {
    // row `my-field`=123 | stats count(`my-field`) | eval x = `count(`my-field`)`
    LogicalPlan plan = processingCommand("stats count(`my-field`) | keep `count(my-field)`");
In a real-life test where the code reaches the verifier, this query results in
{
"error": {
"root_cause": [
{
"type": "verification_exception",
"reason": "Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?",
"stack_trace": "org.elasticsearch.xpack.esql.analysis.VerificationException: Found 1 problem\nline 1:57: Unknown column [count(my-field)], did you mean [count(`my-field`)]?
Good catch - thanks for this.
This was incorrect - the correct syntax should be
keep `count(``my-field``)` ; the backquotes are escaped
since we want to differentiate between
stats count(a - b) ; which we want to support at some point
keep `count(a-b)`
stats count(`a-b`) ; this is not an expression but a field name, hence the ` needs to be kept around, meaning
| keep `count(``a-b``)`
; this means any backquote from the user is kept as is, since we're using the query verbatim
stats count(`a`-`b`) ; need to preserve the quotes
| keep `count(``a``-``b``)` ;
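The escaping rule described above (wrap the name in backquotes and double any backquote inside it) can be sketched in Python as follows. The function names `quote_identifier` and `unquote_identifier` are hypothetical helpers for illustration, not part of the ES|QL code base.

```python
def quote_identifier(name):
    """Wrap a raw column name in backquotes, doubling embedded backquotes."""
    return "`" + name.replace("`", "``") + "`"


def unquote_identifier(quoted):
    """Reverse quote_identifier: strip the outer backquotes, undo the doubling."""
    if len(quoted) < 2 or quoted[0] != "`" or quoted[-1] != "`":
        raise ValueError(f"not a backquoted identifier: {quoted!r}")
    return quoted[1:-1].replace("``", "`")


# The alias produced by stats keeps the user's text verbatim, backquotes included,
# so referring to it later requires the quoted and escaped form.
alias = "count(`my-field`)"
print(quote_identifier(alias))
print(unquote_identifier(quote_identifier(alias)) == alias)
```

The sketch does not validate that embedded backquotes come in pairs when unquoting; a real parser would reject such input at the lexer level.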
I've added a CSV test and caught a bug in our identifier handling.
I've tried removing redundant quotes; however, as a user I found the behavior surprising.
Thanks @costin. This really LGTM.
Hi @costin, I've updated the changelog YAML for you.
@@ -783,7 +783,7 @@ FROM sample_data
 median_duration:double | client_ip:ip
 ;

-fieldEscaping
+fieldEscaping#[skip:-8.12.99, reason:Fixed bug in 8.13 of removing the leading/trailing backquotes of an identifier]
 FROM sample_data
 | stats count(`event_duration`) | keep `count(``event_duration``)`
@abdonpijpelink This PR introduces a (subtle) breaking change due to a previous bug:
- the leading/trailing backquotes of an identifier are removed, which means that folks who used them (a rare case) would have to update their queries.
stats count(`event_duration`) produces the alias count(`event_duration`), i.e. the same text.
Since 8.13 (this PR), keep needs to use the same quoting rules as the rest of the commands, meaning that field names with special characters need to be quoted and the backquote itself escaped through repetition.
So
count(`event_duration`) becomes `count(``event_duration``)`
However, that is not the case in 8.12: the grammar rules are slightly different, so the text is interpreted verbatim, meaning the escaped backticks are taken as is, which is why the field is not found:
Unknown column [count(``event_duration``)], did you mean [count(`event_duration`)]?
We need to put this information somewhere (not sure where) and in a smaller format.
I propose we change the changelog yaml
And in parallel, we update the Identifiers section on the Syntax page for 8.13 to include the new behavior.
## Summary
This PR aligns the (new) Kibana grammar to the newer ES grammar changes proposed in elastic/elasticsearch#100740. `EXPAND` and `INLINESTATS` have been reinstated here (even though not used) to exactly match the ES grammar.
Most of the changes are due to `TOKEN` renaming, plus a few other changes in how identifiers are now parsed.
Revisiting the validation logic also helped to find a couple of bugs on our validation side, but they were very minimal and limited.
### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
---------
Co-authored-by: kibanamachine
Co-authored-by: Stratoula Kalafateli
Co-authored-by: Abdon Pijpelink