-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EQL grammar updates and tests #49658
Conversation
Pinging @elastic/es-search (:Search/EQL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some comments mainly around the intent behind the grammar - on whether loosen it covers queries not currently supported by the grammar.
It would be helpful to have accompanying queries beside the grammar changes to better correlate the two.
Thanks.
|
||
span | ||
: WITH MAXSPAN EQ DIGIT_IDENTIFIER | ||
: BY expression (COMMA expression)* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expression
is quite loose vs identifier as it considers foo +1
, 1/2
etc as valid grammar and it's not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional, and does mirror the existing grammar.
It'll likely have performant implications on the stack, but existing implementations of EQL do support this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Languages are defined by their syntax - how parsing is done, whether it's grammar that leads to generation, done by hand, etc.. it's an implementation detail.
Considering we're looking at reimplementing the parsing in a different language, with a different type of parser (LL vs PEG until v0.7, LALR now in 0.8) I don't think fully mirroring the grammar is attainable nor should be a goal.
Further more in this case, the grammar is incorrect so why keep that ?
Between correctness and compatibility, we should always chose the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Languages are defined by their syntax
Yes I agree. Even more precisely, I would say that languages are defined by the inputs they accept. I'm not advocating for mirroring a PEG grammar in ANTLR, but I do think that we should strive the same inputs and provide the same semantics as before. We can make exceptions as needed, and fix things along the way, but they should be intentional and we should be aware.
Further more in this case, the grammar is incorrect so why keep that ?
I think this is where we disagree. The previous grammar and implementations actually used this functionality. It wasn't too loose on accident. There were a few cases where the underlying schema wasn't enough, for by
to match the underlying events together and we have had to use something akin to this, for example:
sequence
[process where ...] by some_field
[security where ...] by someFunction(other_field)
I think that if we ever have to choose between correctness and compatibility, choosing correctness makes sense. But in this case, the design was intentional so I don't think these are in conflict. If we need to restrict this because it's not scalable within ES, we can do that. But that doesn't mean the previous version was incorrect, just because there weren't many examples that exercised arbitrary expressions for a join key.
qualifiedNames | ||
: qualifiedName (COMMA qualifiedName)* | ||
; | ||
|
||
qualifiedName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add the square brackets and integers? Not sure how that qualifies a name if there's no .
.
The reason the grammar defines the prefix instead of the suffix optional is to help with parsing (the identifier is always the tail of the list while everything before it is the qualifier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lets us type a.b[0].c
, for instance. I believe we said indexing in arrays is out of scope, but it's sometimes necessary within pipes, because you can use events[0].some_field
to refer to fields from the source events:
sequence by pid
[process where process.name == "winword.exe"]
[file where file.extension == "exe"]
| unique events[0].command_line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow - if we don't support arrays how would that work with pipes?
As for array, grammar-wise the way to go is through a dedicated rule:
primaryExpression
...
| qualifiedName
| value=primaryExpression LB index=valueExpression RB
This supports both literal expressions for the index and clearly differentiates between a name and the array syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more flexible than the existing EQL, which has its pros and cons.
This would accept this, which EQL doesn't currently accept:
some_function(...)["some field"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to respond to this - to avoid having a function as an argument use qualifiedName or identifier instead in the array rule:
| value=identifier LB index=valueExpression RB
which would prevent the function rule from applying.
x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java
Show resolved
Hide resolved
x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public void exitQuotedIdentifier(EqlBaseParser.QuotedIdentifierContext context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another useful callback to strip quotes from identifier "where" WHERE condition
so the parser receives where
instead of "where"
.
I think that it might be good to have another discussion to hash things out further over zoom. I think some places where we diverge are in our expectations of the grammar and case-sensitivity. I'll summarize some of mine, so that it's hopefully more clear than it may have been with the comments:
|
I agree with everything from @rw-access's comment, with the exception of one point:
We should extend the syntax to all for any characters that are allowed in Elasticsearch field names. I don't think we necessarily need to allow for all arbitrary characters) and we don't need to allow for index patterns ( see #49634 and #49462 (comment)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a round of comments. Please merge the latest branch version in - it should fix the CI failures.
x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java
Show resolved
Hide resolved
x-pack/plugin/eql/src/test/java/org/elasticsearch/xpack/eql/parser/GrammarTests.java
Show resolved
Hide resolved
x-pack/plugin/eql/src/test/java/org/elasticsearch/xpack/eql/parser/GrammarTests.java
Show resolved
Hide resolved
x-pack/plugin/eql/src/test/java/org/elasticsearch/xpack/eql/parser/GrammarTests.java
Outdated
Show resolved
Hide resolved
process where process_name == "svchost.exe" and command_line != "* -k *"; | ||
process where process_name in ('ipconfig.exe', 'netstat.exe', 'systeminfo.exe', 'route.exe'); | ||
process where subtype.create and wildcard(command_line, "*.ost *", "*.pst *") | ||
; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file auto-generated?
I ask because it's fairly inconsistent.
👍 on comments.
If there are different types of queries, consider breaking it down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it's at least partially autogenerated. most of it is copied from the unit tests for the python library. I can definitely clean it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if you think it's a good final state.
Btw, something that we do for docs is to include actual code snippets that are tested (for example:
// tag::describeTable |
Just something to keep in mind (if you have queries that are used in the docs, you can put them aside for the future).
@@ -98,18 +103,19 @@ primaryExpression | |||
: constant #constantDefault | |||
| functionExpression #function | |||
| qualifiedName #dereference | |||
| ESCAPED_IDENTIFIER #identifierEscape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not declare both ESCAPED_IDENTIFIER
and IDENTIFIER
under identifier
as in the previous version and use IDENTIFIER
where needed (such as function names - it's might be worth using STRING or declaring a different alphanumeric for that)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, function names and pipe names should be the only exceptions to that.
Would you prefer to support both of these? I figured that if you are going to escape with `, then you might as well escape the whole thing, but I'm not attached:
# ticks around the full path. dots interpreted as normal
`some-escaped-field.one-more.escaped-!-field`
# ticks around each key
`some-escaped-field`.`one-more`.`escaped-!-field`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the functionExpression
rule pick a pipe/function declaration? In general I'd rather have us use qualifiedName everywhere (instead of identifier) so that user are able to use the .
hierarchy and thus avoid redundancy (qualifiedName & identifier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments before calling it a day.
@elasticmachine run elasticsearch-ci/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in and move any left over comments as different issues.
* EQL: Plug query params into the AstBuilder (#51886) As the eventType is customizable, plug that into the parser based on the given request. (cherry picked from commit 5b4a3a3) * EQL: Add field resolution and verification (#51872) Add basic field resolution inside the Analyzer and a basic Verifier to check for any unresolved fields. (cherry picked from commit 7087358) * EQL: Introduce basic execution pipeline (#51809) Add main classes that form the 'execution' pipeline are added - most of them have no functionality; the purpose of this PR is to add flesh out the contract between the various moving parts so that work can start on them independently. (cherry picked from commit 9a1bae5) * EQL: Add AstBuilder to convert to QL tree (#51558) * EQL: Add AstBuilder visitors * EQL: Add tests for wildcards and sets * EQL: Fix licensing * EQL: Fix ExpressionTests.java license * EQL: Cleanup imports * EQL: PR feedback and remove LiteralBuilder * EQL: Split off logical plan from expressions * EQL: Remove stray import * EQL: Add predicate handling for set checks * EQL: Remove commented out dead code * EQL: Remove wildcard test, wait until analyzer (cherry picked from commit a462700) * EQL grammar updates and tests (#49658) * EQL: Additional tests and grammar updates * EQL: Add backtick escaped identifiers * EQL: Adding keywords to language * EQL: Add checks for unsupported syntax * EQL: Testing updates and PR feedback * EQL: Add string escapes * EQL: Cleanup grammar for identifier * EQL: Remove tabs from .eql tests (cherry picked from commit 6f1890b)
Updated the grammar to more closely match https://github.com/endgameinc/eql/blob/master/eql/etc/eql.g (minus PEG). Also added additional unit tests from the python repository. I haven't ported over the invalid queries, since those are mixed and some are invalid for semantic reasons.
Summary of some changes
maxspan
keywordand
!=AND