EQL grammar updates and tests #49658

rw-access · 2019-11-27T17:14:56Z

Updated the grammar to more closely match https://github.com/endgameinc/eql/blob/master/eql/etc/eql.g (minus PEG). Also added additional unit tests from the python repository. I haven't ported over the invalid queries, since those are mixed and some are invalid for semantic reasons.

Summary of some changes

Added several tests
Removed maxspan keyword
Added all string permutationns
Lowercased tokens and removed the case insensitive lexer. This makes it more consistent with existing EQL implementations where and != AND
Removed digit and quoted identifier. At some point we'll need to support something like this for arbitrary fields, but I'm wondering if that should have its own syntax with potentially less ambiguities

elasticmachine · 2019-11-27T17:14:58Z

Pinging @elastic/es-search (:Search/EQL)

costin

I've left some comments mainly around the intent behind the grammar - on whether loosen it covers queries not currently supported by the grammar.
It would be helpful to have accompanying queries beside the grammar changes to better correlate the two.
Thanks.

x-pack/plugin/eql/src/main/antlr/EqlBase.g4

costin · 2019-12-02T17:35:16Z

x-pack/plugin/eql/src/main/antlr/EqlBase.g4

-
-span
-    : WITH MAXSPAN EQ DIGIT_IDENTIFIER
+    : BY expression (COMMA expression)*


expression is quite loose vs identifier as it considers foo +1 , 1/2 etc as valid grammar and it's not.

This is intentional, and does mirror the existing grammar.

It'll likely have performant implications on the stack, but existing implementations of EQL do support this.

Languages are defined by their syntax - how parsing is done, whether it's grammar that leads to generation, done by hand, etc.. it's an implementation detail.

Considering we're looking at reimplementing the parsing in a different language, with a different type of parser (LL vs PEG until v0.7, LALR now in 0.8) I don't think fully mirroring the grammar is attainable nor should be a goal.
Further more in this case, the grammar is incorrect so why keep that ?
Between correctness and compatibility, we should always chose the former.

Languages are defined by their syntax

Yes I agree. Even more precisely, I would say that languages are defined by the inputs they accept. I'm not advocating for mirroring a PEG grammar in ANTLR, but I do think that we should strive the same inputs and provide the same semantics as before. We can make exceptions as needed, and fix things along the way, but they should be intentional and we should be aware.

Further more in this case, the grammar is incorrect so why keep that ?

I think this is where we disagree. The previous grammar and implementations actually used this functionality. It wasn't too loose on accident. There were a few cases where the underlying schema wasn't enough, for by to match the underlying events together and we have had to use something akin to this, for example:

sequence [process where ...] by some_field [security where ...] by someFunction(other_field)

I think that if we ever have to choose between correctness and compatibility, choosing correctness makes sense. But in this case, the design was intentional so I don't think these are in conflict. If we need to restrict this because it's not scalable within ES, we can do that. But that doesn't mean the previous version was incorrect, just because there weren't many examples that exercised arbitrary expressions for a join key.

costin · 2019-12-02T17:44:57Z

x-pack/plugin/eql/src/main/antlr/EqlBase.g4

-qualifiedNames
-    : qualifiedName (COMMA qualifiedName)*
-    ;
-
 qualifiedName


Why add the square brackets and integers? Not sure how that qualifies a name if there's no ..
The reason the grammar defines the prefix instead of the suffix optional is to help with parsing (the identifier is always the tail of the list while everything before it is the qualifier).

This lets us type a.b[0].c, for instance. I believe we said indexing in arrays is out of scope, but it's sometimes necessary within pipes, because you can use events[0].some_field to refer to fields from the source events:

sequence by pid [process where process.name == "winword.exe"] [file where file.extension == "exe"] | unique events[0].command_line

I don't follow - if we don't support arrays how would that work with pipes?

As for array, grammar-wise the way to go is through a dedicated rule:

primaryExpression ... | qualifiedName | value=primaryExpression LB index=valueExpression RB

This supports both literal expressions for the index and clearly differentiates between a name and the array syntax.

This is more flexible than the existing EQL, which has its pros and cons.
This would accept this, which EQL doesn't currently accept:

some_function(...)["some field"]

Forgot to respond to this - to avoid having a function as an argument use qualifiedName or identifier instead in the array rule:
| value=identifier LB index=valueExpression RB

which would prevent the function rule from applying.

x-pack/plugin/eql/src/main/antlr/EqlBase.g4

x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java

costin · 2019-12-02T17:56:55Z

x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java

-        }
-
-        @Override
-        public void exitQuotedIdentifier(EqlBaseParser.QuotedIdentifierContext context) {


Another useful callback to strip quotes from identifier "where" WHERE condition so the parser receives where instead of "where".

rw-access · 2019-12-02T22:34:03Z

I think that it might be good to have another discussion to hash things out further over zoom. I think some places where we diverge are in our expectations of the grammar and case-sensitivity. I'll summarize some of mine, so that it's hopefully more clear than it may have been with the comments:

we should accept the same syntax that is being used, unless its something we're explicitly dropping support for
we should extend the syntax when EQL is not expressive enough. for instance, adding support for field names that are not of the form [_A-Za-z][0-9A-Za-z_]*. these changes should be made in all versions of EQL because it has been asked for already.
we should extend the syntax to allow for index patterns, and allow for arbitrary characters in field names
we shouldn't loosen the syntax in additional ways where we accepts queries that existing implementations would explicitly reject (i.e where vs WHERE)
tokens have been historically treated as lowercase only, and I don't see any strong advantages for changing that now and sacrificing compatibility with existing implementations of EQL
we should avoid introducing ambiguities to the grammar. the existing grammar is LALR(1), so we know there are no ambiguities in its current form

colings86 · 2019-12-11T13:15:51Z

I agree with everything from @rw-access's comment, with the exception of one point:

we should extend the syntax to allow for index patterns, and allow for arbitrary characters in field names

We should extend the syntax to all for any characters that are allowed in Elasticsearch field names. I don't think we necessarily need to allow for all arbitrary characters) and we don't need to allow for index patterns ( see #49634 and #49462 (comment))

costin

Left a round of comments. Please merge the latest branch version in - it should fix the CI failures.

x-pack/plugin/eql/src/main/java/org/elasticsearch/xpack/eql/parser/EqlParser.java

x-pack/plugin/eql/src/test/java/org/elasticsearch/xpack/eql/parser/GrammarTests.java

costin · 2020-01-21T14:23:22Z

x-pack/plugin/eql/src/test/resources/valid-queries.eql

+process where process_name == "svchost.exe" and command_line != "* -k *";
+process where process_name in ('ipconfig.exe', 'netstat.exe', 'systeminfo.exe', 'route.exe');
+process where subtype.create and wildcard(command_line, "*.ost *", "*.pst *")
+;


Is this file auto-generated?
I ask because it's fairly inconsistent.
👍 on comments.
If there are different types of queries, consider breaking it down.

yeah, it's at least partially autogenerated. most of it is copied from the unit tests for the python library. I can definitely clean it up.

Only if you think it's a good final state.
Btw, something that we do for docs is to include actual code snippets that are tested (for example:

elasticsearch/x-pack/plugin/sql/qa/src/main/resources/docs/docs.csv-spec

Line 14 in 4a8e5ad

// tag::describeTable

) - this has the nice side-effect of being always consistent and correct.

Just something to keep in mind (if you have queries that are used in the docs, you can put them aside for the future).

costin · 2020-01-21T14:28:03Z

x-pack/plugin/eql/src/main/antlr/EqlBase.g4

@@ -98,18 +103,19 @@ primaryExpression
    : constant                                                                          #constantDefault
    | functionExpression                                                                #function
    | qualifiedName                                                                     #dereference
+    | ESCAPED_IDENTIFIER                                                                #identifierEscape


Why not declare both ESCAPED_IDENTIFIER and IDENTIFIER under identifier as in the previous version and use IDENTIFIER where needed (such as function names - it's might be worth using STRING or declaring a different alphanumeric for that)?

Yeah, function names and pipe names should be the only exceptions to that.
Would you prefer to support both of these? I figured that if you are going to escape with `, then you might as well escape the whole thing, but I'm not attached:

# ticks around the full path. dots interpreted as normal `some-escaped-field.one-more.escaped-!-field` # ticks around each key `some-escaped-field`.`one-more`.`escaped-!-field`

Should the functionExpression rule pick a pipe/function declaration? In general I'd rather have us use qualifiedName everywhere (instead of identifier) so that user are able to use the . hierarchy and thus avoid redundancy (qualifiedName & identifier).

costin

Left a few more comments before calling it a day.

costin · 2020-01-27T09:45:57Z

@elasticmachine run elasticsearch-ci/1

costin

Let's get this in and move any left over comments as different issues.

* EQL: Plug query params into the AstBuilder (#51886) As the eventType is customizable, plug that into the parser based on the given request. (cherry picked from commit 5b4a3a3) * EQL: Add field resolution and verification (#51872) Add basic field resolution inside the Analyzer and a basic Verifier to check for any unresolved fields. (cherry picked from commit 7087358) * EQL: Introduce basic execution pipeline (#51809) Add main classes that form the 'execution' pipeline are added - most of them have no functionality; the purpose of this PR is to add flesh out the contract between the various moving parts so that work can start on them independently. (cherry picked from commit 9a1bae5) * EQL: Add AstBuilder to convert to QL tree (#51558) * EQL: Add AstBuilder visitors * EQL: Add tests for wildcards and sets * EQL: Fix licensing * EQL: Fix ExpressionTests.java license * EQL: Cleanup imports * EQL: PR feedback and remove LiteralBuilder * EQL: Split off logical plan from expressions * EQL: Remove stray import * EQL: Add predicate handling for set checks * EQL: Remove commented out dead code * EQL: Remove wildcard test, wait until analyzer (cherry picked from commit a462700) * EQL grammar updates and tests (#49658) * EQL: Additional tests and grammar updates * EQL: Add backtick escaped identifiers * EQL: Adding keywords to language * EQL: Add checks for unsupported syntax * EQL: Testing updates and PR feedback * EQL: Add string escapes * EQL: Cleanup grammar for identifier * EQL: Remove tabs from .eql tests (cherry picked from commit 6f1890b)

Additional tests and grammar updates

ccc864d

rw-access added the :Analytics/EQL EQL querying label Nov 27, 2019

rw-access requested a review from costin November 27, 2019 17:14

costin reviewed Dec 2, 2019

View reviewed changes

rw-access added 4 commits January 16, 2020 16:59

Merge remote-tracking branch 'origin/feature/eql' into grammar-updates

7299bd1

Add backtick escaped identifiers

796033a

Adding keywords to language

37fdda6

EQL: Add checks for unsupported syntax

8ec422e

costin reviewed Jan 21, 2020

View reviewed changes

rw-access added 5 commits January 21, 2020 08:20

Merge remote-tracking branch 'origin/master' into grammar-updates

b2ee44a

EQL: Testing updates and PR feedback

c8fc383

EQL: Add string escapes

988215e

EQL: Cleanup grammar for identifier

c0deb01

Merge branch 'feature/eql' into grammar-updates

9f5eed4

costin reviewed Jan 21, 2020

View reviewed changes

rw-access added 3 commits January 23, 2020 09:00

Merge remote-tracking branch 'origin/feature/eql' into grammar-updates

937e317

Merge remote-tracking branch 'origin/feature/eql' into grammar-updates

98d06ec

EQL: Remove tabs from .eql tests

66aa579

rw-access marked this pull request as ready for review January 23, 2020 16:12

rw-access mentioned this pull request Jan 24, 2020

Escape non-alphanumeric fields for EQL #51443

Closed

rw-access changed the base branch from feature/eql to master January 27, 2020 17:26

Merge remote-tracking branch 'origin/master' into grammar-updates

e33236f

costin approved these changes Jan 27, 2020

View reviewed changes

rw-access merged commit 6f1890b into elastic:master Jan 27, 2020

rw-access deleted the grammar-updates branch January 27, 2020 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EQL grammar updates and tests #49658

EQL grammar updates and tests #49658

rw-access commented Nov 27, 2019

elasticmachine commented Nov 27, 2019

costin left a comment

costin Dec 2, 2019

rw-access Dec 2, 2019

costin Dec 7, 2019

rw-access Dec 7, 2019

costin Dec 2, 2019

rw-access Dec 2, 2019

costin Dec 7, 2019

rw-access Dec 7, 2019

costin Jan 21, 2020

costin Dec 2, 2019

rw-access commented Dec 2, 2019 •

edited

Loading

colings86 commented Dec 11, 2019 •

edited

Loading

costin left a comment

costin Jan 21, 2020

rw-access Jan 21, 2020

costin Jan 21, 2020

costin Jan 21, 2020

rw-access Jan 21, 2020

costin Jan 21, 2020

costin left a comment

costin commented Jan 27, 2020

costin left a comment

EQL grammar updates and tests #49658

EQL grammar updates and tests #49658

Conversation

rw-access commented Nov 27, 2019

elasticmachine commented Nov 27, 2019

costin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rw-access commented Dec 2, 2019 • edited Loading

colings86 commented Dec 11, 2019 • edited Loading

costin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

costin left a comment

Choose a reason for hiding this comment

costin commented Jan 27, 2020

costin left a comment

Choose a reason for hiding this comment

rw-access commented Dec 2, 2019 •

edited

Loading

colings86 commented Dec 11, 2019 •

edited

Loading