Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL grammar updates and tests #49658

Merged
merged 14 commits into from
Jan 27, 2020
Merged

EQL grammar updates and tests #49658

merged 14 commits into from
Jan 27, 2020

Conversation

rw-access
Copy link
Contributor

Updated the grammar to more closely match https://github.com/endgameinc/eql/blob/master/eql/etc/eql.g (minus PEG). Also added additional unit tests from the python repository. I haven't ported over the invalid queries, since those are mixed and some are invalid for semantic reasons.

Summary of some changes

  • Added several tests
  • Removed maxspan keyword
  • Added all string permutationns
  • Lowercased tokens and removed the case insensitive lexer. This makes it more consistent with existing EQL implementations where and != AND
  • Removed digit and quoted identifier. At some point we'll need to support something like this for arbitrary fields, but I'm wondering if that should have its own syntax with potentially less ambiguities

@rw-access rw-access added the :Analytics/EQL EQL querying label Nov 27, 2019
@rw-access rw-access requested a review from costin November 27, 2019 17:14
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/EQL)

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments mainly around the intent behind the grammar - on whether loosen it covers queries not currently supported by the grammar.
It would be helpful to have accompanying queries beside the grammar changes to better correlate the two.
Thanks.

x-pack/plugin/eql/src/main/antlr/EqlBase.g4 Outdated Show resolved Hide resolved

span
: WITH MAXSPAN EQ DIGIT_IDENTIFIER
: BY expression (COMMA expression)*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expression is quite loose vs identifier as it considers foo +1 , 1/2 etc as valid grammar and it's not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, and does mirror the existing grammar.

It'll likely have performant implications on the stack, but existing implementations of EQL do support this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages are defined by their syntax - how parsing is done, whether it's grammar that leads to generation, done by hand, etc.. it's an implementation detail.

Considering we're looking at reimplementing the parsing in a different language, with a different type of parser (LL vs PEG until v0.7, LALR now in 0.8) I don't think fully mirroring the grammar is attainable nor should be a goal.
Further more in this case, the grammar is incorrect so why keep that ?
Between correctness and compatibility, we should always chose the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages are defined by their syntax

Yes I agree. Even more precisely, I would say that languages are defined by the inputs they accept. I'm not advocating for mirroring a PEG grammar in ANTLR, but I do think that we should strive the same inputs and provide the same semantics as before. We can make exceptions as needed, and fix things along the way, but they should be intentional and we should be aware.

Further more in this case, the grammar is incorrect so why keep that ?

I think this is where we disagree. The previous grammar and implementations actually used this functionality. It wasn't too loose on accident. There were a few cases where the underlying schema wasn't enough, for by to match the underlying events together and we have had to use something akin to this, for example:

sequence
   [process where ...] by some_field
   [security where ...] by someFunction(other_field)

I think that if we ever have to choose between correctness and compatibility, choosing correctness makes sense. But in this case, the design was intentional so I don't think these are in conflict. If we need to restrict this because it's not scalable within ES, we can do that. But that doesn't mean the previous version was incorrect, just because there weren't many examples that exercised arbitrary expressions for a join key.

qualifiedNames
: qualifiedName (COMMA qualifiedName)*
;

qualifiedName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add the square brackets and integers? Not sure how that qualifies a name if there's no ..
The reason the grammar defines the prefix instead of the suffix optional is to help with parsing (the identifier is always the tail of the list while everything before it is the qualifier).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets us type a.b[0].c, for instance. I believe we said indexing in arrays is out of scope, but it's sometimes necessary within pipes, because you can use events[0].some_field to refer to fields from the source events:

sequence by pid
  [process where process.name == "winword.exe"]
  [file where file.extension == "exe"]
| unique events[0].command_line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow - if we don't support arrays how would that work with pipes?

As for array, grammar-wise the way to go is through a dedicated rule:

primaryExpression
  ...
 | qualifiedName 
 | value=primaryExpression LB index=valueExpression RB

This supports both literal expressions for the index and clearly differentiates between a name and the array syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more flexible than the existing EQL, which has its pros and cons.
This would accept this, which EQL doesn't currently accept:

some_function(...)["some field"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to respond to this - to avoid having a function as an argument use qualifiedName or identifier instead in the array rule:
| value=identifier LB index=valueExpression RB

which would prevent the function rule from applying.

x-pack/plugin/eql/src/main/antlr/EqlBase.g4 Show resolved Hide resolved
x-pack/plugin/eql/src/main/antlr/EqlBase.g4 Show resolved Hide resolved
x-pack/plugin/eql/src/main/antlr/EqlBase.g4 Show resolved Hide resolved
x-pack/plugin/eql/src/main/antlr/EqlBase.g4 Show resolved Hide resolved
}

@Override
public void exitQuotedIdentifier(EqlBaseParser.QuotedIdentifierContext context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another useful callback to strip quotes from identifier "where" WHERE condition so the parser receives where instead of "where".

@rw-access
Copy link
Contributor Author

rw-access commented Dec 2, 2019

I think that it might be good to have another discussion to hash things out further over zoom. I think some places where we diverge are in our expectations of the grammar and case-sensitivity. I'll summarize some of mine, so that it's hopefully more clear than it may have been with the comments:

  • we should accept the same syntax that is being used, unless its something we're explicitly dropping support for
  • we should extend the syntax when EQL is not expressive enough. for instance, adding support for field names that are not of the form [_A-Za-z][0-9A-Za-z_]*. these changes should be made in all versions of EQL because it has been asked for already.
  • we should extend the syntax to allow for index patterns, and allow for arbitrary characters in field names
  • we shouldn't loosen the syntax in additional ways where we accepts queries that existing implementations would explicitly reject (i.e where vs WHERE)
  • tokens have been historically treated as lowercase only, and I don't see any strong advantages for changing that now and sacrificing compatibility with existing implementations of EQL
  • we should avoid introducing ambiguities to the grammar. the existing grammar is LALR(1), so we know there are no ambiguities in its current form

@colings86
Copy link
Contributor

colings86 commented Dec 11, 2019

I agree with everything from @rw-access's comment, with the exception of one point:

we should extend the syntax to allow for index patterns, and allow for arbitrary characters in field names

We should extend the syntax to all for any characters that are allowed in Elasticsearch field names. I don't think we necessarily need to allow for all arbitrary characters) and we don't need to allow for index patterns ( see #49634 and #49462 (comment))

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a round of comments. Please merge the latest branch version in - it should fix the CI failures.

process where process_name == "svchost.exe" and command_line != "* -k *";
process where process_name in ('ipconfig.exe', 'netstat.exe', 'systeminfo.exe', 'route.exe');
process where subtype.create and wildcard(command_line, "*.ost *", "*.pst *")
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file auto-generated?
I ask because it's fairly inconsistent.
👍 on comments.
If there are different types of queries, consider breaking it down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's at least partially autogenerated. most of it is copied from the unit tests for the python library. I can definitely clean it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if you think it's a good final state.
Btw, something that we do for docs is to include actual code snippets that are tested (for example:

) - this has the nice side-effect of being always consistent and correct.

Just something to keep in mind (if you have queries that are used in the docs, you can put them aside for the future).

@@ -98,18 +103,19 @@ primaryExpression
: constant #constantDefault
| functionExpression #function
| qualifiedName #dereference
| ESCAPED_IDENTIFIER #identifierEscape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not declare both ESCAPED_IDENTIFIER and IDENTIFIER under identifier as in the previous version and use IDENTIFIER where needed (such as function names - it's might be worth using STRING or declaring a different alphanumeric for that)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, function names and pipe names should be the only exceptions to that.
Would you prefer to support both of these? I figured that if you are going to escape with `, then you might as well escape the whole thing, but I'm not attached:

# ticks around the full path. dots interpreted as normal
`some-escaped-field.one-more.escaped-!-field`

# ticks around each key
`some-escaped-field`.`one-more`.`escaped-!-field`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the functionExpression rule pick a pipe/function declaration? In general I'd rather have us use qualifiedName everywhere (instead of identifier) so that user are able to use the . hierarchy and thus avoid redundancy (qualifiedName & identifier).

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments before calling it a day.

@rw-access rw-access marked this pull request as ready for review January 23, 2020 16:12
@costin
Copy link
Member

costin commented Jan 27, 2020

@elasticmachine run elasticsearch-ci/1

@rw-access rw-access changed the base branch from feature/eql to master January 27, 2020 17:26
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in and move any left over comments as different issues.

@rw-access rw-access merged commit 6f1890b into elastic:master Jan 27, 2020
@rw-access rw-access deleted the grammar-updates branch January 27, 2020 18:23
costin added a commit that referenced this pull request Feb 5, 2020
* EQL: Plug query params into the AstBuilder (#51886)

As the eventType is customizable, plug that into the parser based on the
given request.

(cherry picked from commit 5b4a3a3)

* EQL: Add field resolution and verification (#51872)

Add basic field resolution inside the Analyzer and a basic Verifier to
check for any unresolved fields.

(cherry picked from commit 7087358)

* EQL: Introduce basic execution pipeline (#51809)

Add main classes that form the 'execution' pipeline are added - most of
them have no functionality; the purpose of this PR is to add flesh out
the contract between the various moving parts so that work can start on
them independently.

(cherry picked from commit 9a1bae5)

* EQL: Add AstBuilder to convert to QL tree (#51558)

* EQL: Add AstBuilder visitors
* EQL: Add tests for wildcards and sets
* EQL: Fix licensing
* EQL: Fix ExpressionTests.java license
* EQL: Cleanup imports
* EQL: PR feedback and remove LiteralBuilder
* EQL: Split off logical plan from expressions
* EQL: Remove stray import
* EQL: Add predicate handling for set checks
* EQL: Remove commented out dead code
* EQL: Remove wildcard test, wait until analyzer

(cherry picked from commit a462700)

* EQL grammar updates and tests (#49658)

* EQL: Additional tests and grammar updates
* EQL: Add backtick escaped identifiers
* EQL: Adding keywords to language
* EQL: Add checks for unsupported syntax
* EQL: Testing updates and PR feedback
* EQL: Add string escapes
* EQL: Cleanup grammar for identifier
* EQL: Remove tabs from .eql tests

(cherry picked from commit 6f1890b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants