Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL grammar updates and tests #49658

Merged
merged 14 commits into from
Jan 27, 2020
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
150 changes: 72 additions & 78 deletions x-pack/plugin/eql/src/main/antlr/EqlBase.g4
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,6 @@

grammar EqlBase;

tokens {
DELIMITER
}

singleStatement
: statement EOF
Expand All @@ -19,45 +16,54 @@ singleExpression
;

statement
: query (PIPE pipe)*
: query pipe*
;

query
: sequence
| join
| condition
| eventQuery
;

sequenceParams
: WITH (MAXSPAN EQ timeUnit)
;

sequence
: SEQUENCE (by=joinKeys)? (span)?
match+
(UNTIL match)?
: SEQUENCE (by=joinKeys sequenceParams? | sequenceParams by=joinKeys?)?
sequenceTerm sequenceTerm+
(UNTIL sequenceTerm)?
;

join
: JOIN (by=joinKeys)?
match+
(UNTIL match)?
joinTerm joinTerm+
(UNTIL joinTerm)?
;

pipe
: kind=IDENTIFIER (booleanExpression (COMMA booleanExpression)*)?
: PIPE kind=IDENTIFIER (booleanExpression (COMMA booleanExpression)*)?
;


joinKeys
: BY qualifiedNames
;

span
: WITH MAXSPAN EQ DIGIT_IDENTIFIER
: BY expression (COMMA expression)*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expression is quite loose vs identifier as it considers foo +1 , 1/2 etc as valid grammar and it's not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, and does mirror the existing grammar.

It'll likely have performant implications on the stack, but existing implementations of EQL do support this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages are defined by their syntax - how parsing is done, whether it's grammar that leads to generation, done by hand, etc.. it's an implementation detail.

Considering we're looking at reimplementing the parsing in a different language, with a different type of parser (LL vs PEG until v0.7, LALR now in 0.8) I don't think fully mirroring the grammar is attainable nor should be a goal.
Further more in this case, the grammar is incorrect so why keep that ?
Between correctness and compatibility, we should always chose the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages are defined by their syntax

Yes I agree. Even more precisely, I would say that languages are defined by the inputs they accept. I'm not advocating for mirroring a PEG grammar in ANTLR, but I do think that we should strive the same inputs and provide the same semantics as before. We can make exceptions as needed, and fix things along the way, but they should be intentional and we should be aware.

Further more in this case, the grammar is incorrect so why keep that ?

I think this is where we disagree. The previous grammar and implementations actually used this functionality. It wasn't too loose on accident. There were a few cases where the underlying schema wasn't enough, for by to match the underlying events together and we have had to use something akin to this, for example:

sequence
   [process where ...] by some_field
   [security where ...] by someFunction(other_field)

I think that if we ever have to choose between correctness and compatibility, choosing correctness makes sense. But in this case, the design was intentional so I don't think these are in conflict. If we need to restrict this because it's not scalable within ES, we can do that. But that doesn't mean the previous version was incorrect, just because there weren't many examples that exercised arbitrary expressions for a join key.

;

match
: LB condition RB (by=joinKeys)?
joinTerm
: subquery (by=joinKeys)?
;

sequenceTerm
: subquery (FORK (EQ booleanValue)?)? (by=joinKeys)?
;

subquery
: LB eventQuery RB
;

condition
: event=qualifiedName WHERE expression
eventQuery
: event=identifier WHERE expression
;

expression
Expand All @@ -66,6 +72,7 @@ expression

booleanExpression
: NOT booleanExpression #logicalNot
| relationship=identifier OF subquery #processCheck
| predicated #booleanDefault
| left=booleanExpression operator=AND right=booleanExpression #logicalBinary
| left=booleanExpression operator=OR right=booleanExpression #logicalBinary
Expand All @@ -81,9 +88,7 @@ predicated
// dedicated calls for each branch are not used to reuse the NOT handling across them
// instead the property kind is used for differentiation
predicate
: NOT? kind=BETWEEN lower=valueExpression AND upper=valueExpression
| NOT? kind=IN LP valueExpression (COMMA valueExpression)* RP
| NOT? kind=IN LP query RP
: NOT? kind=IN LP valueExpression (COMMA valueExpression)* RP
;

valueExpression
Expand All @@ -98,18 +103,19 @@ primaryExpression
: constant #constantDefault
| functionExpression #function
| qualifiedName #dereference
| ESCAPED_IDENTIFIER #identifierEscape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not declare both ESCAPED_IDENTIFIER and IDENTIFIER under identifier as in the previous version and use IDENTIFIER where needed (such as function names - it's might be worth using STRING or declaring a different alphanumeric for that)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, function names and pipe names should be the only exceptions to that.
Would you prefer to support both of these? I figured that if you are going to escape with `, then you might as well escape the whole thing, but I'm not attached:

# ticks around the full path. dots interpreted as normal
`some-escaped-field.one-more.escaped-!-field`

# ticks around each key
`some-escaped-field`.`one-more`.`escaped-!-field`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the functionExpression rule pick a pipe/function declaration? In general I'd rather have us use qualifiedName everywhere (instead of identifier) so that user are able to use the . hierarchy and thus avoid redundancy (qualifiedName & identifier).

| LP expression RP #parenthesizedExpression
;

functionExpression
: identifier LP (expression (COMMA expression)*)? RP
: name=identifier LP (expression (COMMA expression)*)? RP
;

constant
: NULL #nullLiteral
| number #numericLiteral
| booleanValue #booleanLiteral
| STRING+ #stringLiteral
| string #stringLiteral
;

comparisonOperator
Expand All @@ -120,26 +126,16 @@ booleanValue
: TRUE | FALSE
;

qualifiedNames
: qualifiedName (COMMA qualifiedName)*
;

qualifiedName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add the square brackets and integers? Not sure how that qualifies a name if there's no ..
The reason the grammar defines the prefix instead of the suffix optional is to help with parsing (the identifier is always the tail of the list while everything before it is the qualifier).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets us type a.b[0].c, for instance. I believe we said indexing in arrays is out of scope, but it's sometimes necessary within pipes, because you can use events[0].some_field to refer to fields from the source events:

sequence by pid
  [process where process.name == "winword.exe"]
  [file where file.extension == "exe"]
| unique events[0].command_line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow - if we don't support arrays how would that work with pipes?

As for array, grammar-wise the way to go is through a dedicated rule:

primaryExpression
  ...
 | qualifiedName 
 | value=primaryExpression LB index=valueExpression RB

This supports both literal expressions for the index and clearly differentiates between a name and the array syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more flexible than the existing EQL, which has its pros and cons.
This would accept this, which EQL doesn't currently accept:

some_function(...)["some field"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to respond to this - to avoid having a function as an argument use qualifiedName or identifier instead in the array rule:
| value=identifier LB index=valueExpression RB

which would prevent the function rule from applying.

: (identifier DOT)* identifier
: identifier (DOT identifier | LB INTEGER_VALUE+ RB)*
;

identifier
: quoteIdentifier
| unquoteIdentifier
: IDENTIFIER
;

quoteIdentifier
: QUOTED_IDENTIFIER #quotedIdentifier
;

unquoteIdentifier
: IDENTIFIER #unquotedIdentifier
| DIGIT_IDENTIFIER #digitIdentifier
timeUnit
: number unit=IDENTIFIER?
;

number
Expand All @@ -151,31 +147,26 @@ string
: STRING
;

AND: 'AND';
ANY: 'ANY';
ASC: 'ASC';
BETWEEN: 'BETWEEN';
BY: 'BY';
CHILD: 'CHILD';
DESCENDANT: 'DESCENDANT';
EVENT: 'EVENT';
FALSE: 'FALSE';
IN: 'IN';
JOIN: 'JOIN';
MAXSPAN: 'MAXSPAN';
NOT: 'NOT';
NULL: 'NULL';
OF: 'OF';
OR: 'OR';
SEQUENCE: 'SEQUENCE';
TRUE: 'TRUE';
UNTIL: 'UNTIL';
WHERE: 'WHERE';
WITH: 'WITH';
AND: 'and';
rw-access marked this conversation as resolved.
Show resolved Hide resolved
BY: 'by';
rw-access marked this conversation as resolved.
Show resolved Hide resolved
FALSE: 'false';
FORK: 'fork';
IN: 'in';
JOIN: 'join';
MAXSPAN: 'maxspan';
NOT: 'not';
NULL: 'null';
OF: 'of';
OR: 'or';
SEQUENCE: 'sequence';
TRUE: 'true';
UNTIL: 'until';
WHERE: 'where';
WITH: 'with';

// Operators
EQ : '=' | '==';
NEQ : '<>' | '!=';
NEQ : '!=';
LT : '<';
LTE : '<=';
GT : '>';
Expand All @@ -194,9 +185,16 @@ LP: '(';
RP: ')';
PIPE: '|';


ESCAPED_IDENTIFIER
: '`' (~'`')* '`'
;

STRING
: '\'' ( ~'\'')* '\''
| '"' ( ~'"' )* '"'
: '\'' ('\\' [btnfr"'\\] | ~[\r\n'\\])* '\''
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"'
| '?"' ('\\"' |~["\r\n])* '"'
| '?\'' ('\\\'' |~['\r\n])* '\''
;

INTEGER_VALUE
Expand All @@ -210,31 +208,24 @@ DECIMAL_VALUE
| DOT DIGIT+ EXPONENT
;

// make @timestamp not require escaping, since @ has no other meaning
IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' | '@' )*
;

DIGIT_IDENTIFIER
: DIGIT (LETTER | DIGIT | '_' | '@')+
: (LETTER | '_' | '@') (LETTER | DIGIT | '_')*
;

QUOTED_IDENTIFIER
rw-access marked this conversation as resolved.
Show resolved Hide resolved
: '"' ( ~'"' | '""' )* '"'
;

fragment EXPONENT
: 'E' [+-]? DIGIT+
: [Ee] [+-]? DIGIT+
rw-access marked this conversation as resolved.
Show resolved Hide resolved
;

fragment DIGIT
: [0-9]
;

fragment LETTER
: [A-Z]
: [A-Za-z]
;

SIMPLE_COMMENT
LINE_COMMENT
: '//' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN)
;

Expand All @@ -246,9 +237,12 @@ WS
: [ \r\n\t]+ -> channel(HIDDEN)
;


// Catch-all for anything we can't recognize.
// We use this to be able to ignore and recover all the text
// when splitting statements with DelimiterLexer
/*
UNRECOGNIZED
: .
;
;
*/
87 changes: 0 additions & 87 deletions x-pack/plugin/eql/src/main/antlr/EqlBase.tokens

This file was deleted.

Loading