
feat(Lezer grammar): Rework grammar with semantic parse tree #3588

Merged (3 commits) on Oct 2, 2023

Conversation

@vanillajonathan (Collaborator) commented Oct 1, 2023

Previously the tests passed because they were crafted to pass, not because they reflected what the parse tree ought to look like. This patch reworks the grammar to output a more idiomatic parse tree.

Previously the parse tree looked like Expr(Integer,Op_bin,Integer) and now looks like BinaryExpression(Integer,ArithOp,Integer) or BinaryExpression(Integer,CompareOp,Integer).
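
For illustration, a minimal Lezer sketch of rules producing such a tree might look like this (the names mirror the PR description, but this is not the exact grammar from the patch):

```
@precedence { arith @left, compare @left }

BinaryExpression {
  expression !arith ArithOp expression |
  expression !compare CompareOp expression
}

@tokens {
  ArithOp { "+" | "-" | "*" | "/" }
  CompareOp { "==" | "!=" | "<" | ">" | "<=" | ">=" }
}
```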

  • A Docblock token is introduced.
  • Time units now have their own TimeUnit token.
  • The highlight.js file now maps the Docblock and TimeUnit tokens to t.docString and t.unit, respectively.
  • The @precedence block has been moved up to the top of the file as per upstream convention.
  • Tokens named with underscores have been renamed to camel case or pascal case as per upstream convention.
    • Query_def is now QueryDefinition. I am unsure about the name and am considering MetaDeclaration or PragmaDeclaration instead. Named args are no longer optional.
    • Def is now VariableDeclaration as per upstream convention.
    • Named_arg is now NamedArg.
    • Pipeline_stmt is now PipelineStatement. The "Statement" suffix follows upstream convention.
    • Tuple and Array have been renamed to TupleExpression and ArrayExpression respectively, matching the names in the Python grammar.
    • Range has been renamed to RangeExpression.
    • Func_call has been renamed to CallExpression as per upstream convention.
  • commaSep is introduced as an internal generic rule; it is currently used only by the ArrayExpression rule. It is a conventional template found in upstream grammars.
  • The BinaryExpression token has been introduced along with the LogicOp and CompareOp tokens, and the Op_bin token is gone, as per upstream convention.
  • The expression tokens are grouped together under the new internal expression token, as per upstream convention.
  • Interpolated_string has been renamed to interpolatedString and is now an internal token.
  • Rules that resided below the @tokens block have been moved above it, as per upstream convention.
  • Triple-quoted strings are now (kind of) supported.
  • New tests
    • Array of integers
    • Array of floats
    • Comment
    • Docblock, i.e. #!
    • Variable declaration
    • Function declaration
    • Function declaration with two arguments
    • Simple pipeline
    • Coalesce operator
    • Triple-quoted single-quoted string, i.e. '''
    • Triple-quoted double-quoted string, i.e. """
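
As a sketch, the conventional commaSep template from upstream Lezer grammars looks roughly like this (illustrative, not the exact code in this PR):

```
commaSep<content> { "" | commaSep1<content> }
commaSep1<content> { content ("," content)* }

ArrayExpression { "[" commaSep<expression> "]" }
```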

@max-sixty (Member) left a comment:


This is excellent @vanillajonathan ! Nice work! It is much better, and I see you've standardized things.

I didn't quite understand all the @ — that's OK though, I see the tests and the results look good.

I asked a couple of questions, and then we can merge...

@vanillajonathan (Collaborator, Author) replied:

> This is excellent @vanillajonathan ! Nice work! It is much better, and I see you've standardized things.

Thanks. Yes, I looked at upstream grammars and tried to adapt to their conventions for naming, syntax, and structure.

> I didn't quite understand all the @ — that's OK though, I see the tests and the results look good.

The @digit is a built-in character set that matches 0 to 9. (reference)

The @precedence is a built-in block that defines which token takes precedence when conflicts arise during parsing. For example, the Docblock token must take precedence over the Comment token, since both of them start with the # character. (reference)

@top is a built-in that defines the entry point to the grammar. (reference)

@skip is a built-in used to add tokens that can appear anywhere, such as comments and spaces. (reference)

@specialize is a built-in that specializes a token based on its content, for example recognizing certain identifiers as keywords. (reference)

@external is a built-in for loading things from a JavaScript file. It can be used for advanced rules such as semicolon handling in JavaScript or indentation in Python; we just use it to load the highlighter, as is commonly done in other upstream grammars. (reference)

@name is a built-in used to set the node name to something different than what the token is named. (reference)

@isGroup is a built-in used to create node groups. We use it to group all our expressions (such as BinaryExpression, ArrayExpression, TupleExpression, etc) under one internal expression token as is commonly done in upstream grammars. (reference)
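
Putting a few of these built-ins together, a hypothetical fragment might look like this (rule names are illustrative, not the actual PRQL grammar):

```
@top Program { statement* }

@skip { spaces | Comment }

expression[@isGroup=Expression] {
  BinaryExpression | ArrayExpression | TupleExpression
}

@tokens {
  spaces { $[ \t]+ }
  Comment { "#" ![\n]* }
  Docblock { "#!" ![\n]* }
  // Docblock must win over Comment: both start with "#"
  @precedence { Docblock, Comment }
}
```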

@max-sixty (Member) replied:

> @isGroup is a built-in used to create node groups. We use it to group all our expressions (such as BinaryExpression, ArrayExpression, TupleExpression, etc) under one internal expression token as is commonly done in upstream grammars. (reference)

OK great — this was the one I wasn't familiar with. Thanks for the explanation...

@max-sixty merged commit 12369c9 into PRQL:main on Oct 2, 2023.
@max-sixty (Member):

Thanks again @vanillajonathan !

@vanillajonathan deleted the lezer-rework branch on October 2, 2023 at 23:12.
@vanillajonathan (Collaborator, Author):

We have the F_string, R_string, and S_string tokens named with an underscore. Upstream does not use underscores in their token names. The upstream Rust grammar has a RawString token and the upstream Python grammar has a FormatString token.

We should have a Boolean token for the keywords true and false. Upstream expresses this as boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }.

We should handle the null keyword. Upstream expresses this as kw<"null">.

Right now true, false, and null are output as Ident in the parse tree.

The statement from invoice is output as Ident and Ident in the parse tree. In PRQL you can do select invoice.date or select invoice.*; that asterisk makes it something of an "expression", maybe an IdentifierExpression? The upstream Python and JavaScript grammars have a MemberExpression token to represent things like VariableName (("." | "?.") PropertyName).

We need triple-quoted f-strings, r-strings and s-strings.

We need an Escape token for escape sequences (such as \n, etc).
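
A hedged sketch of how the Boolean, null, and Escape suggestions could be expressed, following the upstream patterns quoted in this thread (names illustrative):

```
Boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }

kw<term> { @specialize[@name={term}]<identifier, term> }
// usage: kw<"null"> wherever null may appear

@tokens {
  // escape sequences such as \n and \t inside strings
  Escape { "\\" $[ntr0"'\\] }
}
```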

@max-sixty (Member) replied:

> We have the F_string, R_string, and S_string tokens named with an underscore. Upstream does not use underscore in their token names.

Fine to rename FString etc!

> We should have a Boolean token for the keywords true and false. Upstream expresses this as boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }.

> We should handle the null keyword. Upstream expresses this as kw<"null">.

> Right now true, false and null are outputted as Ident in the parse tree.

Yes, having these as keywords would be good. We don't necessarily need to distinguish them in more detail than "keyword", but it's OK to do so if you like.

> select invoice.*, that asterisk makes it something of an "expression", maybe an IdentifierExpression?

No strong view — I think it could be highlighted the same as an Ident

> We need triple-quoted f-strings, r-strings and s-strings.

Great! If these can be generic, even better... I had mixed success when I originally tried to do that.

> We need an Escape token for escape sequences (such as \n, etc).

Nice, yes.


Also feel free to take a swing at the change suggested at the bottom of #3571 — it would touch a lot of places, but I don't think it's that hard. It would be a nice next-PR to the rust parser.
