
feat(Lezer grammar): Rework grammar with semantic parse tree #3588

Merged (3 commits) on Oct 2, 2023

Conversation

@vanillajonathan (Collaborator) commented Oct 1, 2023

Previously the tests passed because they were crafted to pass, not because they reflected what the parse tree ought to look like. This patch reworks the grammar to output a more idiomatic parse tree.

Previously the parse tree looked like Expr(Integer,Op_bin,Integer) and now looks like BinaryExpression(Integer,ArithOp,Integer) or BinaryExpression(Integer,CompareOp,Integer).
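
For illustration, a minimal Lezer sketch of rules producing such a tree might look like this (the names mirror the PR description, but this is not the exact grammar from the patch):

```
@precedence { arith @left, compare @left }

BinaryExpression {
  expression !arith ArithOp expression |
  expression !compare CompareOp expression
}

@tokens {
  ArithOp { "+" | "-" | "*" | "/" }
  CompareOp { "==" | "!=" | "<" | ">" | "<=" | ">=" }
}
```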

  • A Docblock token is introduced.
  • Time units now have their own TimeUnit token.
  • The highlight.js file now maps the Docblock and TimeUnit tokens to t.docString and t.unit, respectively.
  • The @precedence block has been moved up to the top of the file as per upstream convention.
  • Tokens named with underscores have been renamed to camel case or pascal case as per upstream convention.
    • Query_def is now QueryDefinition. I am unsure about the name and am considering MetaDeclaration or PragmaDeclaration instead. Named args are no longer optional.
    • Def is now VariableDeclaration as per upstream convention.
    • Named_arg is now NamedArg.
    • Pipeline_stmt is now PipelineStatement. The "Statement" suffix follows upstream convention.
    • Tuple and Array have been renamed to TupleExpression and ArrayExpression respectively, matching the names in the Python grammar.
    • Range has been renamed to RangeExpression.
    • Func_call has been renamed to CallExpression as per upstream convention.
  • commaSep is introduced as an internal generic rule; it is currently used only by the ArrayExpression rule. It is a conventional template found in upstream grammars.
  • The BinaryExpression token has been introduced along with the LogicOp and CompareOp tokens, and the Op_bin token is gone, as per upstream convention.
  • The expression tokens are grouped together under the new internal expression token, as per upstream convention.
  • Interpolated_string has been renamed to interpolatedString and is now an internal token.
  • Rules that resided below the @tokens block have been moved above it, as per upstream convention.
  • Triple-quoted strings are now (kind of) supported.
  • New tests
    • Array of integers
    • Array of floats
    • Comment
    • Docblock, i.e. #!
    • Variable declaration
    • Function declaration
    • Function declaration with two arguments
    • Simple pipeline
    • Coalesce operator
    • Triple-quoted single-quoted string, i.e. '''
    • Triple-quoted double-quoted string, i.e. """
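
As a sketch, the conventional commaSep template from upstream Lezer grammars looks roughly like this (illustrative, not the exact code in this PR):

```
commaSep<content> { "" | commaSep1<content> }
commaSep1<content> { content ("," content)* }

ArrayExpression { "[" commaSep<expression> "]" }
```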

@max-sixty (Member) left a comment:


This is excellent @vanillajonathan ! Nice work! It is much better, and I see you've standardized things.

I didn't quite understand all the @ — that's OK though, I see the tests and the results look good.

I asked a couple of questions, and then we can merge...

@vanillajonathan (Collaborator, Author) replied:

> This is excellent @vanillajonathan ! Nice work! It is much better, and I see you've standardized things.

Thanks. Yes, I looked at upstream grammars and tried to adapt to their conventions for naming, syntax, and structure.

> I didn't quite understand all the @ — that's OK though, I see the tests and the results look good.

The @digit is a built-in character set that matches 0 to 9. (reference)

The @precedence is a built-in block that defines which token takes precedence when conflicts arise during parsing. For example, the Docblock token must take precedence over the Comment token, since both of them start with the # character. (reference)

@top is a built-in that defines the entry point to the grammar. (reference)

@skip is a built-in used to add tokens that can appear anywhere, such as comments and spaces. (reference)

@specialize is a built-in that specializes a token based on its content, for example recognizing certain identifiers as keywords. (reference)

@external is a built-in for loading things from a JavaScript file. It can be used for advanced rules such as semicolon handling in JavaScript or indentation in Python; we just use it to load the highlighter, as is commonly done in other upstream grammars. (reference)

@name is a built-in used to set the node name to something different than what the token is named. (reference)

@isGroup is a built-in used to create node groups. We use it to group all our expressions (such as BinaryExpression, ArrayExpression, TupleExpression, etc) under one internal expression token as is commonly done in upstream grammars. (reference)
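
Putting a few of these built-ins together, a hypothetical fragment might look like this (rule names are illustrative, not the actual PRQL grammar):

```
@top Program { statement* }

@skip { spaces | Comment }

expression[@isGroup=Expression] {
  BinaryExpression | ArrayExpression | TupleExpression
}

@tokens {
  spaces { $[ \t]+ }
  Comment { "#" ![\n]* }
  Docblock { "#!" ![\n]* }
  // Docblock must win over Comment: both start with "#"
  @precedence { Docblock, Comment }
}
```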

@max-sixty (Member) replied:

> @isGroup is a built-in used to create node groups. We use it to group all our expressions (such as BinaryExpression, ArrayExpression, TupleExpression, etc) under one internal expression token as is commonly done in upstream grammars. (reference)

OK great — this was the one I wasn't familiar with. Thanks for the explanation...

@max-sixty merged commit 12369c9 into PRQL:main on Oct 2, 2023.
@max-sixty (Member):

Thanks again @vanillajonathan !

@vanillajonathan deleted the lezer-rework branch on October 2, 2023 at 23:12.
@vanillajonathan (Collaborator, Author):

We have the F_string, R_string, and S_string tokens named with an underscore. Upstream does not use underscores in their token names. The upstream Rust grammar has a RawString token and the upstream Python grammar has a FormatString token.

We should have a Boolean token for the keywords true and false. Upstream expresses this as boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }.

We should handle the null keyword. Upstream expresses this as kw<"null">.

Right now true, false, and null are output as Ident in the parse tree.

The statement from invoice is output as Ident and Ident in the parse tree. In PRQL you can do select invoice.date or select invoice.*; that asterisk makes it something of an "expression", maybe an IdentifierExpression? The upstream Python and JavaScript grammars have a MemberExpression token to represent things like VariableName (("." | "?.") PropertyName).

We need triple-quoted f-strings, r-strings and s-strings.

We need an Escape token for escape sequences (such as \n, etc).
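
A hedged sketch of how the Boolean, null, and Escape suggestions could be expressed, following the upstream patterns quoted in this thread (names illustrative):

```
Boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }

kw<term> { @specialize[@name={term}]<identifier, term> }
// usage: kw<"null"> wherever null may appear

@tokens {
  // escape sequences such as \n and \t inside strings
  Escape { "\\" $[ntr0"'\\] }
}
```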

@max-sixty (Member) replied:

> We have the F_string, R_string, and S_string tokens named with an underscore. Upstream does not use underscore in their token names.

Fine to rename FString etc!

> We should have a Boolean token for the keywords true and false. Upstream expresses this as boolean { @specialize[@name=Boolean]<identifier, "true" | "false"> }.

> We should handle the null keyword. Upstream expresses this as kw<"null">.

> Right now true, false and null are outputted as Ident in the parse tree.

Yes, having these as keywords would be good. We don't necessarily need to distinguish them in more detail than "keyword", but it's OK to do so if you like.

> select invoice.*, that asterisk makes it something of an "expression", maybe an IdentifierExpression?

No strong view — I think it could be highlighted the same as an Ident

> We need triple-quoted f-strings, r-strings and s-strings.

Great! If these can be generic, even better... I had mixed success when I originally tried to do that.

> We need an Escape token for escape sequences (such as \n, etc).

Nice, yes.


Also feel free to take a swing at the change suggested at the bottom of #3571 — it would touch a lot of places, but I don't think it's that hard. It would be a nice next-PR to the rust parser.
