-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate the parser to the language definition v2 #638
Comments
The DSL v1 and the codegen assumes that keyword "enablement" is the same as the keyword being reserved, i.e. not usable in the identifier position. This patch uses that to correctly reserve keywords and parse correctly more code. While this still unfortunately does not handle the known cases where the atom can be accepted as identifier or a keyword, i.e. `emit` in 0.4.22, this is strictly more correct and this will help by minimizing the final diff once we switch to DSL v2, since the new definition already uses these correct enabled/reserved version ranges. Ref #638
Otherwise, the following rule: ```rust // An integer (without a dot or a fraction) is enabled in all versions: TokenDefinition( scanner = TrailingContext( scanner = Sequence([ Fragment(DecimalDigits), Optional(Fragment(DecimalExponent)) ]), not_followed_by = Fragment(IdentifierStart) ) ), ``` was a catch-all rule that could successfully lex "1.2" as "1" in all versions. Instead, this mimicks the following version from the DSL v1: ```rust scanner DecimalLiteral = ( ( ( { removed in "0.5.0" (DecimalDigits (("." (DecimalDigits ?) ) ?)) } | { introduced in "0.5.0" (DecimalDigits (("." DecimalDigits ) ?)) } | ('.' DecimalDigits) ) (DecimalExponent ?) ) not followed by IdentifierStart ) ; ``` and fixed a case in https://github.com/Xanewok/slang/tree/codegen-use-dslv2 when using parts of the DSLv2 for the codegen but I didn't open the PR yet. Ref #638
Part of #638 ### Outline The new definition is used to construct the grammar definition model from the DSL v1, which is then used by the codegen to create the parser and rule/token kinds, like before. The rough translation done here is to translate every struct/enum/repeated/separated/precedence expression as a rule/non-terminal kind (with the exception of choice nodes, which are inlined, as per #650 (comment)); the rest (i.e. trivia, fragment, token, keywords) are translated as "old" tokens/terminals. ### More context In general, this requires more polish and decision which rules are we comfortable with leaving and fixing leftover issues such as explicit `PrecedenceExpression#rule_name` added to fit the old DSL v1 model/codegen better. I specifically didn't want to touch the v1 codegen/model yet, because it helped being able to incrementally migrate to a new definition and I wanted to limit changes done in a single PR, so that it's easier to review. We can modify the codegen/model to fit the v2 model better once we fully migrate and remove the old DSL v1. The translation/construction is done at build "run-time" rather than inside the newly introduced definition "compiler" proc macro. There's not much explicit reason for that, other than the fact that it was quicker for me to work on (quicker dev cycle) and the logic is just plain Rust code. This can be moved inside the proc macro later on, if we decide that it's better that way. ### Differences Because the DSL v2 takes a more named and structured approach (as it's designed to be a model for the strongly typed CST/AST), translating it 1-1 to our current parser structure is impossible or even desired. There are some rules that were introduced and it doesn't make much sense for some to replicate the old behaviour. 1. At the moment, it's impossible to specify an optional *separated* item, so the following ones are introduced as wrappers: - `TupleValue` (used in `TupleValuesList`) - `TupleMemberDeconstruction` (used in `TupleMembersList`) 2. Some sequences in the structs are optional, but these are now introduced and named, rather than inlined as before: - `IndexAccessEnd` - `ImportAlias` - `UsingAlias` - `VariableDeclarationValue` 3. ~Previously inlined sequence parsers now have to be named~ (fixed with #657) - `NumericExpression` is now split into two, now named, choices: `HexNumberExpression` and `DecimalNumberExpression` Moreover, the following was done to bring back the old CST shape as much as possible: - Some of the new rules where renamed back to the old ones, e.g. some repeated nodes have `List` suffix again - The (Yul)Statement are outlined (as in v0/v1) and re-introduced as a struct wrapper - `ArgumentsDeclaration` is outlined (as in v0/v1) as well - `FunctionCallOptions` is outlined again as a `(NamedArgs NamedArgs*)` rather than `NamedArgs | NamedArgs+` to better match the old CST in the singular case (but differs from the old CST in the multiple case) - #653 This was done to unify the definitions where possible and to help with reviewing the CST changes by minimizing the shape differences. Once we move off v0 (#637) and v1 (#638), we will be free to change the final shape, since the old definitions will stop tying us down quite a bit. ### Outstanding issues - [x] Separate outstanding FIXMEs that need to be done after the migration into a task list * LeadingTrivia in v2 is using the v1 definition for now * SourceUnit is hacked to always be followed by Leading Trivia (copies v1; trivia model in v2 is a bit different) * Clean up `PrecedenceExpression#rule_name` and adapt the codegen model to fit the v2 definition better * Stop leaking identifiers by adapting either v1 or v2 models * Keyword trie inclusion should be reworked to not require synthetic rules over all keywords (v1 model) and to properly allow keywords to overlap identifiers (#568) - [x] Fix the Prettier error caused by newly possible nested, empty rules: ``` [error] crates/solidity/testing/snapshots/cst_output/TupleExpression/empty/generated/0.4.11-success.yml: SyntaxError: All collection items must start at the same column (11:9) [error] 9 | - TupleExpression (Rule): # 0..3 "( )" [error] 10 | - OpenParen (Token): "(" # 0..1 [error] > 11 | - TupleValuesList (Rule): [] # 1..1 [error] | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [error] > 12 | - TupleValue (Rule): [] # 1..1 [error] | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [error] > 13 | - CloseParen (Token): ")" # 2..3 [error] | ^ [error] 14 | ```
Part of #638 This leverages the existing trie construction for the literal scanners. By not explicitly disambiguating ourselves, we won't have to be careful about missing something when updating the grammar. cc @OmarTawfik since we talked about this recently
…684) Part of #637 #638 The other use case is in `codegen/spec/src/grammar.rs` but that the rest of the code uses v0 for spec-generation, so I didn't change that not to mix definitions and not to step on your toes @OmarTawfik for #637. --------- Co-authored-by: Omar Tawfik <[email protected]>
Part of #638 As we discussed, inlining some nodes types causes us to lose information that is otherwise hard/inconvenient to reconstruct and this hopefully increases the usability of CST alone.
Part of #638 This PR: - Removes the `PrecedenceExpression::rule_name` in favour of using the name directly for its rule/node kind - Deduplicates the individual precedence expressions in the parser codegen (regressed in #687 (comment)) - Allows to parse individual precedence expressions and adds rule kinds for them - ... so we can remove `ProductionKind` in favour of the now-complete `RuleKind` I can split this up but since all of this is interconnected, it made sense to work on them at the same time and submit work that solves all of the aforementioned issues. --------- Co-authored-by: Omar Tawfik <[email protected]>
Closes #568 There is still one outstanding issue where we return a `Vec<TokenKind>` from `next_token`; it'd like to return a more specialized type and ideally pass it on stack (2x2 bytes), rather than on-heap (extra 3x8 bytes for the Vec handle + indirection). We should name it better and properly show that we can return at most 2 token kinds (single token kind or identifier + kw combo). To do: - [x] Return tokens from `next_token` via stack Apart from that, I think this is a more correct approach than #598, especially accounting for the new keyword definition format in DSL v2. The main change is that we only check the keyword trie and additionally the (newly introduced) compound keyword scanners only after the token has been lexed as an identifier. For each context, we collect Identifier scanners used by the keywords and attempt promotion there. The existing lexing performance is not impacted from what I've seen when running the sanctuary tests and I can verify (incl. CST tests) that we now properly parse source that uses contextual keywords (e.g. `from`) and that the compound keywords (e.g. `ufixedMxN`) are properly versioned. This adapts the existing `codegen_grammar` interface that's a leftover from DSLv1; I did that to work on finishing #638; once this is merged and we now properly parse contextual keywords, I'll move to clean it up and reduce the parser codegen indirection (right now we go from v2 -> v1 model -> code generator -> Tera templates; it'd like to at least cut out the v1 model and/or simplify visiting v2 from the existing `CodeGenerator`). Please excuse the WIP comments in the middle; the first and the last ones should make sense when reviewing. I can simplify this a bit for review, if needed.
…rsing (#733) Checks boxes in #640 #638 This adapts `ParserFunction` to always attempt parsing the leading trivia after we had a succesful match. This has a nice side benefit of not overwriting tokens that would allow for more progress when collecting the final trivia (unlike previous [AttemptedParseRule, EndOfFileTrivia] sequence). I'm not so happy with the resulting tree shape, however; IMO we should either: 1. return a "tree" that has both parents - the first one being the expected parsed rule node but also the second parent being the "leading trivia" 2. or at least introduce a notion of a virtual EndOfFile token that this final end-of-file trivia can belong to. Otherwise, a lone, final LeadingTrivia child belonging to the rule node seems misplaced and it's not intuitive for me to find the remaining trivia in that place. I'm open for other suggestions!
Closes #863 - The entirety of "runtime" codegen is now following a strict MVC model, and the source code is rendered from the same model object. We should be able to further improve/refactor this model object in #638 - Complex model construction can be strongly-typed in Rust, while rendering is left to templates, as they are much easier to read and modify, and also have powerful macros/imports/callbacks, etc... - I also (partially) improved the DX when working with templates in terminal, by rendering their errors as an `ariadne` report, with best-guess source location in the template. A temporary workaround waiting for Keats/tera#885 - Now both stubs and output crates are rendered from the same source templates, removing a few internal hacks and the need for `![allow(dead_code)]` and `![allow(unused)]` everywhere (will follow up on this cleanup separately). - We no longer have to maintain explicit paths and copy files around in the code generator. All templates generate their output file relative to their location, which means we can split and refactor code into as many templates/sub-modules as needed, following standard rust idioms.
Part of #638, removes the `codegen_grammar` crate Instead of having the definitions and the parser generation separate, we now co-locate this and provide a parser-specific model (for MVC rendering introduced in #962) for the code generation. This mostly moves code around and documents a bit more what we already have. This is fairly non-controversial and I am for #638 to be incremental, so this is meant to progress the issue rather than fix it in one giant PR to keep up the velocity and land it as soon as possible, accounting for my upcoming move to Hardhat soon. If this generally makes sense, I'd love to have a rubber stamp for it; for now, I'll work on top of this to further migrate off DSLv1 definitions.
…#977) Part of #638, ticks the "stop leaking identifiers" box - We leverage the existing `codegen_language_definition::model::Identifier` in PG rather than rely on `&'static str`s from DSL v1, so we don't leak anymore and we start using the DSLv2 type in PG - Similarly, we move some of the old DSLv1 utilities in favor of directly using the DSLv2 counterparts - Version-dependent codegen is now extracted into a dedicated module rather than split ad-hoc in trie.rs and parser_definition.rs - We separate the accumulator-specific state from the model itself to better convey what is required and necessary during the accumulation and what data is finally being derived from the accumulation and stored in the resulting `ParserModel` used in the MVC fashion for PG codegen The remaining work is to stop using the DSLv1 `Grammar(Visitor)` but rather directly collect items from and use the DSLv2 model to construct the final `ParserModel` for codegen. I wouldn't look too much into details like duplicated doc comments for the state accumulator/model because it's going away in the next PR, similarly naming for stuff like `Versioned(Quote)` that's also renamed temporarily and will be changed/removed since we will move off the DSLv1 model.
…tion logic (#991) Work towards #638 Mainly: - We bundle `quote`ing or any (old) parser codegen now in the new `generator/src/parser/codegen` module - To expand the `kinds.rs.jinja2` we use a new, dedicated `KindsModel` to disentangle, limiting the workarounds like removing the built-in labels only after visiting the grammar - simplified collecting the `ParserModel.referenced_versions` (it's de facto `Language::collect_breaking_changes` but with a caveat; I'll submit a separate PR for that since it's not 100% obvious and would change how we define `ParserModel`). - and some minor fixes like `pub(super)` visibility or updated referenced issues. Now, the remaining work will be to actually revamp how we visit the resulting grammar.
While implementing #1003 I started thinking whether using v2 types only makes sense for our existing parser generator. The goal of #1003 and this issue in general was to reduce duplication between the very similar types and get rid of the bridging types whenever possible. While it was enabled by the mechanical and lexer-oriented design of The are a couple of points here:
Having the above in mind, while there are some unnecessary translation hacks still that I'd like to smooth out, I would probably reconsider dropping the v1 interface entirely. cc @AntonyBlakey @OmarTawfik for visibility |
Part of #638 Follow-up to #991 Pretty straightforward: instead of visiting the previously built v1 definition structure, we defer to `Language::collect_breaking_changes` as the definitions overlap - the breaking changes are defined as versions in which the syntax items may be evaluated differently, which means that these are exactly the versions that will be referenced for the conditional syntax item evaluation in the parser/lexer.
Based on #1002 Part of #638 The DSL v2's `model::Scanner` is almost 1:1 usable with our current parser codegen, with the exception of versioned scanners. In short we: - introduce a helper type `VersionedScanner` that is used in place of the old `ScannerDefinitionNode::Versioned` - introduce a `ScannerExt` trait implemented for both `VersionedScanner` and the `model::Scanner`, which is responsible for generating the main scanning logic - Repurposes `ScannerDefinition` slightly to surface more of the scanner-related logic for trivia/fragment/token - implements this directly for `model::{TriviaItem,FragmentItem,TokenItem}` and stores it in the v1 Grammar struct - similarly, uses `model::KeywordItem` directly in place of v1's `KeywordScannerDefinition` as they shared the same functionality
@OmarTawfik from what I understand:
If not, then, I'd propose that we treat it as general improvement to the trivia handling rather than being a core component of the DSL v2 (we did already migrate the parser to the DSL v2 and we're happy with the current behavior) and we should probably track it separately if we're not already happy with what we have, as per above. The listed scope for this issue almost feels like a quarterly KR and I'd love if we could finally close it, once #1018 is merged (and we revert the string rule) 😅 |
sounds good.
IIRC, the hack we added is to "reuse" leading trivia as EOF trivia, and the proper fix was to construct a specific EOF parser in codegen, with its definition being Is that still the case? If so, it is definitely not blocking, and I would not prioritize it over other important tasks we have, but I think we should still do it. If you have a full plate, please feel free to put it in a separate issue and assign it to me, and we can close this one. |
Ticks the box in #638 (_Keyword trie inclusion should be reworked to not require synthetic rules over all keywords_) This exposes the resolution step rather than treating it as an implementation detail and doesn't try to shoehorn the DSL v2 items to the old `Grammar` model as much that the PG was based on. Moreover, this breaks up the existing grammar visitor and tries to collect or calculate more properties upfront directly from DSL v2 in order to be more explicit that they not need to depend on the collector state that tracked everything. I didn't submit it initially because I felt I could polish it slightly further (see the `TODO` note) but since I'm focused on EDR now and this cleans up the last important box in #638 (the other one is simply reverting some rules for string literals), I think it's worth reviewing in the current state.
I remember that we did discuss what to do with trailing trivia and that it's not so obvious, as for example
would mean that this conceptually would belong to the The way we have this setup is that the trailing trivia has to be a conceptually a subset of the leading trivia and thus the leading trivia is the trivia parser, which means if we want to accept all of the trivia items at EOF, then the leading trivia definition is correct for that use case and not a stretch, from what I understand (but, to be clear, we do have a "hack" of implicitly parsing the leading trivia rather than it having a dedicated definition in the definition.rs). If we want to use the DSLv2's definition of (as added by this PR) ZeroOrMore(Choice([
Trivia(Whitespace),
Trivia(SingleLineComment),
Trivia(MultilineComment)
])),
Choice([Trivia(EndOfLine), EndOfInput]) then for it to be correct, we'd have to change it to something like: trailing_trivia = Choice([
// A single line comment
Sequence([
Optional(Trivia(Whitespace)),
Optional(Choice([
Trivia(SingleLineComment),
Trivia(MultilineCommentThatDoesNotContainAnEndOfLine)
])),
Trivia(EndOfLine),
])
// De facto end-of-file trivia
// (separated in order to disallow multi-line trivias as trailing in the general case)
Sequence([
ZeroOrMore(Choice([
/* trivias listed in leading_trivia */
])),
Trivia(EndOfFile),
])
]), At which point we're de facto reintroducing end_of_file_trivia, now that I think about it. Given that it's different enough from the regular trailing trivia, should we just bite the bullet and be more explicit about it to not confuse the user and be clear about expectations? Introducing a dedicated notion of end_of_file trivia would also mean we don't need to introduce this virtual @AntonyBlakey @OmarTawfik I don't think we did write down our different conclusions and rationale while we were discussing this on multiple occasions. If you're fine with it, I'd propose that we separate an issue/thread dedicated to it as I feel we had already too many items bundled in this issue and the decision is not clear-cut at this moment. |
I've opened #1020 to track this. |
Sounds good. Let's follow up on #1020. Thanks! |
…Foundation#1002) Part of NomicFoundation#638 Follow-up to NomicFoundation#991 Pretty straightforward: instead of visiting the previously built v1 definition structure, we defer to `Language::collect_breaking_changes` as the definitions overlap - the breaking changes are defined as versions in which the syntax items may be evaluated differently, which means that these are exactly the versions that will be referenced for the conditional syntax item evaluation in the parser/lexer. Refactor `BuiltInLabel` to avoid duplication (NomicFoundation#992) Spin off of NomicFoundation#976 Moves the `BuiltInLabel` enum from the parser generator into the language definition and remove duplication in the `kinds` template. add wit generating wit and glue stub adaptors, wit feature flag glue macros remove wit_bindgen fix wit gen paths add wit-bindgen export the kinds pub export macro for wit improve export macro world => slang fully implement glue convert query matches refactor ffi glue macros refactor wit variant rather than enum back to enum
…Foundation#1002) Part of NomicFoundation#638 Follow-up to NomicFoundation#991 Pretty straightforward: instead of visiting the previously built v1 definition structure, we defer to `Language::collect_breaking_changes` as the definitions overlap - the breaking changes are defined as versions in which the syntax items may be evaluated differently, which means that these are exactly the versions that will be referenced for the conditional syntax item evaluation in the parser/lexer. Refactor `BuiltInLabel` to avoid duplication (NomicFoundation#992) Spin off of NomicFoundation#976 Moves the `BuiltInLabel` enum from the parser generator into the language definition and remove duplication in the `kinds` template. add wit generating wit and glue stub adaptors, wit feature flag glue macros remove wit_bindgen fix wit gen paths add wit-bindgen export the kinds pub export macro for wit improve export macro world => slang fully implement glue convert query matches refactor ffi glue macros refactor wit variant rather than enum back to enum
…Foundation#1002) Part of NomicFoundation#638 Follow-up to NomicFoundation#991 Pretty straightforward: instead of visiting the previously built v1 definition structure, we defer to `Language::collect_breaking_changes` as the definitions overlap - the breaking changes are defined as versions in which the syntax items may be evaluated differently, which means that these are exactly the versions that will be referenced for the conditional syntax item evaluation in the parser/lexer.
At the moment we use our Rust macros 1.0 (aka DSL v1) to define the language for our parser. We already have:
So to cut down the different definitions, we should migrate to the newest definition (see #617).
Migration PR:
Impacts the AST:
allow_empty
from DSL v2 #693PrecedenceExpression#rule_name
and adapt the codegen model to fit the v2 definition betterPrecedence
items should also produce their ownNonTerminalKind
kind, wrapping the single child (PrecedenceExpression
orPrimaryExpression
).PrecedenceExpression
should get their ownparse_XXX()
function.Does not impact the AST:
[ ] Re-introduce EOF/EOI (End of Input) virtual token/trivia parser[ ] Change LeadingTrivia definition in v2 to allow being terminated by EOI as well (changed in Use DSL v2 for parser codegen #650)(tracked in Define end-of-file trivia as part of the language definition #1020)ZeroOrMore(AnyTriviaItem)
at the end when parsing any production withpub fn parse
0.7.0
#800)Scanner::not_followed_by
completely and automatically calculate it instead (context).codegen_grammar
crate. (Co-locate parser generator logic #972)The text was updated successfully, but these errors were encountered: