Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow keywords to overlap identifiers #568

Closed
AntonyBlakey opened this issue Aug 19, 2023 · 1 comment · Fixed by #723
Closed

Allow keywords to overlap identifiers #568

AntonyBlakey opened this issue Aug 19, 2023 · 1 comment · Fixed by #723
Assignees

Comments

@AntonyBlakey
Copy link
Contributor

This is valid code in 0.4.21:

contract ClientReceipt {
    event Deposit();
    function deposit() public payable {
        uint256 emit;
        emit Deposit();
    }
}

The key thing is that the emit statement is enabled from 0.4.21, but emit is still valid as an identifier. We currently cannot handle this, and it makes it impossible to compute over the versioning of tokens vs. identifiers.

@Xanewok
Copy link
Contributor

Xanewok commented Sep 4, 2023

https://docs.soliditylang.org/en/v0.8.21/grammar.html#a4.SolidityParser.identifier

In 0.8.21 from, error, revert and global can also overlap:

pragma solidity ^0.8.21;

contract Contract {
    error MyError();

    function deposit() public payable {
        uint256 error;
        uint256 from;
        uint256 revert;
        uint256 global;

        revert MyError();
    }
}

@Xanewok Xanewok removed this from Slang - 2023 Q3 Nov 2, 2023
Xanewok added a commit to Xanewok/slang that referenced this issue Nov 12, 2023
This allows using `constructor` keyword in 0.4.22 but some
keywords are too eagerly reserved like `leave` in Yul since 0.6.0,
whereas it's actually reserved in 0.7.1.

See NomicFoundation#568.
github-merge-queue bot pushed a commit that referenced this issue Nov 28, 2023
Part of #638 

### Outline

The new definition is used to construct the grammar definition model
from the DSL v1, which is then used by the codegen to create the parser
and rule/token kinds, like before.

The rough translation done here is to translate every
struct/enum/repeated/separated/precedence expression as a
rule/non-terminal kind (with the exception of choice nodes, which are
inlined, as per
#650 (comment));
the rest (i.e. trivia, fragment, token, keywords) are translated as
"old" tokens/terminals.

### More context

In general, this requires more polish and decision which rules are we
comfortable with leaving and fixing leftover issues such as explicit
`PrecedenceExpression#rule_name` added to fit the old DSL v1
model/codegen better.

I specifically didn't want to touch the v1 codegen/model yet, because it
helped being able to incrementally migrate to a new definition and I
wanted to limit changes done in a single PR, so that it's easier to
review. We can modify the codegen/model to fit the v2 model better once
we fully migrate and remove the old DSL v1.

The translation/construction is done at build "run-time" rather than
inside the newly introduced definition "compiler" proc macro. There's
not much explicit reason for that, other than the fact that it was
quicker for me to work on (quicker dev cycle) and the logic is just
plain Rust code. This can be moved inside the proc macro later on, if we
decide that it's better that way.

### Differences

Because the DSL v2 takes a more named and structured approach (as it's
designed to be a model for the strongly typed CST/AST), translating it
1-1 to our current parser structure is impossible or even desired.

There are some rules that were introduced and it doesn't make much sense
for some to replicate the old behaviour.

1. At the moment, it's impossible to specify an optional *separated*
item, so the following ones are introduced as wrappers:
   - `TupleValue` (used in `TupleValuesList`)
   - `TupleMemberDeconstruction` (used in `TupleMembersList`)
2. Some sequences in the structs are optional, but these are now
introduced and named, rather than inlined as before:
    - `IndexAccessEnd`
    - `ImportAlias`
    - `UsingAlias`
    - `VariableDeclarationValue`
3. ~Previously inlined sequence parsers now have to be named~ (fixed
with #657)
- `NumericExpression` is now split into two, now named, choices:
`HexNumberExpression` and `DecimalNumberExpression`
    
Moreover, the following was done to bring back the old CST shape as much
as possible:
- Some of the new rules where renamed back to the old ones, e.g. some
repeated nodes have `List` suffix again
- The (Yul)Statement are outlined (as in v0/v1) and re-introduced as a
struct wrapper
- `ArgumentsDeclaration` is outlined (as in v0/v1) as well
- `FunctionCallOptions` is outlined again as a `(NamedArgs NamedArgs*)`
rather than `NamedArgs | NamedArgs+` to better match the old CST in the
singular case (but differs from the old CST in the multiple case)
- #653 

This was done to unify the definitions where possible and to help with
reviewing the CST changes by minimizing the shape differences. Once we
move off v0 (#637) and v1 (#638), we will be free to change the final
shape, since the old definitions will stop tying us down quite a bit.
    
### Outstanding issues
- [x] Separate outstanding FIXMEs that need to be done after the
migration into a task list
   * LeadingTrivia in v2 is using the v1 definition for now
* SourceUnit is hacked to always be followed by Leading Trivia (copies
v1; trivia model in v2 is a bit different)
* Clean up `PrecedenceExpression#rule_name` and adapt the codegen model
to fit the v2 definition better
   * Stop leaking identifiers by adapting either v1 or v2 models
* Keyword trie inclusion should be reworked to not require synthetic
rules over all keywords (v1 model) and to properly allow keywords to
overlap identifiers (#568)
- [x] Fix the Prettier error caused by newly possible nested, empty
rules:
```
[error] crates/solidity/testing/snapshots/cst_output/TupleExpression/empty/generated/0.4.11-success.yml: SyntaxError: All collection items must start at the same column (11:9)
[error]    9 |   - TupleExpression (Rule): # 0..3 "( )"
[error]   10 |       - OpenParen (Token): "(" # 0..1
[error] > 11 |       - TupleValuesList (Rule): [] # 1..1
[error]      |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[error] > 12 |           - TupleValue (Rule): [] # 1..1
[error]      | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[error] > 13 |       - CloseParen (Token): ")" # 2..3
[error]      | ^
[error]   14 |
```
@Xanewok Xanewok moved this to In Progress in Slang - 2023 Q4 Dec 19, 2023
github-merge-queue bot pushed a commit that referenced this issue Jan 8, 2024
Closes #568 

There is still one outstanding issue where we return a `Vec<TokenKind>`
from `next_token`; it'd like to return a more specialized type and
ideally pass it on stack (2x2 bytes), rather than on-heap (extra 3x8
bytes for the Vec handle + indirection). We should name it better and
properly show that we can return at most 2 token kinds (single token
kind or identifier + kw combo).

To do:
- [x] Return tokens from `next_token` via stack

Apart from that, I think this is a more correct approach than #598,
especially accounting for the new keyword definition format in DSL v2.

The main change is that we only check the keyword trie and additionally
the (newly introduced) compound keyword scanners only after the token
has been lexed as an identifier. For each context, we collect Identifier
scanners used by the keywords and attempt promotion there.

The existing lexing performance is not impacted from what I've seen when
running the sanctuary tests and I can verify (incl. CST tests) that we
now properly parse source that uses contextual keywords (e.g. `from`)
and that the compound keywords (e.g. `ufixedMxN`) are properly
versioned.

This adapts the existing `codegen_grammar` interface that's a leftover
from DSLv1; I did that to work on finishing #638; once this is merged
and we now properly parse contextual keywords, I'll move to clean it up
and reduce the parser codegen indirection (right now we go from v2 -> v1
model -> code generator -> Tera templates; it'd like to at least cut out
the v1 model and/or simplify visiting v2 from the existing
`CodeGenerator`).

Please excuse the WIP comments in the middle; the first and the last
ones should make sense when reviewing. I can simplify this a bit for
review, if needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Progress
2 participants