-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 'reserved word' construct again, with a better API #3896
Open
maxbrunsfeld
wants to merge
16
commits into
master
Choose a base branch
from
new-reserved-word-structure
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+2,551
−1,731
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
maxbrunsfeld
changed the title
Add `reserved word' construct again, with a better API
Add 'reserved word' construct again, with a better API
Nov 8, 2024
maxbrunsfeld
force-pushed
the
new-reserved-word-structure
branch
from
November 15, 2024 19:57
747b67f
to
d61aa8b
Compare
amaanq
force-pushed
the
new-reserved-word-structure
branch
from
November 22, 2024 20:36
4de1466
to
708f10f
Compare
4 tasks
Looking at these test failures, I see that we still need to implement loading of the reserved-word-related fields on |
amaanq
force-pushed
the
new-reserved-word-structure
branch
from
November 23, 2024 04:13
9490585
to
08fe860
Compare
Co-authored-by: Amaan <[email protected]>
Co-authored-by: Amaan <[email protected]>
…uction Co-authored-by: Amaan <[email protected]>
amaanq
force-pushed
the
new-reserved-word-structure
branch
from
November 23, 2024 04:23
08fe860
to
5956ab1
Compare
This was referenced Nov 23, 2024
amaanq
force-pushed
the
new-reserved-word-structure
branch
from
November 23, 2024 06:28
8ffed93
to
ebb9df5
Compare
This was referenced Nov 23, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a third attempt to solve a problem described in #246, which now supersedes #1635.
Background
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. In general, Tree-sitter is permissive in allowing words that are keywords in some places to be used freely as names in other places.
Sometimes this permissiveness causes unexpected error recoveries. Consider this syntax error in Rust:
Currently, when tree-sitter-rust encounters this code, it doesn't detect an error until the word
b
, because it interprets the wordif
as a field/method name on thea
object. It doesn't seeif
as a keyword, because the keywordif
would not be valid in that position.Because the error is detected too late, it's not possible to recover well. Tree-sitter fails to recognize the
if_statement
, and sees it instead as a continuation of the expression above:rust tree with bad recovery
The
reserved
propertyIn order improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are reserved. That is - even if they are not technically valid, they should still be recognized as separate from any other tokens that would match that string (such as an
identifier
). In Rust, most keywords likeif
andlet
are reserved in all contexts.This PR introduces a new top-level property on grammars called
reserved
, which is an object much like a grammar'srules
property. In this object, the first property represents the global reserved rules, so typically this should be called "global", though, much like therules
property's start rule, any name works.When using this new feature, and parsing the same rust code as above, the error is now detected at the correct time (at the
if
token), because Tree-sitter still treatsif
as a keyword and not an identifier, even though the keyword is unexpected. This allows error recovery to be much better: preserving the entireif_statement
, and marking the incompletea.
line as an error.rust tree with good recovery
Contextual Reserved Words
Many languages have a more complex system of reserved words, in which words are reserved in some contexts, but not others. For example, in JavaScript, the word
if
cannot be used in a local declaration or an expression, but it can be used as the name of an object property:The current version of tree-sitter-javascript will treat the
if
properties as valid, which is correct, but it will fail to detect the error on theif
token on line 3, similarly to the Rust example described above.javscript tree with bad error recovery
In order to allow the valid usages, but detect the invalid ones, the grammar author needs a way to indicate in the JavaScript grammar that
if
is normally a reserved word, but it is still allowed in property names.The
reserved
Grammar RuleIn addition to the top-level
reserved
property, this PR also introduces a new rule function,reserved(reservedWordSetName, rule)
, which lets you override the set of reserved words in a certain context. In the case of JavaScript, we actually want to remove all reserved words in the context of object properties:In this particular case, we call
reserved()
with the name of the property in thereserved
object, in this case, 'properties', to indicate that there are no reserved words in that context. In other use cases, we might pass an alternative set of reserved words. The property name that's passed into the first argument ofreserved
has its reserved word set used in this context.Details
word
token would be valid. For example, when inside of a string literal, the reserved words won't cause the lexer to recognize the contents of a string as anif
keyword.global
set always has a precedence of 0, andproperties
in the JavaScript example has a precedence of 1.