Add 'reserved word' construct again, with a better API #3896

@maxbrunsfeld maxbrunsfeld commented Nov 8, 2024

This is a third attempt to solve the problem described in #246. It supersedes #1635.

Background

Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. In general, Tree-sitter is permissive in allowing words that are keywords in some places to be used freely as names in other places.

Sometimes this permissiveness causes unexpected error recoveries. Consider this syntax error in Rust:

fn main() {
    a.  // <- incomplete

    if b {
        c();
    }
}

Currently, when tree-sitter-rust encounters this code, it doesn't detect an error until the word b, because it interprets the word if as a field/method name on the a object. It doesn't see if as a keyword, because the keyword if would not be valid in that position.

Because the error is detected too late, it's not possible to recover well. Tree-sitter fails to recognize the if_statement, and sees it instead as a continuation of the expression above:

rust tree with bad recovery
(source_file [0, 0] - [7, 0]
  (function_item [0, 0] - [5, 3]
    name: (identifier [0, 3] - [0, 7])
    parameters: (parameters [0, 7] - [0, 9])
    body: (block [0, 10] - [5, 3]
      (field_expression [1, 2] - [3, 4]
        value: (identifier [1, 2] - [1, 3])
        field: (field_identifier [3, 2] - [3, 4]))
      (call_expression [3, 5] - [4, 7]
        function: (struct_expression [3, 5] - [4, 5]
          name: (type_identifier [3, 5] - [3, 6])
          body: (field_initializer_list [3, 7] - [4, 5]
            (shorthand_field_initializer [4, 4] - [4, 5]
              (identifier [4, 4] - [4, 5]))))
        arguments: (arguments [4, 5] - [4, 7]))))
  (ERROR [6, 0] - [6, 1]))

The reserved property

In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are reserved. That is, even in positions where they are not technically valid, they should still be recognized as separate from any other tokens that would match that string (such as an identifier). In Rust, most keywords like if and let are reserved in all contexts.

This PR introduces a new top-level property on grammars called reserved, which is an object much like a grammar's rules property. In this object, the first property represents the global reserved rules, so typically this should be called "global", though, much like the rules property's start rule, any name works.

module.exports = grammar({
  name: 'rust',

  reserved: {
    global: $ => [
      "enum",
      "fn",
      "for",
      "if",
      "let",
      "loop",
      "match",
      "mod",
      "struct",
      "type",
      "while",
    ],
  },

  // rest of grammar...
});

With this new feature, parsing the same Rust code as above detects the error at the correct time (at the if token), because Tree-sitter still treats if as a keyword and not an identifier, even though the keyword is unexpected. This makes error recovery much better: the entire if_expression is preserved, and only the incomplete a. line is marked as an error.

rust tree with good recovery
(source_file [0, 0] - [7, 0]
  (function_item [0, 0] - [6, 1]
    name: (identifier [0, 3] - [0, 7])
    parameters: (parameters [0, 7] - [0, 9])
    body: (block [0, 10] - [6, 1]
      (ERROR [1, 2] - [1, 4]
        (identifier [1, 2] - [1, 3]))
      (if_expression [3, 2] - [5, 3]
        condition: (identifier [3, 5] - [3, 6])
        consequence: (block [3, 7] - [5, 3]
          (call_expression [4, 4] - [4, 7]
            function: (identifier [4, 4] - [4, 5])
            arguments: (arguments [4, 5] - [4, 7])))))))
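
The difference between the two trees comes down to how a single word is classified in the parse state right after `a.`. The following is a toy model of that decision, not Tree-sitter's actual lexer: the function `classifyWord`, its parameters, and the result shape are all hypothetical, and the sketch assumes the state accepts the grammar's word token (here, an identifier).

```javascript
// Toy model only: this is NOT Tree-sitter's lexer. All names are hypothetical.

// Classify a word in a given parse state.
//   validKeywords: keywords the parser could accept in this state
//   reserved:      words that stay keywords even where they are not valid
// Assumes the state accepts the word token, so a plain identifier is fine.
function classifyWord(word, validKeywords, reserved = new Set()) {
  if (validKeywords.has(word)) return { type: word, unexpected: false };
  if (reserved.has(word)) return { type: word, unexpected: true };
  return { type: 'identifier', unexpected: false };
}

// State right after `a.`: a field name is expected, so no keyword is valid.
const afterDot = new Set();
const globalReserved = new Set(['fn', 'if', 'let', 'while']);

// Without reserved words, `if` silently becomes a field name.
const plain = classifyWord('if', afterDot);
// With reserved words, `if` stays a keyword and the error surfaces right here.
const strict = classifyWord('if', afterDot, globalReserved);

console.log(plain);  // { type: 'identifier', unexpected: false }
console.log(strict); // { type: 'if', unexpected: true }
```

In the second case the parser sees an unexpected keyword immediately, which is what lets recovery discard only the incomplete line.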

Contextual Reserved Words

Many languages have a more complex system of reserved words, in which words are reserved in some contexts, but not others. For example, in JavaScript, the word if cannot be used in a local declaration or an expression, but it can be used as the name of an object property:

var a = {if: true}; // <- valid
var b =             // <- incomplete
if (c) {            // <- error should be detected at this "if"
  a.if();           // <- valid
}

The current version of tree-sitter-javascript will treat the if properties as valid, which is correct, but it will fail to detect the error on the if token on line 3, similarly to the Rust example described above.

javascript tree with bad error recovery
(program [0, 0] - [6, 0]
  (variable_declaration [0, 0] - [0, 19]
    (variable_declarator [0, 4] - [0, 18]
      name: (identifier [0, 4] - [0, 5])
      value: (object [0, 8] - [0, 18]
        (pair [0, 9] - [0, 17]
          key: (property_identifier [0, 9] - [0, 11])
          value: (true [0, 13] - [0, 17])))))
  (comment [0, 20] - [0, 31])
  (variable_declaration [1, 0] - [2, 6]
    (variable_declarator [1, 4] - [2, 6]
      name: (identifier [1, 4] - [1, 5])
      (comment [1, 20] - [1, 36])
      value: (call_expression [2, 0] - [2, 6]
        function: (identifier [2, 0] - [2, 2])
        arguments: (arguments [2, 3] - [2, 6]
          (identifier [2, 4] - [2, 5])))))
  (statement_block [2, 7] - [4, 1]
    (comment [2, 20] - [2, 63])
    (expression_statement [3, 2] - [3, 9]
      (call_expression [3, 2] - [3, 8]
        function: (member_expression [3, 2] - [3, 6]
          object: (identifier [3, 2] - [3, 3])
          property: (property_identifier [3, 4] - [3, 6]))
        arguments: (arguments [3, 6] - [3, 8])))
    (comment [3, 20] - [3, 31])))
test.js	3 ms	(MISSING ";" [2, 6] - [2, 6])

In order to allow the valid usages, but detect the invalid ones, the grammar author needs a way to indicate in the JavaScript grammar that if is normally a reserved word, but it is still allowed in property names.

The reserved Grammar Rule

In addition to the top-level reserved property, this PR also introduces a new rule function, reserved(reservedWordSetName, rule), which lets you override the set of reserved words in a certain context. In the case of JavaScript, we actually want to remove all reserved words in the context of object properties:

grammar({
  name: 'javascript',
  
  reserved: {
    global: $ => [
      'const',
      'do',
      'else',
      'finally',
      'for',
      'function',
      'if',
      'let',
      'return',
      'throw',
      'var',
      'while',
    ],
    properties: $ => [],
  },

  rules: {
    // ...
    
    _property_name: $ => reserved('properties', choice(
      alias($.identifier, $.property_name),
      // ...
    )),
  }
});

Here, we call reserved() with 'properties', the name of a property in the reserved object, to indicate that there are no reserved words in that context. In other use cases, we might pass the name of a different, non-empty set of reserved words. Whichever name is passed as the first argument of reserved, that set of reserved words is the one used within the wrapped rule.

Details

  • Reserved Word Semantics - Right now, reserved words only take effect in parse states where the grammar's word token would be valid. For example, when inside of a string literal, the reserved words won't cause the lexer to recognize the contents of a string as an if keyword.
  • Using the Highest Reserved Word Set - In many parse states, there are multiple possible rules that could be in progress. Right now, the reserved word set for such a parse state is the highest-precedence set among those associated with each in-progress rule. You can think of each reserved word set as having an implicit precedence that starts at 0 and increases by one for each subsequent set in the reserved object, so the global set always has a precedence of 0, and properties in the JavaScript example has a precedence of 1.
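
The "highest reserved word set" rule can be sketched as a small lookup. This is only an illustration of the rule as described above, not the generator's code: `reservedSets` and `reservedSetForState` are hypothetical names, and falling back to the first set when no rule names one is an assumption of this sketch.

```javascript
// Sketch of the "highest reserved word set wins" rule.
// Hypothetical helper, not Tree-sitter's generator code.

// Sets are listed in declaration order, so the array index doubles as the
// implicit precedence: global is 0, properties is 1.
const reservedSets = [
  { name: 'global', words: ['const', 'if', 'let', 'return', 'var', 'while'] },
  { name: 'properties', words: [] },
];

// Given the set names attached to each in-progress rule in a parse state,
// pick the set that was declared latest.
function reservedSetForState(setNames) {
  let best = 0; // assumption: the first (global) set applies by default
  for (const name of setNames) {
    const index = reservedSets.findIndex(set => set.name === name);
    if (index === -1) throw new Error(`unknown reserved word set: ${name}`);
    best = Math.max(best, index);
  }
  return reservedSets[best];
}

console.log(reservedSetForState(['global']).name);               // global
console.log(reservedSetForState(['global', 'properties']).name); // properties
```

Under this rule, a state where a property name could be in progress ends up with the empty properties set, so no word is reserved there even if another in-progress rule uses the global set.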

@maxbrunsfeld maxbrunsfeld changed the title from "Add `reserved word' construct again, with a better API" to "Add 'reserved word' construct again, with a better API" Nov 8, 2024
@maxbrunsfeld

Looking at these test failures, I see that we still need to implement loading of the reserved-word-related fields on TSLanguage from wasm. Should be pretty straightforward.
