# treesitter ini, aws-config

This is a tree-sitter experience report.

## Why treesitter?

> Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. — ref

tree-sitter has nice docs that walk through this: https://tree-sitter.github.io/tree-sitter/creating-parsers

https://siraben.dev/2022/03/01/tree-sitter.html is another good tutorial and provides some motivation for treesitter.

## Experience report

"tree-sitter" should be named "treesitter". There is zero reason to choose a multi-word name if you can avoid it. Multi-word names manifest in various forms depending on the context: "treeSitter", "TreeSitter", "tree_sitter", "tree-sitter". This is worthless entropy.

Here's an excerpt of the C code for an `.ini` parser generated by tree-sitter. It has an API exposed as a C dynamic library, usable from any language. How did I do this? See below.

```
#include <tree_sitter/parser.h>
…
enum {
  …
  sym_setting_name = 6,
  sym_setting_value = 7,
  …
};
static const char *ts_symbol_names[] = {
  [sym_setting_name] = "setting_name",
  [sym_setting_value] = "setting_value",
  …
};
static bool ts_lex(TSLexer *lexer, TSStateId state) {
  START_LEXER();
  eof = lexer->eof(lexer);
  switch (state) {
    case 0:
      …
      END_STATE();
    …
    case 8:
      ACCEPT_TOKEN(aux_sym_section_name_token1);
      if (lookahead == '\n') ADVANCE(10);
      if (lookahead == '[' ||
          lookahead == ']') ADVANCE(19);
      if (lookahead != 0) ADVANCE(8);
      END_STATE();
    …
    case 14:
      ACCEPT_TOKEN(sym_setting_name);
      if (lookahead == '\n') ADVANCE(16);
      if (lookahead == '=') ADVANCE(19);
      if (lookahead != 0) ADVANCE(14);
      END_STATE();
    …
  }
}
static uint16_t ts_parse_table[LARGE_STATE_COUNT][SYMBOL_COUNT] = {
  [0] = {
    …
  },
  [1] = {
    …
    [sym_section] = STATE(3),
    [sym_section_name] = STATE(2),
  },
};
```

## Creating a new parser for the INI file type

```
$ npm init

# Install a small module that lets your parser be used from Node.
$ npm install --save nan

# Install (and depend on) the Tree-sitter CLI.
$ npm install --save-dev tree-sitter-cli
```

### Create the `grammar.js` stub

`grammar.js`:

```
module.exports = grammar({
  name: 'ini',
  rules: {
    // TODO: add the actual grammar rules
    source_file: $ => 'hello'
  }
});
```

### Generate the parser

```
$ ./node_modules/.bin/tree-sitter generate
```

That generates `src/parser.c` from the `grammar.js` we defined; tree-sitter projects typically run this from an `npm run build` script in `package.json` (a sketch follows the test below). The parser can be tested now:

```
$ echo hello > test
$ ./node_modules/.bin/tree-sitter parse test
(source_file [0, 0] - [1, 0])
```
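
As an aside, here's a minimal sketch of what that `package.json` wiring might look like (the name and script contents are illustrative, not from this repo):

```
{
  "name": "tree-sitter-ini",
  "scripts": {
    "build": "tree-sitter generate && node-gyp build",
    "test": "tree-sitter test"
  }
}
```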

It also generates some glue files to compile the parser's C source into a library and load it as a Node.js module. Cool!

- `binding.gyp` tells Node.js how to compile the parser's C source.
  - `tree-sitter generate` automatically compiles `parser.c` into a dynamically-loadable library.
- `index.js` is the entrypoint for Node.js to load the compiled C module.
- `src/binding.cc` wraps your parser in a JavaScript object for Node.js.
- `src/tree_sitter/parser.h` is a C header used by `parser.c`.
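
With those pieces in place, the parser can be driven from Node. A minimal sketch, assuming the `tree-sitter` runtime package (`npm install tree-sitter`) is installed next to this grammar; the require path is illustrative:

```
const Parser = require('tree-sitter');

const parser = new Parser();
// Load the grammar generated above (index.js/binding.gyp do the work).
// An installed grammar would be require('tree-sitter-ini') instead.
parser.setLanguage(require('./'));

// The stub grammar only matches the string 'hello'.
const tree = parser.parse('hello');
console.log(tree.rootNode.toString());  // (source_file)
```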

Tree-sitter will detect ambiguity. For example, this `grammar.js` is ambiguous:

```
module.exports = grammar({
  name: 'ini',
  rules: {
    document: $ => $._value,
    _value: $ => choice(
      $.a,
      $.b,
    ),
    a: $ => 'hello',
    b: $ => 'hello'
  }
});
```

Tree-sitter throws an `Unresolved conflict` error when it detects ambiguity:

```
$ ./node_modules/.bin/tree-sitter generate

Unresolved conflict for symbol sequence:

  'hello'  •  EOF  …

Possible interpretations:

  1:  (a  'hello')  •  EOF  …
  2:  (b  'hello')  •  EOF  …

Possible resolutions:

  1:  Specify a higher precedence in `a` than in the other rules.
  2:  Specify a higher precedence in `b` than in the other rules.
  3:  Add a conflict for these rules: `a`, `b`
```
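
For example, resolution 1 is a one-line change; a sketch of the same grammar with a static precedence added (`prec(1, …)` outranks the default precedence of 0, so `'hello'` now unambiguously parses as `(a)`):

```
module.exports = grammar({
  name: 'ini',
  rules: {
    document: $ => $._value,
    _value: $ => choice(
      $.a,
      $.b,
    ),
    // Higher precedence than `b` (which defaults to 0), resolving
    // the conflict in favor of `a`.
    a: $ => prec(1, 'hello'),
    b: $ => 'hello'
  }
});
```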

## Developing the grammar, part 1

One of my favorite things about tree-sitter is its use of S-expressions ("sexps") to define test cases and pretty-print the AST. S-expressions are the human-readable representation of the AST. They are similar to HTML/XML/SGML (but simpler and older): just a very simple way to express a tree of things.

Here's the first test for my "ini" grammar:

```
==================
Test 1
==================

foo = bar

---

()
```

It fails:

```
$ ./node_modules/.bin/tree-sitter test
main:
  ✗ Test 1

1 failure:

expected / actual

  1. Test 1:
    (ERROR (UNEXPECTED 'f')) ()
```

Now I can start developing the grammar. Tree-sitter provides a DSL to define grammars.

- `name`: name of the grammar.
- `rules`:
  - Symbols (the `$` object): use `$.foo` to refer to another grammar symbol within a rule.
  - String and regex literals: terminal symbols.
  - Sequences: `seq(rule1, rule2, …)` matches rules in the given order.
  - Alternatives: `choice(rule1, rule2, …)` matches any one rule in the set.
  - Repetitions: `repeat(rule)` matches zero or more rules.
    - `repeat1(rule)` matches one or more rules.
  - `optional(rule)` matches zero or one rule.
  - Precedence: `prec(number, rule)` marks a rule with a precedence to resolve ambiguities ("LR(1) conflicts").
    - `prec.left([number], rule)` marks a rule as left-associative, i.e. prefers a match that ends earlier.
    - `prec.right([number], rule)` prefers a match that ends later.
    - `prec.dynamic(number, rule)` decides precedence at runtime! Used with the `conflicts` field in the grammar, when there is a "genuine ambiguity" (multiple matching rules).
  - `token(rule)` marks a rule as producing a single token; the entire rule subtree will be handled atomically by the lexer, as if you had written it all using a single regular expression.
    - Its content will be represented in the final syntax tree as a single leaf node. (Whereas if you don't use `token()`, there will be a separate leaf node in the tree for each string literal or regexp in the subtree.)
  - `alias(rule, name)` renames a rule in the AST.
    - Example: `alias($.foo, $.bar)` appears as named node `bar`.
    - Example: `alias($.foo, 'bar')` appears as anonymous node `"bar"`.
  - `field(name, rule)` assigns a field name to the matching child nodes, which can be used to access the nodes from the AST. (`field`, `alias`, and `token` are combined in the sketch after this list.)
- `extras`: array of tokens that may appear anywhere. Often used for whitespace and comments. Example (also the default!):
  ```
  extras: $ => [
    /\s/
  ],
  ```
- `inline`: useful for rules used in multiple places but for which you don't want AST nodes.
- `conflicts`: defines intentional ambiguities (LR(1) conflicts), which tree-sitter will resolve at runtime (dynamic precedence).
- `externals`: token names that can be returned by an external scanner. Allows you to write custom C code that runs during lexing, to handle non-regular lexical rules (e.g. Python indentation).
- `word`: the token (typically an identifier rule) used for the keyword-extraction optimization.
- `supertypes`: hidden rules considered to be 'supertypes' in the generated node types file (useful for statically-typed languages).
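
To make the last few concrete, here is a hedged sketch of a hypothetical mini-grammar (not the ini grammar developed below) that combines `field()`, `alias()`, and `token()`:

```
module.exports = grammar({
  name: 'sketch',
  extras: $ => [$.comment, /\s/],
  rules: {
    document: $ => repeat($.pair),
    pair: $ => seq(
      field('key', $.key),                   // addressable via childForFieldName('key')
      '=',
      field('value', alias($.raw, $.value)), // appears as (value) in the tree
    ),
    key: $ => /[a-z]+/,
    raw: $ => /[^#\n]+/,
    comment: $ => token(seq('#', /.*/)),     // the whole seq lexes as one leaf node
  }
});
```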

Reading about seq() gives me enough to try a toy rule for the "ini" grammar:

- `grammar.js`:
  ```
  module.exports = grammar({
    name: 'ini',
    rules: {
      document: $ => $._value,
      _value: $ => choice(
        $.kvpair,
      ),
      kvpair: $ => seq(
        /[^=]+/, '=', /.+/
      ),
    }
  });
  ```
- `test/corpus/main.txt`:
  ```
  ==================
  Test 1
  ==================

  foo = bar

  ---

  (document (kvpair))
  ```

The test passes...

```
$ ./node_modules/.bin/tree-sitter generate && ./node_modules/.bin/tree-sitter test
main:
  ✓ Test 1
```

...but this initial grammar lacks granularity: it doesn't parse the individual parts of `kvpair`; we probably want separate "key" and "value" nodes. And it feels wrong to use a regex to exclude "="; that will probably be avoided more "formally" in the final grammar.

In fact, this section of the docs mentions two properties of a good parser:

  1. Intuitive structure. Because the grammar directly influences the structure of the AST!
  2. Adherence to LR(1). For optimal performance.

## Developing the grammar, part 2

Now that I have a toy grammar feeding into a passing test, I am ready to experiment.

- I see the tree-sitter-json grammar combines `seq(optional(choice(…)))` in a rule (even assigning rules to local variables) and returns a single rule via `return token(…)`.
- I see that tree-sitter-javascript includes comments in its `extras` directive.

ini files are very simple:

- they have three parts: sections, settings, and comments:
  ```
  [section 1]
  a = val1
  b = val2
  # comment
  ```
- a section might not have any "content", only a title:
  ```
  [section 1]
  ```

From there I'm able to introduce a `setting` rule, and define `section` in terms of `section_name` and `setting`.

```
module.exports = grammar({
  name: 'ini',
  extras: $ => [
    $.comment,
    /\s/,
  ],
  rules: {
    document: $ => repeat(
      $.section
    ),

    section: $ => {
      return seq(
        // Section must have a name.
        $.section_name,
        // Section has zero or more settings (name=value pairs).
        repeat($.setting),
      )
    },
    section_name: $ => seq(
      '[',
      /[^\[\]]+/,
      ']',
      '\n',     // Section name must be on its own line.
    ),
    setting: $ => seq(
      $.setting_name,
      '=',
      $.setting_value,
      '\n',     // Only one setting per line.
    ),
    setting_name: $ => /[^=]+/,
    setting_value: $ => /.+/,

    comment: $ => token(
      seq('#', /.*/),
    ),
  }
});
```

That passes this updated test:

```
==================
Test 1
==================

[a section title]
foo = bar

---

(document
  (section
    (section_name)
    (setting (setting_name) (setting_value))))
```

## Actually use it!

Now we can actually parse stuff, such as this AWS config example:

File `test-awsconfig`:

```
[default]
region = us-west-2
output = json

[profile dev-user]
region = us-east-1
output = text

[profile developers]
role_arn = arn:aws:iam::123456789012:role/developers
source_profile = dev-user
region = us-west-2
output = json
```

Parse it:

```
$ ./node_modules/.bin/tree-sitter parse test-awsconfig
(document [0, 0] - [13, 0]
  (section [0, 0] - [13, 0]
    (section_name [0, 0] - [1, 0])
    (setting [1, 0] - [2, 0]
      (setting_name [1, 0] - [1, 7])
      (setting_value [1, 8] - [1, 18]))
    (setting [2, 0] - [4, 0]
      (setting_name [2, 0] - [2, 7])
      (setting_value [2, 8] - [2, 13]))
    (setting [4, 0] - [6, 0]
      (setting_name [4, 0] - [5, 7])
      (setting_value [5, 8] - [5, 18]))
    (setting [6, 0] - [8, 0]
      (setting_name [6, 0] - [6, 7])
      (setting_value [6, 8] - [6, 13]))
    (setting [8, 0] - [10, 0]
      (setting_name [8, 0] - [9, 9])
      (setting_value [9, 10] - [9, 52]))
    (setting [10, 0] - [11, 0]
      (setting_name [10, 0] - [10, 15])
      (setting_value [10, 16] - [10, 25]))
    (setting [11, 0] - [12, 0]
      (setting_name [11, 0] - [11, 7])
      (setting_value [11, 8] - [11, 18]))
    (setting [12, 0] - [13, 0]
      (setting_name [12, 0] - [12, 7])
      (setting_value [12, 8] - [12, 13]))))
```
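
Beyond pretty-printing, the tree can be consumed programmatically. A hedged sketch using the Node bindings (the require path is illustrative; `.trim()` because this grammar's loose regexes capture surrounding whitespace, as the spans above show):

```
const fs = require('fs');
const Parser = require('tree-sitter');

const parser = new Parser();
parser.setLanguage(require('./'));  // the ini grammar built above

const tree = parser.parse(fs.readFileSync('test-awsconfig', 'utf8'));

// Print every setting as "name = value", section by section.
for (const section of tree.rootNode.namedChildren) {
  for (const child of section.namedChildren) {
    if (child.type !== 'setting') continue;  // skip the section_name node
    const [name, value] = child.namedChildren;
    console.log(`${name.text.trim()} = ${value.text.trim()}`);
  }
}
```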

## Error handling

One of the most valuable features of a parser-generator is its ability to flag errors and report where the error occurred in the source code.

An ini file without a section heading is an error:

```
foo = bar
```

```
$ ./node_modules/.bin/tree-sitter parse test-awsconfig
(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 9]
    (ERROR [0, 0] - [0, 3])
    (ERROR [0, 6] - [0, 9])))
```
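
Those ERROR nodes are also visible from the bindings, so tooling can report positions itself. A sketch (the helper is my own, not part of the tree-sitter API; `tree` as in the earlier sketch):

```
// Recursively collect every ERROR node in the tree (hypothetical helper).
function collectErrors(node, errors = []) {
  if (node.type === 'ERROR') errors.push(node);
  for (const child of node.children) collectErrors(child, errors);
  return errors;
}

for (const err of collectErrors(tree.rootNode)) {
  const { row, column } = err.startPosition;
  console.log(`syntax error at line ${row + 1}, column ${column + 1}`);
}
```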

## Preventing duplicate symbols

From an early tree-sitter discussion, author Max Brunsfeld:

> Thus far, I don't really consider "type II correctness" to be a goal for Tree-sitter, because I think it's fundamentally incompatible with one of the most important goals of the project: to be able to parse source code files based solely on their language, without knowing any auxiliary information about the files.
>
> For example, tree-sitter-python should be able to parse any python code you can find, without knowing what python version the code is meant to run on. That means it needs to parse the union of Python 2 and Python 3.

## Capturing whitespace

Whitespace will be included in all nodes if you have any `$.word` node that captures leading whitespace.