This is a tree-sitter experience report.
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. — ref
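A toy sketch of the idea (an invented two-token example, not how tree-sitter's generated lexer is actually structured): the lexer only tries the token patterns that are valid in the current parse state, so the same text can lex differently in different states.

```javascript
// Toy context-aware tokenizer. `validTokens` plays the role of the parse
// state: only the listed token patterns are attempted.
function nextToken(input, validTokens) {
  const patterns = {
    keyword: /^for\b/,
    identifier: /^[a-z]+/,
  };
  for (const name of validTokens) {
    const m = input.match(patterns[name]);
    if (m) return { type: name, text: m[0] };
  }
  return null;
}

// In a state that allows keywords, 'for' lexes as a keyword...
console.log(nextToken('for x', ['keyword', 'identifier']).type); // keyword
// ...in a state that only allows identifiers, the same text is an identifier.
console.log(nextToken('for x', ['identifier']).type); // identifier
```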
tree-sitter has nice docs that walk through this: https://tree-sitter.github.io/tree-sitter/creating-parsers
https://siraben.dev/2022/03/01/tree-sitter.html is another good tutorial and provides some motivation for treesitter.
"tree-sitter" should be named "treesitter". There is zero reason to choose a multi-word name if you can avoid it. Multi-word names manifest in various forms depending on the context: "treeSitter", "TreeSitter", "tree_sitter", "tree-sitter". This is worthless entropy.
Here's the C code for an .ini parser generated by tree-sitter. It exposes an API as a C dynamic library, usable from any language. How did I do this? See below.
#include <tree_sitter/parser.h>
…
enum {
  …
  sym_setting_name = 6,
  sym_setting_value = 7,
  …
};

static const char *ts_symbol_names[] = {
  [sym_setting_name] = "setting_name",
  [sym_setting_value] = "setting_value",
  …
};

static bool ts_lex(TSLexer *lexer, TSStateId state) {
  START_LEXER();
  eof = lexer->eof(lexer);
  switch (state) {
    case 0:
      …
      END_STATE();
    …
    case 8:
      ACCEPT_TOKEN(aux_sym_section_name_token1);
      if (lookahead == '\n') ADVANCE(10);
      if (lookahead == '[' ||
          lookahead == ']') ADVANCE(19);
      if (lookahead != 0) ADVANCE(8);
      END_STATE();
    …
    case 14:
      ACCEPT_TOKEN(sym_setting_name);
      if (lookahead == '\n') ADVANCE(16);
      if (lookahead == '=') ADVANCE(19);
      if (lookahead != 0) ADVANCE(14);
      END_STATE();
    …
  }
}

static uint16_t ts_parse_table[LARGE_STATE_COUNT][SYMBOL_COUNT] = {
  [0] = {
    …
  },
  [1] = {
    …
    [sym_section] = STATE(3),
    [sym_section_name] = STATE(2),
  },
};
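Case 14 of the generated `ts_lex` above can be loosely hand-translated to show what the macros do (a simplification, since the real lexer transitions into further states like 16 and 19 rather than stopping):

```javascript
// Rough JS rendering of lexer state 14: accept sym_setting_name and keep
// advancing until '\n', '=' or end of input. ACCEPT_TOKEN/ADVANCE become a
// plain scan loop here.
function lexSettingName(input) {
  let i = 0;
  while (i < input.length && input[i] !== '\n' && input[i] !== '=') i++;
  return input.slice(0, i); // the accepted sym_setting_name text
}

console.log(lexSettingName('region = us-west-2')); // 'region ' (stops at '=')
```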
$ npm init
# Install a small module that lets your parser be used from Node.
$ npm install --save nan
# Install (and depend-on) the Tree-sitter CLI.
$ npm install --save-dev tree-sitter-cli
`grammar.js`:
module.exports = grammar({
  name: 'ini',
  rules: {
    // TODO: add the actual grammar rules
    source_file: $ => 'hello'
  }
});
$ ./node_modules/.bin/tree-sitter generate
That generates `src/parser.c` from the `grammar.js` that we defined (typically tree-sitter projects put this in their `npm run build` script in `package.json`). The parser can be tested now:
$ echo hello > test
$ ./node_modules/.bin/tree-sitter parse test
(source_file [0, 0] - [1, 0])
It also generates some glue files to compile the parser's C source into a library and load it as a Node.js module. Cool!
- `binding.gyp` tells Node.js how to compile the parser's C source. `tree-sitter generate` automatically compiles `parser.c` into a dynamically-loadable library.
- `index.js` is the entrypoint for Node.js to load the compiled C module.
- `src/binding.cc` wraps your parser in a JavaScript object for Node.js.
- `src/tree_sitter/parser.h` is a C header included by `parser.c`.
Tree-sitter will detect ambiguity. For example, this `grammar.js` is ambiguous:
module.exports = grammar({
  name: 'ini',
  rules: {
    document: $ => $._value,
    _value: $ => choice(
      $.a,
      $.b,
    ),
    a: $ => 'hello',
    b: $ => 'hello'
  }
});
Tree-sitter throws an `Unresolved conflict` error when it detects ambiguity:
$ ./node_modules/.bin/tree-sitter generate
Unresolved conflict for symbol sequence:
'hello' • EOF …
Possible interpretations:
1: (a 'hello') • EOF …
2: (b 'hello') • EOF …
Possible resolutions:
1: Specify a higher precedence in `a` than in the other rules.
2: Specify a higher precedence in `b` than in the other rules.
3: Add a conflict for these rules: `a`, `b`
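A sketch of resolution 1 (hypothetical precedence values; minimal stand-ins for the DSL functions are included so the snippet runs standalone, and their object shapes only roughly mirror the `grammar.json` tree-sitter generates):

```javascript
// Stand-ins for the tree-sitter DSL (assumption: simplified shapes).
const sym = name => ({ type: 'SYMBOL', name });
const choice = (...members) => ({ type: 'CHOICE', members });
const prec = (value, content) => ({ type: 'PREC', value, content });

// Resolution 1: give `a` higher precedence than `b`, so the
// 'hello' • EOF conflict is resolved in favor of (a 'hello').
const rules = {
  document: choice(sym('a'), sym('b')),
  a: prec(1, 'hello'),
  b: 'hello', // default precedence 0
};

console.log(rules.a.value); // 1 (outranks b's default of 0)
```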
One of my favorite things about tree-sitter is its use of S-expressions ("sexps") to define test-cases and pretty-print the AST. S-expressions are the human-readable representation of the AST. They are similar to HTML/XML/SGML (but simpler and older): just a very simple way to express a tree of things.
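The mapping from a tree to its sexp form is simple enough to sketch in a few lines (assumption: trees modeled as plain `{type, children}` objects, a simplification of real tree-sitter nodes):

```javascript
// Recursively print a {type, children} tree as an S-expression.
function toSexp(node) {
  if (!node.children || node.children.length === 0) return `(${node.type})`;
  return `(${node.type} ${node.children.map(toSexp).join(' ')})`;
}

const tree = {
  type: 'document',
  children: [{ type: 'kvpair', children: [] }],
};
console.log(toSexp(tree)); // (document (kvpair))
```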
Here's the first test for my "ini" grammar:
==================
Test 1
==================
foo = bar
---
()
It fails:
$ ./node_modules/.bin/tree-sitter test
main:
✗ Test 1
1 failure:
expected / actual
1. Test 1:
(ERROR (UNEXPECTED 'f')) ()
Now I can start developing the grammar. Tree-sitter provides a DSL to define grammars.
- `name`: Name of the grammar.
- `rules`:
  - Symbols (the `$` object): use `$.foo` to refer to another grammar symbol within a rule.
  - String and regex literals: terminal symbols.
  - Sequences: `seq(rule1, rule2, …)` matches rules in the given order.
  - Alternatives: `choice(rule1, rule2, …)` matches any one rule in the set.
  - Repetitions: `repeat(rule)` matches zero-or-more rules. `repeat1(rule)` matches one-or-more rules. `optional(rule)` matches zero-or-one rule.
  - Precedence: `prec(number, rule)` marks a rule with a precedence to resolve ambiguities ("LR(1) conflicts"). `prec.left([number], rule)` marks a rule as left-associative, i.e. prefers a match that ends earlier. `prec.right([number], rule)` prefers a match that ends later. `prec.dynamic(number, rule)` decides precedence at runtime! Used with the `conflicts` field in the grammar, when there is a "genuine ambiguity" (multiple matching rules).
  - `token(rule)` marks a rule as producing a single token; the entire rule subtree will be handled atomically by the lexer, as if you had written it all using a single regular expression. Its content will be represented in the final syntax tree as a single leaf node. (Whereas if you don't use `token`, there will be a separate leaf node in the tree for each string literal or regexp in the subtree.)
  - `alias(rule, name)` renames a rule in the AST.
    - Example: `alias($.foo, $.bar)` appears as named node `bar`.
    - Example: `alias($.foo, 'bar')` appears as anonymous node "bar".
  - `field(name, rule)` assigns a field name to the matching child nodes, which can be used to access the nodes from the AST.
- `extras`: array of tokens that may appear anywhere. Often used for whitespace and comments. Example (also the default!): `extras: $ => [ /\s/ ],`
- `inline`: useful for rules used in multiple places but for which you don't want AST nodes.
- `conflicts`: defines intentional ambiguities (LR(1) conflicts), which tree-sitter will resolve at runtime (dynamic precedence).
- `externals`: token names which can be returned by an external scanner. Allows you to write custom C code which runs during the lexing process to handle non-regular lexical rules (e.g. Python indentation).
- `word`: matches keywords for keyword extraction optimization.
- `supertypes`: hidden rules considered to be 'supertypes' in the generated node types file (useful for statically-typed languages).
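To build intuition for what `seq`, `choice`, `repeat`, and `optional` mean, here's a toy interpretation of them as string matchers (a deliberate simplification; real tree-sitter compiles rules into an LR(1) table instead of matching directly):

```javascript
// Each toy "rule" takes the remaining input and returns the rest of the
// string on a match, or null on failure.
const lit = s => input =>
  input.startsWith(s) ? input.slice(s.length) : null;
const seq = (...rules) => input => {
  for (const r of rules) {
    if (input === null) return null;
    input = r(input);
  }
  return input;
};
const choice = (...rules) => input => {
  for (const r of rules) {
    const rest = r(input);
    if (rest !== null) return rest;
  }
  return null;
};
const repeat = rule => input => {
  let rest;
  while ((rest = rule(input)) !== null) input = rest;
  return input; // zero-or-more: always succeeds
};
const optional = rule => input => {
  const rest = rule(input);
  return rest === null ? input : rest;
};

// kvpair ≈ seq(key, '=', value), with literal key/value for brevity:
const kvpair = seq(lit('foo'), lit('='), lit('bar'));
console.log(kvpair('foo=bar')); // '' (matched the whole input)
console.log(kvpair('foo:bar')); // null (no match)
```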
Reading about `seq()` gives me enough to try a toy rule for the "ini" grammar:
- `grammar.js`:
  ```
  module.exports = grammar({
    name: 'ini',
    rules: {
      document: $ => $._value,
      _value: $ => choice(
        $.kvpair,
      ),
      kvpair: $ => seq(
        /[^=]+/, '=', /.+/
      ),
    }
  });
  ```
- `test/corpus/main.txt`:
```
==================
Test 1
==================
foo = bar
---
(document (kvpair))
```
The test passes...
$ ./node_modules/.bin/tree-sitter generate && ./node_modules/.bin/tree-sitter test
main:
✓ Test 1
...but this initial grammar lacks granularity: it doesn't parse the individual parts of `kvpair`; we probably want "key" and "value". And it feels wrong to use a regex to exclude "="; that will probably be handled more formally in the final grammar.
In fact, this section of the docs mentions two properties of a good parser:
- Intuitive structure. Because the grammar directly influences the structure of the AST!
- Adherence to LR(1). For optimal performance.
Now that I have a toy grammar feeding into a passing test, I am ready to experiment.
- I see the tree-sitter-json grammar combines `seq(optional(choice(…)))` in a rule--even assigning rules to local variables--and returns a single rule via `return token(…)`.
- I see that tree-sitter-javascript includes comments in its `extras` directive.
ini files are very simple:
- they have three parts: sections, settings, and comments:

  [section 1]
  a = val1
  b = val2
  # comment

- a section might not have any "content", only a title:

  [section 1]
From there I'm able to introduce a `setting` rule, and define `section` in terms of `section_name` and `setting`.
module.exports = grammar({
  name: 'ini',
  extras: $ => [
    $.comment,
    /\s/,
  ],
  rules: {
    document: $ => repeat(
      $.section
    ),
    section: $ => {
      return seq(
        // Section must have a name.
        $.section_name,
        // Section has zero or more settings (name=value pairs).
        repeat($.setting),
      )
    },
    section_name: $ => seq(
      '[',
      /[^\[\]]+/,
      ']',
      '\n', // Section name must be on its own line.
    ),
    setting: $ => seq(
      $.setting_name,
      '=',
      $.setting_value,
      '\n', // Only one setting per line.
    ),
    setting_name: $ => /[^=]+/,
    setting_value: $ => /.+/,
    comment: $ => token(
      seq('#', /.*/),
    ),
  }
});
That passes this updated test:
==================
Test 1
==================
[a section title]
foo = bar
---
(document
(section
(section_name)
(setting (setting_name) (setting_value))))
Now we can actually parse stuff! Such as this AWS config example:
File `test-awsconfig`:
[default]
region = us-west-2
output = json
[profile dev-user]
region = us-east-1
output = text
[profile developers]
role_arn = arn:aws:iam::123456789012:role/developers
source_profile = dev-user
region = us-west-2
output = json
Parse it:
$ ./node_modules/.bin/tree-sitter parse test-awsconfig
(document [0, 0] - [13, 0]
  (section [0, 0] - [13, 0]
    (section_name [0, 0] - [1, 0])
    (setting [1, 0] - [2, 0]
      (setting_name [1, 0] - [1, 7])
      (setting_value [1, 8] - [1, 18]))
    (setting [2, 0] - [4, 0]
      (setting_name [2, 0] - [2, 7])
      (setting_value [2, 8] - [2, 13]))
    (setting [4, 0] - [6, 0]
      (setting_name [4, 0] - [5, 7])
      (setting_value [5, 8] - [5, 18]))
    (setting [6, 0] - [8, 0]
      (setting_name [6, 0] - [6, 7])
      (setting_value [6, 8] - [6, 13]))
    (setting [8, 0] - [10, 0]
      (setting_name [8, 0] - [9, 9])
      (setting_value [9, 10] - [9, 52]))
    (setting [10, 0] - [11, 0]
      (setting_name [10, 0] - [10, 15])
      (setting_value [10, 16] - [10, 25]))
    (setting [11, 0] - [12, 0]
      (setting_name [11, 0] - [11, 7])
      (setting_value [11, 8] - [11, 18]))
    (setting [12, 0] - [13, 0]
      (setting_name [12, 0] - [12, 7])
      (setting_value [12, 8] - [12, 13]))))
One of the most valuable features of a parser-generator is its ability to flag errors and report where the error occurred in the source code.
An ini file without a section heading is an error.
foo = bar
$ ./node_modules/.bin/tree-sitter parse test-awsconfig
(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 9]
    (ERROR [0, 0] - [0, 3])
    (ERROR [0, 6] - [0, 9])))
From an early tree-sitter discussion, author Max Brunsfeld:
Thus far, I don't really consider "type II correctness" to be a goal for Tree-sitter, because I think it's fundamentally incompatible with one of the most important goals of the project: to be able to parse source code files based solely on their language, without knowing any auxiliary information about the files.
For example, tree-sitter-python should be able to parse any python code you can find, without knowing what python version the code is meant to run on. That means it needs to parse the union of Python 2 and Python 3.
Whitespace will be included in all nodes if you have any `$.word` node that captures leading whitespace.
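A toy regex illustration of that caveat (invented patterns, not real tree-sitter output):

```javascript
// If the token pattern itself captures leading whitespace, the whitespace
// ends up inside the token (and thus inside the node's range).
const greedyWord = /\s*\w+/; // captures leading whitespace
const tightWord = /\w+/;     // lets `extras` handle whitespace instead

console.log(JSON.stringify('  foo'.match(greedyWord)[0])); // "  foo"
console.log(JSON.stringify('  foo'.match(tightWord)[0]));  // "foo"
```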