Expressing permutations #130

Closed
p-e-w opened this issue Feb 11, 2018 · 8 comments
p-e-w commented Feb 11, 2018

Tree-sitter currently thinks that

const const const const const const int a = 0;

is valid C, and that

urbuuuubr"Hello, world!"

is a Python string literal. The root cause of both issues is the difficulty of expressing a permutation of rules, which is why the rule matching a string literal in the Python grammar starts with

string: $ => token(seq(
  repeat(choice('u', 'r', 'b')),
  ...

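To see why this overmatches: inside a token, repeat(choice('u', 'r', 'b')) is equivalent to the regular expression /[urb]*/, which accepts arbitrary repetition of the flags. A self-contained illustration (a simplification, not the grammar's actual compiled output):

```javascript
// The prefix part of the string rule above, written as the regex it is
// equivalent to. [urb]* permits each flag any number of times, in any order.
const prefix = /^[urb]*/;

// Greedily matches the entire invalid prefix from the example above:
console.log(prefix.exec('urbuuuubr"Hello, world!"')[0]); // "urbuuuubr"
```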
Qualifiers, access modifiers, and prefixes are tokens that programming languages commonly allow to occur in any order, but without repetition. It would be nice to be able to express such constructs concisely in tree-sitter grammars, perhaps as

string: $ => token(seq(
  permutation(optional('u'), optional('r'), optional('b')),
  ...

This is something most parser generators struggle with, owing to a theoretical limitation of CFGs: describing a permutation requires combinatorial expansion. However, tree-sitter already supports (non-CFG) externally parsed rules, and I wonder whether it might be possible to track the required state directly in the parser library, supporting PERMUTATION as a new rule primitive.

If that is not feasible, I suggest the addition of a permutation rule to the DSL that simply expands

permutation(a, b, c)

into

choice(
  seq(a, b, c),
  seq(a, c, b),
  seq(b, a, c),
  seq(c, a, b),
  seq(b, c, a),
  seq(c, b, a)
)

An n-argument permutation expands into n! alternatives, so when the number of arguments is less than 5 (at most 24 alternatives), the size of the rule tree will still be acceptable in most cases, and at the level of the DSL the expression is both compact and correct.
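Such a DSL-level expansion could be sketched as follows (a hypothetical helper, not tree-sitter's actual API; the seq/choice stubs stand in for the real combinators so the sketch is self-contained):

```javascript
// Stand-ins for tree-sitter's seq() and choice() rule constructors.
const seq = (...members) => ({ type: 'SEQ', members });
const choice = (...members) => ({ type: 'CHOICE', members });

// All orderings of an array, built recursively (n! results).
function permutations(items) {
  if (items.length <= 1) return [items];
  return items.flatMap((item, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)])
      .map(rest => [item, ...rest])
  );
}

// permutation(a, b, c) expands into a choice over every ordering.
const permutation = (...rules) =>
  choice(...permutations(rules).map(order => seq(...order)));

// permutation('a', 'b', 'c') yields a CHOICE of 3! = 6 SEQ alternatives;
// with 4 arguments the expansion already has 24.
```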


This issue was found using an experimental fuzzer called tree-saw. Permutation rules being expressed as repeats is one of the most common reasons why programs generated from tree-sitter grammars are syntactically invalid.

tree-saw finds many more similar and unrelated issues in all current grammars, but I won't spam you with those yet, because I don't know whether "type II" correctness is even a goal you are interested in. (Personally, I think it would be fantastic if tree-sitter could reliably indicate syntax errors while typing, or if tree-sitter grammars could double as compiler fuzzers, augmenting the incomplete afl dictionaries that are common now.)

maxbrunsfeld commented Feb 12, 2018

Thus far, I don't really consider "type II correctness" to be a goal for Tree-sitter, because I think it's fundamentally incompatible with one of the most important goals of the project: to be able to parse source code files based solely on their language, without knowing any auxiliary information about the files.

For example, tree-sitter-python should be able to parse any python code you can find, without knowing what python version the code is meant to run on. That means it needs to parse the union of Python 2 and Python 3.

And tree-sitter-c should be able to parse most working C code, without actually reading any of the headers that are included via the preprocessor. That means Tree-sitter has to do the best it can to figure out which identifiers are types, which are values, and which ones are exotic kinds of macros. This is not trivial, but it's what is required if we want truly fast syntax information in a text editor.

So again, because of how little information it receives, Tree-sitter generally needs to parse a superset of any given language. That's why, when implementing the parsers, I have generally prioritized simplicity over strictness.

@maxbrunsfeld

That said, I'm open to making changes for issues like the python example that you gave, as long as it doesn't add significantly to the size of the generated parser or lexer.

Using tree-sitter grammars for compiler fuzzing is a very cool idea. I don't think type II correctness is really required for that though; as you mentioned, afl currently uses a pretty simple dictionary-based strategy for that purpose, so generating programs that are even 80-90% syntactically valid using the current grammars as-is would already be a big improvement over the status quo.


p-e-w commented Feb 12, 2018

Fair enough, although it should be noted that the examples given above are invalid in every version of Python and C. The reason the grammars allow them to pass is likely not the need to maintain superset version compatibility, but the awkwardness of expressing permutations in the existing DSL. I therefore think the issue itself is still valid: it points to a common language construct that grammars currently cannot express concisely.

I guess much hinges on whether your vision for tree-sitter includes things besides syntax highlighting. I've been following this project for a while, and although it's clear that it is currently mostly geared towards integration in Atom for highlighting, there seems to be great potential beyond that.

If identifiers and types were reliably resolved across all grammars, many common "refactoring" actions that are now shelled out to language servers could be performed in milliseconds using a shared codebase. I also imagine that linters lying somewhere between syntax checking and compilation could be built for well-behaved languages like Rust, detecting things like duplicate symbol declarations while typing. Of course, such features don't really work if we can't even detect obvious syntax errors, so tightening the grammars would be a prerequisite.

Before I forget, here's a question I've been struggling with: What exactly are the semantics of the TOKEN rule? Is it simply that subtrees having TOKEN as an ancestor cannot have extras inserted between their constituents?

maxbrunsfeld commented Feb 12, 2018

> Fair enough, although it should be noted that the examples given above are invalid in every version of Python and C.

Yeah, I agree with that. I wasn't trying to say that those particular problems were not solvable within Tree-sitter's constraints. I was just saying that in general, "type II correctness" is not fully achievable within the constraints. I'm definitely open to handling python string literals more strictly.

> I guess much hinges on whether your vision for tree-sitter includes things besides syntax highlighting.

Yeah, syntax highlighting is just the beginning. I would like to make the editor more powerful in many ways. I think that some things (possibly including type inference and accurate symbol resolution) are probably better handled by language servers, but there is a large set of features that just need fast, accurate syntax information and these could be implemented using Tree-sitter.

> many common "refactoring" actions that are now shelled out to language servers could be performed in milliseconds using a shared codebase.
> ...
> Of course, such features don't really work if we can't even detect obvious syntax errors, so tightening the grammars would be a prerequisite.

I'm not sure if I agree with this. I don't think tightening the grammars is a prerequisite for anything (except for highlighting a few additional types of syntax errors). Why would our failure to detect this type of error (like including an extra const in a declaration or an extra u flag in a string literal) prevent us from performing useful refactorings, or detecting duplicate symbols? It seems like an orthogonal concern.

maxbrunsfeld commented Feb 12, 2018

> Before I forget, here's a question I've been struggling with: What exactly are the semantics of the TOKEN rule? Is it simply that subtrees having TOKEN as an ancestor cannot have extras inserted between their constituents?

Yeah, that one's not very obvious; sorry I haven't had time to document this stuff. The TOKEN rule causes an entire rule subtree to be handled atomically by the lexer, as if you had written it all using a single regular expression. This way, its content will be represented in the final syntax tree as a single leaf node. If you don't use TOKEN, there will be a separate leaf node in the tree for each string literal or regexp in your grammar.

You're right though: this implies that extras can't be inserted inside of one of these rules.
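The effect on tree shape can be sketched with stand-in combinators and an invented hex-literal rule (an illustration of the resulting leaf structure, not tree-sitter's implementation):

```javascript
// Stand-ins for tree-sitter's seq() and token() rule constructors.
const seq = (...members) => ({ type: 'SEQ', members });
const token = content => ({ type: 'TOKEN', content });

// With token(): the whole sequence is lexed atomically and appears in the
// syntax tree as a single leaf node, e.g. "0xFF". No extras (whitespace,
// comments) can appear between '0x' and the digits.
const hex_literal = token(seq('0x', /[0-9a-fA-F]+/));

// Without token(): '0x' and the digit pattern each become separate leaf
// nodes in the tree, and extras may be inserted between them.
const hex_pair = seq('0x', /[0-9a-fA-F]+/);
```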

p-e-w commented Feb 13, 2018

> I don't think tightening the grammars is a prerequisite for anything (except for highlighting a few additional types of syntax errors). Why would our failure to detect this type of error (like including an extra const in a declaration or an extra u flag in a string literal) prevent us from performing useful refactorings, or detecting duplicate symbols?

The problem I'm pointing to is that performing semantically correct refactorings on syntactically invalid code is something of a contradiction in terms. Thus a system that does not recognize invalid syntax in some cases necessarily ventures into unsafe territory. For the above examples the results might be harmless, but in other cases invalid syntax might confuse tree-sitter as to what exactly the name of an identifier is, and then the refactoring could break semantics.

All refactoring engines I've seen that are truly "industrial strength" (JDT, C#) at least warn about syntax errors, if there are any, before refactoring. I don't think I could blindly trust an engine that doesn't do this.

@maxbrunsfeld

Yeah, I agree with the general idea: we don't want to be permissive of invalid syntax in ways that will make downstream analysis more difficult. We've grappled with this type of concern a bit already because of some other use cases that we have for Tree-sitter at GitHub besides syntax highlighting. I'm guessing that as people start trying to build packages that do refactoring and stuff, we'll want to make additional changes to the parsers based on the feedback that we get.

@maxbrunsfeld

@p-e-w I really appreciate the investigations you're doing. It's really beneficial to have my decisions regarding tree-sitter questioned in various ways.

I'm going to close this one out because I don't think that building special support for permutations into tree-sitter's runtime is worth the complexity right now. Any additional state in the parsing process brings with it some special concerns relating to the validity of reusing trees during incremental parsing.

> If that is not feasible, I suggest the addition of a permutation rule to the DSL that simply expands permutation(a, b, c) into...

I don't think I'm in favor of making it easier to define that sort of rule because of its potential impact on the size of the parse table. It also would need to be implemented completely differently when used within a token (as with the u and b flags in python's strings) and outside of a token (as with const specifiers in C).

Anyway, thanks for the good discussion.
