Expressing permutations #130
Thus far, I don't really consider "type II correctness" to be a goal for Tree-sitter, because I think it's fundamentally incompatible with one of the most important goals of the project: to be able to parse source code files based solely on their language, without knowing any auxiliary information about the files. For example, And So again, because of how little information it receives, Tree-sitter generally needs to parse a superset of any given language. That's why, when implementing the parsers, I have generally prioritized simplicity over strictness.
That said, I'm open to making changes for issues like the python example that you gave, as long as it doesn't add significantly to the size of the generated parser or lexer. Using tree-sitter grammars for compiler fuzzing is a very cool idea. I don't think type II correctness is really required for that though; as you mentioned, afl currently uses a pretty simple dictionary-based strategy for that purpose, so generating programs that are even 80-90% syntactically valid using the current grammars as-is would already be a big improvement over the status quo.
Fair enough, although it should be noted that the examples given above are invalid in every version of Python and C, and the reason the grammar allows them to pass is likely not the need to maintain superset version compatibility but the awkwardness of expressing permutations in the existing DSL. I therefore think that the issue in itself is still valid as it points to a common language construct that grammars cannot currently express concisely. I guess much hinges on whether your vision for tree-sitter includes things besides syntax highlighting. I've been following this project for a while and although it's clear that it is currently mostly geared towards integration in Atom for highlighting, there seems to be great potential beyond that. If identifiers and types were reliably resolved across all grammars, many common "refactoring" actions that are now shelled out to language servers could be performed in milliseconds using a shared codebase. I also imagine linters that lie somewhere between syntax checking and compilation could be built for well-behaved languages like Rust, detecting things like duplicate symbol declarations while typing. Of course, such features don't really work if we can't even detect obvious syntax errors, so tightening the grammars would be a prerequisite. Before I forget, here's a question I've been struggling with: What exactly are the semantics of the
Yeah, I agree with that. I wasn't trying to say that those particular problems were not solvable within Tree-sitter's constraints. I was just saying that in general, "type II correctness" is not fully achievable within the constraints. I'm definitely open to handling python string literals more strictly.
Yeah, syntax highlighting is just the beginning. I would like to make the editor more powerful in many ways. I think that some things (possibly including type inference and accurate symbol resolution) are probably better handled by language servers, but there is a large set of features that just need fast, accurate syntax information and these could be implemented using Tree-sitter.
I'm not sure if I agree with this. I don't think tightening the grammars is a prerequisite for anything (except for highlighting a few additional types of syntax errors). Why would our failure to detect this type of error (like including an extra
Yeah, that one's not very obvious; sorry I haven't had time to document this stuff. The You're right though: this implies that
The problem I'm pointing to is that performing semantically correct refactorings on syntactically invalid code is something of a contradiction in terms. Thus a system that does not recognize invalid syntax in some cases necessarily ventures into unsafe territory. For the above examples the results might be harmless, but in other cases invalid syntax might confuse tree-sitter as to what exactly the name of an identifier is, and then the refactoring could break semantics. All refactoring engines I've seen that are truly "industrial strength" (JDT, C#) at least warn about syntax errors if there are any before refactoring. I don't think I could blindly trust an engine that doesn't do this.
Yeah, I agree with the general idea: we don't want to be permissive of invalid syntax in ways that will make downstream analysis more difficult. We've grappled with this type of concern a bit already because of some other use cases that we have for Tree-sitter at GitHub besides syntax highlighting. I'm guessing that as people start trying to build packages that do refactoring and stuff, we'll want to make additional changes to the parsers based on the feedback that we get.
@p-e-w I really appreciate the investigations you're doing. It's really beneficial to have my decisions regarding tree-sitter questioned in various ways. I'm going to close this one out because I don't think that building special support for permutations into tree-sitter's runtime is worth the complexity right now. Any additional state in the parsing process brings with it some special concerns relating to the validity of reusing trees during incremental parsing.
I don't think I'm in favor of making it easier to define that sort of rule because of its potential impact on the size of the parse table. It also would need to be implemented completely differently when used within a Anyway, thanks for the good discussion.
Tree-sitter currently thinks that
is valid C, and that
is a Python string literal. The root cause for both issues is the difficulty of expressing a permutation of rules, which is why the rule matching a string literal in the Python grammar starts with
Qualifiers, access modifiers and prefixes are tokens that programming languages commonly allow to occur in any order but without repetition. It would be nice to be able to express such constructs concisely in tree-sitter grammars, perhaps as
This is something most parser generators struggle with, owing to a theoretical limitation of CFGs that requires combinatorial expansion to describe permutations. However, tree-sitter already supports (non-CFG) externally parsed rules and I wonder if it might not be possible to track the required state directly in the parser library to support PERMUTATION as a new rule primitive.
If that is not feasible, I suggest the addition of a permutation rule to the DSL that simply expands
into
When the number of arguments is less than 5, the size of the rule tree will still be acceptable in most cases, and at the level of the DSL the expression is both compact and correct.
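The combinatorial expansion described above can be sketched in plain JavaScript, the language of the grammar DSL. This is only an illustration under stated assumptions: `permutation`, `orderings` and `nonEmptySubsets` are hypothetical helpers, not part of tree-sitter, and `seq` and `choice` are local stand-ins for the DSL combinators so that the snippet runs on its own under Node.js. It also assumes the intended meaning is "each rule at most once, at least one rule, in any order", which may differ from the expansion originally proposed.

```javascript
// Stand-ins for the DSL combinators (assumption: inside a real grammar.js
// the built-in seq/choice would be used instead of these plain objects).
const seq = (...members) => ({ type: "SEQ", members });
const choice = (...members) => ({ type: "CHOICE", members });

// Every ordering of `items`: n! arrays for n distinct items.
function orderings(items) {
  if (items.length <= 1) return [items.slice()];
  return items.flatMap((head, i) => {
    const rest = [...items.slice(0, i), ...items.slice(i + 1)];
    return orderings(rest).map((tail) => [head, ...tail]);
  });
}

// Every non-empty subset of `items`, selected with a bitmask.
function nonEmptySubsets(items) {
  const subsets = [];
  for (let mask = 1; mask < 1 << items.length; mask++) {
    subsets.push(items.filter((_, i) => mask & (1 << i)));
  }
  return subsets;
}

// Hypothetical helper: expand into one choice() whose alternatives are
// seq()s containing each rule at most once, in every possible order.
function permutation(...rules) {
  const alternatives = nonEmptySubsets(rules).flatMap(orderings);
  return choice(...alternatives.map((order) => seq(...order)));
}

// Two prefixes expand to four alternatives: r, b, rb, br.
const prefixes = permutation("r", "b");
console.log(prefixes.members.map((alt) => alt.members.join("")));

// Four arguments already expand to 64 alternatives.
console.log(permutation("a", "b", "c", "d").members.length);
```

Under this reading the number of alternatives for n arguments is the sum of n!/(n-k)! for k from 1 to n, so four arguments give 64 alternatives and five give 325, which is the growth that motivates keeping the argument count small.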
This issue was found using an experimental fuzzer called tree-saw. Permutation rules being expressed as repeats is one of the most common reasons why programs generated from tree-sitter grammars are syntactically invalid.
tree-saw finds many more similar and unrelated issues in all current grammars, but I won't spam you with those yet because I don't know whether "type II" correctness is even a goal you are interested in (personally, I think it would be fantastic if tree-sitter could reliably indicate syntax errors while typing, or if tree-sitter grammars could double as compiler fuzzers, augmenting the incomplete afl dictionaries that are common now).
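To illustrate why a repeat-based rule hurts grammar-driven generation, here is a small self-contained sketch. It is not tree-saw's algorithm: `sequencesFromRepeat` and `hasRepeat` are hypothetical helpers, and the prefix rule is assumed to be shaped like a repeat over a choice of single-letter prefixes. The sketch enumerates every short token sequence such a rule accepts and flags the ones that use a token twice.

```javascript
// Enumerate every token sequence of length 1..maxLen accepted by a rule
// shaped like repeat1(choice(...tokens)) (assumed shape, for illustration).
function sequencesFromRepeat(tokens, maxLen) {
  let current = [[]];
  const all = [];
  for (let len = 1; len <= maxLen; len++) {
    // Extend each sequence of the previous length by every token.
    current = current.flatMap((prefix) => tokens.map((t) => [...prefix, t]));
    all.push(...current);
  }
  return all;
}

// True when some token occurs more than once in the sequence.
const hasRepeat = (sequence) => new Set(sequence).size !== sequence.length;

const generated = sequencesFromRepeat(["r", "b"], 2);
const invalid = generated.filter(hasRepeat);

console.log(generated.map((s) => s.join(""))); // r, b, rr, rb, br, bb
console.log(invalid.map((s) => s.join("")));   // rr, bb
```

With two Python string prefixes, two of the six sequences of length at most two (`rr` and `bb`) are not valid prefixes, so a generator that samples uniformly from such a rule would emit an invalid prefix a third of the time in this toy setting.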