Support parsing unterminated statements #65

XVilka · 2021-04-27T04:35:49Z

Currently the parser can only successfully parse terminated statements, like:

const char * myarray[25];

But if you feed it something like

const char * [25]

or

const char *

it emits an error.
It would be beneficial to support parsing such statements too.
@thestr4ng3r proposed the following change in the grammar:

diff --git a/grammar.js b/grammar.js
index 6a5fa25..5dc99a3 100644
--- a/grammar.js
+++ b/grammar.js
@@ -51,6 +51,7 @@ module.exports = grammar({
   word: $ => $.identifier,

   rules: {
+    the_actual_root: $ => $.type_descriptor,
     translation_unit: $ => repeat($._top_level_item),

     _top_level_item: $ => choice(

thestr4ng3r · 2021-04-27T05:52:07Z

To clarify: in addition to parsing a full translation_unit like int a() { x = (const char *[25])y; }, we need the same application to parse only the part inside the cast, like const char *[25], which is a type_descriptor.

So essentially, we would need a way to choose the root rule at runtime, which isn't really tree-sitter-c specific. I wonder if that is even theoretically possible with how the code generator works.

Alternatively, we would have to use two grammars for this: one is the original tree-sitter-c, and the other is conceptually what is shown in the issue description (which of course breaks parsing regular translation_units).

XVilka · 2021-05-06T08:34:22Z

Maybe it is worth transferring the issue to the tree-sitter repository then? @maxbrunsfeld

maxbrunsfeld · 2021-05-13T16:03:08Z

When you need to parse a fragment of incomplete source code (like a type_descriptor), can you just surround the fragment with a "context" that turns it into a valid C translation unit, and then extract out the piece of the syntax tree that you're interested in?

For example, to parse a type_descriptor, take the input string, append the suffix string x;, parse that combined string, and then take the subtree for the relevant byte range.

There is a long-standing Tree-sitter issue about selecting alternative root rules at runtime, but that is going to be complex to implement. This workaround seems quite straightforward, and it scales well if you have many different rules that you want to try.
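A minimal sketch of the bookkeeping this workaround needs, for illustration only: wrap the fragment in a context and remember which byte range the fragment occupies in the combined string. The helper name wrapFragment is hypothetical, and the offsets are computed as UTF-8 byte offsets, which is an assumption about how the parser binding in use reports node positions. The actual parsing step is left out.

```javascript
// Hypothetical helper (not part of tree-sitter): wrap a fragment in a
// context and record the byte range the fragment occupies in the result.
function wrapFragment(prefix, fragment, suffix) {
  // Assumption: node positions are compared as UTF-8 byte offsets.
  const start = Buffer.byteLength(prefix, "utf8");
  const end = start + Buffer.byteLength(fragment, "utf8");
  return { source: prefix + fragment + suffix, start, end };
}

// Append the suffix "x;" to an unterminated fragment, as suggested above.
const wrapped = wrapFragment("", "const char *", "x;");
console.log(wrapped.source);             // const char *x;
console.log(wrapped.start, wrapped.end); // 0 12
```

After parsing wrapped.source, the caller would pick the subtree whose range matches start and end.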

thestr4ng3r · 2021-05-13T17:58:20Z

Appending x; would not work for type_descriptor, because the result parses as a declaration and the tree contains no type_descriptor node:

char *x;

(translation_unit [0, 0] - [1, 0]
  (declaration [0, 0] - [0, 8]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (pointer_declarator [0, 5] - [0, 7]
      declarator: (identifier [0, 6] - [0, 7]))))

But we could in theory use a cast, so assuming we want to parse const char *[42], wrap it like so:

void a() { (const char *[42])x; }

(translation_unit [0, 0] - [1, 0]
  (function_definition [0, 0] - [0, 33]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (function_declarator [0, 5] - [0, 8]
      declarator: (identifier [0, 5] - [0, 6])
      parameters: (parameter_list [0, 6] - [0, 8]))
    body: (compound_statement [0, 9] - [0, 33]
      (expression_statement [0, 11] - [0, 31]
        (cast_expression [0, 11] - [0, 30]
          type: (type_descriptor [0, 12] - [0, 28]
            (type_qualifier [0, 12] - [0, 17])
            type: (primitive_type [0, 18] - [0, 22])
            declarator: (abstract_pointer_declarator [0, 23] - [0, 28]
              declarator: (abstract_array_declarator [0, 24] - [0, 28]
                size: (number_literal [0, 25] - [0, 27]))))
          value: (identifier [0, 29] - [0, 30]))))))

The reason we can't do this in practice is that the string we want to parse could perform a kind of injection and easily escape our wrapping. For example, when we try to parse int)0; now_i_have_escaped(); //, we want to get a meaningful error rather than a well-parsed int type_descriptor with some garbage in the wrapped tree.

But I think the first workaround proposed in tree-sitter/tree-sitter#870, which is to always prepend some magic string that tells the parser how to proceed, could work very well for us.
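For the cast wrapping, one conceivable guard against the injection case is to accept the result only when the type_descriptor node covers the whole fragment. This is a sketch under assumptions, not something proposed in this thread: the helper names are hypothetical, nodes are modelled as plain { type, start, end } objects rather than real parser output, and the 12..15 range for the injected input is an assumed outcome. Fragments with trailing whitespace or comments would likely need extra care.

```javascript
// Hypothetical guard for the cast wrapping: the parse is accepted only if
// the type_descriptor node spans exactly the bytes of the input fragment.
const CAST_PREFIX = "void a() { (";
const CAST_SUFFIX = ")x; }";

function wrapInCast(fragment) {
  // Assumption: positions are UTF-8 byte offsets into the wrapped source.
  const start = Buffer.byteLength(CAST_PREFIX, "utf8");
  const end = start + Buffer.byteLength(fragment, "utf8");
  return { source: CAST_PREFIX + fragment + CAST_SUFFIX, start, end };
}

// `node` is a plain { type, start, end } object standing in for whatever
// the parser binding returns for the cast's type node.
function coversFragment(node, fragment) {
  const { start, end } = wrapInCast(fragment);
  return node.type === "type_descriptor" && node.start === start && node.end === end;
}

// Range taken from the tree above: type_descriptor [0, 12] - [0, 28].
console.log(coversFragment({ type: "type_descriptor", start: 12, end: 28 },
                           "const char *[42]")); // true

// Injection attempt: only `int` would be the type (assumed range 12..15),
// which is shorter than the fragment, so the result is rejected.
console.log(coversFragment({ type: "type_descriptor", start: 12, end: 15 },
                           "int)0; now_i_have_escaped(); //")); // false
```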

XVilka · 2021-05-14T09:30:31Z

Just for the record, this is what I came up with:

    [$._type_specifier, $._expression],
    [$._type_specifier, $._expression, $.macro_type_specifier],
    [$._type_specifier, $.macro_type_specifier],
+   [$.type_expression, $._abstract_declarator],
+   [$.type_expression],
    [$.sized_type_specifier],
  ],

  word: $ => $.identifier,

  rules: {
-    translation_unit: $ => repeat($._top_level_item),
+    translation_unit: $ => choice(
+            repeat1($.type_expression),
+            repeat1($._top_level_item)
+    ),
+
+    type_expression: $ => seq(
+       '__TYPE_EXPRESSION',
+       repeat($.type_qualifier),
+       field('type', $._type_specifier),
+       repeat($.abstract_pointer_declarator),
+       repeat($.abstract_array_declarator),
+       repeat($.abstract_pointer_declarator),
+    ),

You can see examples of what it can parse in XVilka@fed7bd0:

__TYPE_EXPRESSION const int* [5]
__TYPE_EXPRESSION volatile uint8_t* [2]
__TYPE_EXPRESSION const uintptr_t* []
__TYPE_EXPRESSION struct s1 *

__TYPE_EXPRESSION struct s2 {
  int x;
  float y : 5;
} [5]

XVilka mentioned this issue May 13, 2021

Ability to change the root rule at runtime tree-sitter/tree-sitter#1105

Open
