Fix errors in grammar syntax #416

DaelonSuzuka · 2022-09-05T19:30:22Z

Per #415, this PR fixes some regexes that appeared to work in VSCode/onigumura but do not work in linguist using PCRE.

When this PR is done, I need to remember to post in github-linguist/linguist#3924 and let the linguist project know that we've fixed our grammars.

Additionally, @lildude has documented a process here that we can use to verify that these changes satisfy linguist's requirements. It's worth investigating if this process or a derivative can be integrated into our CI, which will at least provide syntax validation of our grammar files.

All three of the errors are invalid uses of positive lookbehind using a non-fixed length.

Problem areas:

godot-vscode-plugin/syntaxes/GDScript.tmLanguage.json

Line 298 in e7674c1

"match": "\\b(?<=for\\s[\\w]*\\s)(in)\\b",
godot-vscode-plugin/syntaxes/GDShader.tmLanguage.json

Line 345 in e7674c1

"match": "(?<=[.]\\s*)(?:[xyzw]{2,4}|[rgba]{2,4}|[stpq]{2,4})\b"
godot-vscode-plugin/syntaxes/GDShader.tmLanguage.json

Line 349 in e7674c1

"match": "(?<=[.]\\s*)[a-zA-Z_]\\w*\b"

I've already fixed the first problem, in the GDScript grammar, but unfortunately I've discovered that there was a regression sometime between 1.3.1 and now. The GDShader syntax doesn't even run, and there aren't any error messages about failing to load the grammar.

I'll have to find the cause of this regression before I can fix the two regex errors in GDShader.

DaelonSuzuka · 2022-09-05T19:50:41Z

I have absolutely no idea how this happened, but there were a ton of regex special characters that were missing the second backslash(they need to be double escaped because reasons). I recopied the grammar from @AlfishSoftware's extension, fixed our handful of differences, and it's working again.

I replaced the invalid lookbehinds and it seems to work. Next, I'll run linguist's grammar compiler and see if we pass.

DaelonSuzuka · 2022-09-05T20:16:55Z

I followed the steps to set up linguist, ran script/bootstrap, manually replaced the grammars it download with the fixed versions from this PR, ran script/grammar-compiler add vendor/grammars/godot-vscode-plugin/, and I believe this is a positive result:

daelon@isotope:~/linguist$ sudo script/grammar-compiler add vendor/grammars/godot-vscode-plugin/
latest: Pulling from linguist/grammar-compiler
Digest: sha256:203514bbf04b83d3594245338fc08097301e54139ccec20ab05861302c23ab64
Status: Image is up to date for linguist/grammar-compiler:latest
docker.io/linguist/grammar-compiler:latest
OK! added grammar source 'vendor/grammars/godot-vscode-plugin/'
        new scope: source.gdshader
        new scope: source.gdresource
        new scope: source.gdscript
daelon@isotope:~/linguist$

@lildude does this mean that your compiler is happy now?

@Calinou I think this is done, but it might be safer to wait for the thumbs-up from the linguist people before merging.

syntaxes/GDShader.tmLanguage.json

AlfishSoftware · 2022-09-05T21:55:51Z

By the way, I didn't test it in VSCode (no time at the moment), so I don't know if the dot changes affect .length(). It might be the case where it picks .length as identifierField instead of length() as a identifierFunction as before, so you might want to double-check that length is colored correctly as a function, just in case.

DaelonSuzuka · 2022-09-05T22:05:22Z

I'll definitely double check that, thanks for the tip.

DaelonSuzuka · 2022-09-05T23:06:55Z

It might be the case where it picks .length as identifierField instead of length() as a identifierFunction

It was doing exactly this. Fantastic catch, as usual.

syntaxes/GDShader.tmLanguage.json

AlfishSoftware · 2022-09-06T01:45:06Z

Oh. I just realized review comments aren't published right away, it seems? Sorry about that, I guess you didn't see the first one (marked outdated) before.

DaelonSuzuka · 2022-09-06T02:11:08Z

Oh. I just realized review comments aren't published right away, it seems? Sorry about that, I guess you didn't see the first one (marked outdated) before.

I replied to every comment I can see, and I marked the first one as 'resolved' after I (tried to) fix the relevant bits.

If there are other comments I didn't interact with, then I definitely blame github's rather obtuse UI.

DaelonSuzuka · 2022-09-09T17:44:50Z

@Calinou This PR is also done. I rebased it to master myself just to confirm that #419 doesn't conflict(both PRs touched the GDScript grammar).

AlfishSoftware · 2022-09-16T03:35:38Z

syntaxes/GDScript.tmLanguage.json

-                    "match": "\\b(?<=for\\s[\\w]*\\s)(in)\\b",
+                    "match": "\\b(?:for\\s[\\w]*\\s)(in)\\b",
                    "name": "keyword.control.gdscript"


I didn't test this, but I suspect that it will now consider the whole thing as a keyword.
E.g. for i in array then for i in is a keyword. It shouldn't mark the loop variable (i in this case) as a keyword.

The correct thing to do here is to join this with the for keyword so that one rule recognizes both keywords. You don't need to remove for from control_flow (for should still be a keyword even before in is typed) as long as this is before so it has higher priority; but change this rule to something along the lines of:

{ "match": "\\b(for)\\s+([a-zA-Z_]\\w*)\\s+(in)\\b", "captures": { "1": { "name": "keyword.control.gdscript" }, "2": { "name": "variable.name.gdscript" }, "3": { "name": "keyword.control.gdscript" } } },

I changed \w (which matches [a-zA-Z0-9_] including numbers) since I believe identifiers can't start with a number (correct me if I'm wrong). It's probably worth checking this on other places in the code too. I'm assuming identifiers allow ASCII only.
I also allowed more than one space in the syntax.
Btw, using a tool like https://regex101.com/ while you write it really helps with understanding nuances in regexes and checking that it will really do what is expected, and prevent against corner cases.

I cannot believe that I missed this. You were correct, i was purple.

Your rule worked great. I removed the middle capture group and it's name block, because we don't actually use the variable.name tag for variables. Python doesn't tag variable names, they're just white, so that's the behavior I want to imitate.

I changed \w (which matches [a-zA-Z0-9_] including numbers) since I believe identifiers can't start with a number (correct me if I'm wrong). It's probably worth checking this on other places in the code too. I'm assuming identifiers allow ASCII only.

Many other places used the full form [a-zA-Z_][a-zA-Z_0-9]*, but I just replaced them with [a-zA-Z_]\\w*. I wish we could give a block like this a custom name so we don't have to copy it everywhere.

Btw, using a tool like regex101.com while you write it really helps with understanding nuances in regexes and checking that it will really do what is expected, and prevent against corner cases.

I started off doing this, but having to add all the extra escapes to move a rule over to vscode was too annoying so I stopped. I've thought about switching to yaml files, since the regex strings won't need escapes, but there was some problem with that too.

It's probably worth checking this on other places in the code too.

Many other places used the full form [a-zA-Z_][a-zA-Z_0-9]*, but I just replaced them with [a-zA-Z_]\\w*.

I was actually referring to places capturing identifiers with just \w+, which would allow names starting with numbers, e.g. 1whatever, 0e2f, 0xFACADE. I took a look and saw it in 2 places.
So, unless this is on purpose for some reason, this seems like it makes more sense:

Line 360:

"match": "(:)\\s*([a-zA-Z_]\\w*)?",

Line 375:

"match": "(setget)\\s+([a-zA-Z_]\\w*)?(?:[,]\\s*([a-zA-Z_]\\w*)?)?",

Another good catch, thanks. I found some other places where identifier names were being matched with [[:alpha:]_]\\w*, so I standardized all of them to [a-zA-Z_]\\w*.

DaelonSuzuka · 2022-09-16T16:28:47Z

I checked this latest commit with linguist again, and there's no warnings/errors.

I've decided that I'm definitely going to set up that grammar unit testing tool, but I don't have time right now.

DaelonSuzuka · 2022-10-25T17:38:45Z

I checked the latest changes with linguist and it looks like they're good to go.

Calinou

Thanks!

AlfishSoftware · 2022-10-26T00:36:38Z

Cool! Later I'll replicate those changes to gdshader (removing look-behinds) back in my own extension too, just in case.

aaronfranke · 2022-12-20T23:14:43Z

Does this PR fix #415?

DaelonSuzuka · 2022-12-21T01:20:04Z

Does this PR fix #415?

It does, but there's been some delay in communicating with the Github LInguist project to make sure the original source of the reported issue is fixed. I intended to close all the related items together but then time got away from me.

Thanks for noticing my miss!

Calinou added the bug label Sep 5, 2022

DaelonSuzuka marked this pull request as ready for review September 5, 2022 20:16

DaelonSuzuka mentioned this pull request Sep 5, 2022

Improve CI to check for grammar/highlighting issues #417

Closed

AlfishSoftware reviewed Sep 5, 2022

View reviewed changes

syntaxes/GDShader.tmLanguage.json Outdated Show resolved Hide resolved

AlfishSoftware suggested changes Sep 6, 2022

View reviewed changes

syntaxes/GDShader.tmLanguage.json Outdated Show resolved Hide resolved

syntaxes/GDShader.tmLanguage.json Outdated Show resolved Hide resolved

DaelonSuzuka added 8 commits September 9, 2022 13:31

Replace invalid lookbehind assertion

e5cacae

Fix regression caused by missing double escapes

a8104e8

Replace invalid lookbehind assertions

56147ef

Remove unused fields from GDResource grammar

d912895

Apply feedback from code review

385bc98

Add additional gdshader example, provided by @AlfishSoftware

9f2a0f4

Prevent "identifierField" rule from consuming function calls

b712eff

Apply suggestions from code review

8f318f9

DaelonSuzuka force-pushed the fix_grammar_errors branch from 94fdc33 to 8f318f9 Compare September 9, 2022 17:32

AlfishSoftware reviewed Sep 16, 2022

View reviewed changes

DaelonSuzuka added 3 commits September 16, 2022 11:54

Add more example cases

7166143

Replace broken rule for "in" keyword

e5ab256

SImplify rules for identifier names

fcad85d

Standardized regex rule for identifier names

87feab1

DaelonSuzuka added this to the 1.4.0 milestone Oct 25, 2022

Calinou approved these changes Oct 25, 2022

View reviewed changes

Calinou merged commit 5a67e4c into godotengine:master Oct 25, 2022

DaelonSuzuka mentioned this pull request Nov 29, 2022

Outstanding Grammar Issues github-linguist/linguist#3924

Open

18 tasks

DaelonSuzuka mentioned this pull request Dec 21, 2022

Grammar contains several invalid regex with lookbehind assertions that are not a fixed length #415

Closed

DaelonSuzuka deleted the fix_grammar_errors branch January 6, 2024 21:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix errors in grammar syntax #416

Fix errors in grammar syntax #416

DaelonSuzuka commented Sep 5, 2022 •

edited

Loading

DaelonSuzuka commented Sep 5, 2022 •

edited

Loading

DaelonSuzuka commented Sep 5, 2022

AlfishSoftware commented Sep 5, 2022

DaelonSuzuka commented Sep 5, 2022

DaelonSuzuka commented Sep 5, 2022

AlfishSoftware commented Sep 6, 2022 •

edited

Loading

DaelonSuzuka commented Sep 6, 2022

DaelonSuzuka commented Sep 9, 2022

AlfishSoftware Sep 16, 2022

DaelonSuzuka Sep 16, 2022

AlfishSoftware Sep 16, 2022 •

edited

Loading

DaelonSuzuka Oct 25, 2022

DaelonSuzuka commented Sep 16, 2022

DaelonSuzuka commented Oct 25, 2022

Calinou left a comment •

edited

Loading

AlfishSoftware commented Oct 26, 2022 •

edited

Loading

aaronfranke commented Dec 20, 2022

DaelonSuzuka commented Dec 21, 2022

Fix errors in grammar syntax #416

Fix errors in grammar syntax #416

Conversation

DaelonSuzuka commented Sep 5, 2022 • edited Loading

DaelonSuzuka commented Sep 5, 2022 • edited Loading

DaelonSuzuka commented Sep 5, 2022

AlfishSoftware commented Sep 5, 2022

DaelonSuzuka commented Sep 5, 2022

DaelonSuzuka commented Sep 5, 2022

AlfishSoftware commented Sep 6, 2022 • edited Loading

DaelonSuzuka commented Sep 6, 2022

DaelonSuzuka commented Sep 9, 2022

AlfishSoftware Sep 16, 2022

Choose a reason for hiding this comment

DaelonSuzuka Sep 16, 2022

Choose a reason for hiding this comment

AlfishSoftware Sep 16, 2022 • edited Loading

Choose a reason for hiding this comment

DaelonSuzuka Oct 25, 2022

Choose a reason for hiding this comment

DaelonSuzuka commented Sep 16, 2022

DaelonSuzuka commented Oct 25, 2022

Calinou left a comment • edited Loading

Choose a reason for hiding this comment

AlfishSoftware commented Oct 26, 2022 • edited Loading

aaronfranke commented Dec 20, 2022

DaelonSuzuka commented Dec 21, 2022

DaelonSuzuka commented Sep 5, 2022 •

edited

Loading

DaelonSuzuka commented Sep 5, 2022 •

edited

Loading

AlfishSoftware commented Sep 6, 2022 •

edited

Loading

AlfishSoftware Sep 16, 2022 •

edited

Loading

Calinou left a comment •

edited

Loading

AlfishSoftware commented Oct 26, 2022 •

edited

Loading