possible unexpected behavior from capture sub-patterns (sub-tokenized capture) #74

msftrncs · 2018-09-23T10:23:03Z

I was analyzing the code and noticed that there is no way to anchor (at start) a pattern inside a capture pattern (sub-tokenizing of a capture).

I have not needed to do this in any work so far, but I had assumed that the capture was reprocessed as if it were its own string. What I notice instead is:

- The $ anchor works because the reprocessed string is a substr(0, x) of the original line, ending at the end of the capture.
- The \G, \A, and ^ anchors have limited application:
  1. \G will never match at the start of the sub-tokenized capture (even if the capture starts at the end of the last BEGIN rule, from what I can tell).
  2. \A will only match if the sub-tokenized capture is at the start of the first line of the file.
  3. ^ will only match if the sub-tokenized capture is at the start of the current line.
- Look-behind at the start will work.
- Look-ahead at the end will not work.
- I'm not even going to touch on \Z/\z, as they are outside typical use.
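
The behavior listed above can be reproduced by analogy with Python's `re` module, whose `pos` and `endpos` arguments happen to treat anchors in a similar way. This is only a sketch of the idea, not the vscode-textmate implementation (which uses Oniguruma); the sample line and the choice of capture are assumptions for illustration, and \G is left out because Python's `re` has no equivalent.

```python
import re

# Hypothetical batch line; the "capture" is the length part of the substring construct.
line = "echo %var:~2,0x10%"
start = line.index("0x10")
end = start + len("0x10")


def matches_at(pattern, text, pos, endpos):
    """True if pattern matches at pos, with text treated as ending at endpos."""
    return re.compile(pattern).match(text, pos, endpos) is not None


results = {
    # ^ and \A still refer to the real start of the line, not to pos.
    "caret": matches_at(r"^0x", line, start, end),
    "backslash_A": matches_at(r"\A0x", line, start, end),
    # $ matches because the text is treated as ending at endpos.
    "dollar": matches_at(r"0x10$", line, start, end),
    # Look-behind can still see the "," in front of the capture.
    "look_behind": matches_at(r"(?<=,)0x", line, start, end),
    # Look-ahead cannot see the "%" that follows the capture.
    "look_ahead": matches_at(r"0x10(?=%)", line, start, end),
}

for name, matched in results.items():
    print(f"{name}: {matched}")
```

In this analogy only `dollar` and `look_behind` report a match, which mirrors the asymmetry described in the list.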

This makes no sense to me. If there were a clear document stating that this is how it is meant to work (for all TextMate-based products), I would have to accept it. Since I cannot find one, and when I try to apply logic, I cannot understand why someone would want the anchors to keep their 'document wide' purpose when sub-tokenizing a capture. I cannot imagine a reason to test, inside a capture, whether this point is the first character of the document; I would have done that before capturing the group for sub-tokenizing.

It also makes no sense to me to be able to look behind but not ahead (past the end of the capture); I had expected no ability to see past either edge of the capture. I had expected that ^ would match the start of the capture. I had also figured that, since \A is described as matching the start of a 'string', it would be valid when retokenizing the capture as well. I have used $ in capture retokenizing, but only so that there was something in an END rule, to make it correct.
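
For contrast, here is a minimal sketch of the model I had expected, again in Python and again only as an analogy: the capture is cut out and matched as its own string, so the anchors refer to the capture and nothing outside it is visible. The sample line is a made-up example, and this is not what vscode-textmate does today.

```python
import re

# Hypothetical batch line; the capture is sliced out as an independent string.
line = "echo %var:~2,0x10%"
start = line.index("0x10")
capture = line[start:start + len("0x10")]

expected = {
    # Both start anchors now refer to the start of the capture.
    "caret": re.match(r"^0x", capture) is not None,
    "backslash_A": re.match(r"\A0x", capture) is not None,
    # $ still matches at the end of the capture.
    "dollar": re.search(r"0x10$", capture) is not None,
    # Neither look-around can see past the edges of the capture.
    "look_behind": re.search(r"(?<=,)0x", capture) is not None,
    "look_ahead": re.search(r"0x10(?=%)", capture) is not None,
}

for name, matched in expected.items():
    print(f"{name}: {matched}")
```

Here the three anchors match and both look-arounds fail, which is the symmetric behavior described above.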

FYI at https://github.com/Microsoft/vscode-textmate/blob/6b4f4d7bfde2c1baaf8b25906657fd6d5075a6a3/src/grammar.ts#L540-L544, the && captureIndex.start === 0 is redundant. :)

I am not confirming that anything is broken at this time.

msftrncs · 2018-09-24T06:31:46Z

Actually, I have run into a place where this is affecting me.

In VS Code's Batchfile syntax from https://github.com/mmims/language-batchfile, numbers must either start at the beginning of the line or be preceded by a space or an =, as in (?<=^|\s|=)

I've been experimenting with this particular syntax, and so I have improved the above match to include more possibilities.

However, in variable expansion you can use a substring construct, and the original syntax was hard-coded to accept only [+-]\d+ to work around this. That simplifies it too much, as batch also accepts hex and octal notation here.

I have been rewriting the variable portion to be based on sub-tokenizing a larger capture. After capturing the region that represents either the start or the end, I include the numbers pattern, which fails, since the character in front of the number is either a ':' or a ','. It's not really acceptable to allow numbers after colons anywhere else, though there is probably no harm in it either; but had ^ been available here, the numbers pattern would have worked as is.
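
A rough sketch of that failure, using Python's `re` as a stand-in for the tokenizer. The `NUMBER` pattern is a simplified assumption, not the actual rule from language-batchfile, and because Python requires fixed-width look-behind, (?<=^|\s|=) is rewritten as "^ or a one-character look-behind". The sample line is also made up.

```python
import re

# Simplified stand-in for the grammar's numbers rule (an assumption, not the real one).
NUMBER = re.compile(r"(?:^|(?<=[\s=]))(?:0x[0-9a-fA-F]+|\d+)\b")

# Hypothetical batch line; the capture is the length part of the substring construct.
line = "echo %var:~2,0x10%"
start = line.index("0x10")
end = start + len("0x10")

# Matched in place: ^ refers to the real start of the line and the look-behind
# sees the ",", so the unchanged numbers pattern is rejected.
in_place = NUMBER.match(line, start, end)

# Matched as a standalone string: ^ matches at the start of the capture.
standalone = NUMBER.match(line[start:end])

print("in place:  ", in_place)
print("standalone:", standalone.group() if standalone else None)
```

The same pattern that fails in place succeeds once the capture is treated as its own string, which is why an anchor at the start of the capture would have let the numbers pattern be reused unchanged.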

The numbers pattern is so simple, and the substring construct (with sub-level expansion possible when using delayed expansion) is so complex, that it probably merits a special pattern anyway. For instance, negative hex numbers are allowed here, as the '-' (negate) is treated more as a keyword/option to switch counting from the other end of the string. However, with variable expansion, I would still have trouble detecting whether this number value (or base notation) is a prefix, a suffix, or a middle part, if the author supplied only part of the value as a constant and the rest as expansions.

0%var% - specifying octal
%var%0 - specifying decades
0%var%0 - specifying octaves  :)

Yes, I can work through this. I am actually thinking it would be easier to scope the entire region, or at least the valid digits, as numeric and scope anything else as invalid. But if I want to handle the 0x properly, I would still need the ability to use ^ to anchor at the beginning of the substring.

alexdima added the feature-request Request for new features or functionality label Jul 12, 2019

alexdima added this to the Backlog milestone Jul 12, 2019

matter123 mentioned this issue Aug 16, 2019

Make numeric outer pattern not a range. jeff-hykin/better-cpp-syntax#356

Closed

KapitanOczywisty mentioned this issue Apr 30, 2020

Add scope for PHP primitive types atom/language-php#389

Closed
