possible unexpected behavior from capture sub-patterns (sub-tokenized capture) #74

Open
msftrncs opened this issue Sep 23, 2018 · 1 comment
Labels
feature-request Request for new features or functionality
Comments

@msftrncs
Contributor

I was analyzing the code and noticed that there is no way to anchor (at start) a pattern inside a capture pattern (sub-tokenizing of a capture).

I have not yet needed to do this in any of my own work, but I had assumed that the capture was reprocessed as if it were its own string. What I notice instead is:

  1. The $ anchor works because the reprocessed string is substr(0, x) of the original line, ending at the end of the capture.
  2. The \G, \A, and ^ anchors have limited application.
    1. \G will never match at the start of the sub-tokenized capture (even if the capture starts at the end of the last BEGIN rule, from what I can tell).
    2. \A will only match if the sub-tokenized capture is at the start of the first line of the file.
    3. ^ only matches if the sub-tokenized capture is at the start of the current line.
  3. Look-behind at the start will work.
  4. Look-ahead at the end will not work.
  5. I'm not even going to touch on \Z/\z, as they are outside typical use.
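To make the list above concrete, here is a rough sketch of the same effect using Python's `re` module rather than the Oniguruma engine that vscode-textmate uses, so treat it as an analogy only. Python's `pos`/`endpos` arguments search a window of the full string without slicing it, which resembles matching against the line truncated at the end of the capture. The sample line and indices are made up, and `\G` is left out because Python's `re` has no equivalent.

```python
import re

# Hypothetical line; the "capture" is the digits "12" at indices [3, 5).
line = "x=:12,rest"
start, end = 3, 5


def find(pattern, text=line, pos=start, endpos=end):
    """Search the window [pos, endpos) without slicing, so text before pos stays visible."""
    m = re.compile(pattern).search(text, pos, endpos)
    return m.group(0) if m else None


end_anchor = find(r"\d+$")        # "$" matches at the truncated end
caret = find(r"^\d+")             # "^" still means the real start of the line
string_start = find(r"\A\d+")     # "\A" still means the real start of the string
look_behind = find(r"(?<=:)\d+")  # the ":" before the capture is visible
look_ahead = find(r"\d+(?=,)")    # the "," after the capture is hidden

# What I had expected: the capture treated as its own string.
sliced = re.search(r"^\d+", line[start:end])

print(end_anchor, caret, string_start, look_behind, look_ahead, bool(sliced))
```

Only the sliced search lets `^` match at the start of the capture, and slicing would also hide the look-behind context.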

This makes no sense to me. If there were a clear document stating that this is how it is meant to work (for all TextMate-based products), I would have to accept it. Since I cannot find one, I can only apply logic, and I cannot see why anyone would want the anchors to keep their 'document-wide' purpose when sub-tokenizing a capture. I cannot imagine a reason to test, inside a capture, whether this point is the first character of the document; I would have done that before capturing the group for sub-tokenizing.

It also makes no sense to me to be able to look behind but not ahead (past the end of the capture), although I had expected no ability to see past either edge of the capture. I had expected that ^ would match the start of the capture. I had also figured that since \A is described as matching the start of a 'string', it would be valid when retokenizing the capture too. I have used $ in capture retokenizing, but only so that there was something in an END rule to make it correct.

FYI at https://github.com/Microsoft/vscode-textmate/blob/6b4f4d7bfde2c1baaf8b25906657fd6d5075a6a3/src/grammar.ts#L540-L544, the && captureIndex.start === 0 is redundant. :)

I am not confirming that anything is broken at this time.

@msftrncs
Contributor Author

Actually, I have run into a place where this is affecting me.

In VS Code's Batchfile syntax from https://github.com/mmims/language-batchfile, numbers must either start at the beginning of the line or be preceded by a space or an =, as in (?<=^|\s|=).

I've been experimenting with this particular syntax, and so I have improved the above match to include more possibilities.

However, variable expansion allows a substring construct, and the original syntax was hard-coded to accept only [+-]\d+ to work around this. That simplifies it too much, as batch also accepts hex and octal notation here.

I have been rewriting the variable portion to be based on sub-tokenizing a larger capture. After capturing the region that represents either the start or the end, I include the numbers pattern, which fails because the character in front of the number is either a ':' or a ','. It's not really acceptable to allow numbers after colons anywhere else (though there is probably no harm either), but had ^ been available here, the numbers pattern would have worked as is.
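A small sketch of that failure, again using Python's `re` as a stand-in for Oniguruma. Python rejects the variable-width look-behind (?<=^|\s|=), so the pattern below is split into an alternation that should behave the same; the `\d+` body, the sample line, and the capture position are all assumptions for illustration, not the grammar's actual rules.

```python
import re

# Stand-in for (?<=^|\s|=): Python needs fixed-width look-behind, so "^" is
# moved into an alternation. The \d+ body is only a placeholder.
NUMBER = re.compile(r"(?:^|(?<=[\s=]))\d+")

line = "%var:~10,5%"
start = line.index(",") + 1  # hypothetical capture covering the length "5"
end = start + 1

# Matched in place (window into the full line): "^" refers to the real line
# start and the look-behind sees the ",", so nothing matches.
in_place = NUMBER.search(line, start, end)

# Matched as its own string: "^" now sits at the start of the capture.
isolated = NUMBER.search(line[start:end])

# The pattern still does its normal job after "=".
after_equals = NUMBER.search("set x=5")

print(in_place, isolated.group(0), after_equals.group(0))
```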

The numbers pattern is so simple, and the substring construct (with sub-level expansion possible when using delayed expansion) is so complex, that it probably merits a special pattern anyway. For instance, negative hex numbers are allowed here, as the '-' (negate) is treated more as a keyword/option to switch counting to the other end of the string. However, with variable expansion, I would still have trouble detecting whether a number value (or base notation) is a prefix, a suffix, or a middle part, if the author supplies only part of the value as a constant and the rest as expansions.

0%var% - specifying octal
%var%0 - specifying decades
0%var%0 - specifying octaves  :)

Yes, I can work through this. I am actually thinking it would be easier to scope the entire region, or at least the valid digits, as numeric and scope anything else as invalid. But if I want to handle the 0x prefix properly, I would still need the ability to use ^ to anchor at the beginning of the substring.
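As a rough illustration of why the anchor matters for base prefixes, here is a sketch that classifies an offset once it is treated as its own string. It is Python, not grammar JSON, and the exact digit rules (optional sign, 0x for hex, leading 0 for octal) are my assumptions about what batch accepts, not taken from the grammar. `classify_offset` is a hypothetical helper name.

```python
import re

# Assumed notation rules; each pattern must match the whole isolated text,
# which is the role "^" (plus "$") would play inside a sub-tokenized capture.
RULES = [
    ("hex", re.compile(r"[+-]?0[xX][0-9a-fA-F]+")),
    ("octal", re.compile(r"[+-]?0[0-7]+")),
    ("decimal", re.compile(r"[+-]?(?:0|[1-9][0-9]*)")),
]


def classify_offset(text):
    """Return the notation of an isolated offset, or "invalid" if none fits."""
    for name, pattern in RULES:
        if pattern.fullmatch(text):
            return name
    return "invalid"


for sample in ["0x1F", "-0x10", "017", "10", "08", "0%var%"]:
    print(sample, classify_offset(sample))
```

A constant mixed with an expansion, such as 0%var%, lands in "invalid" here, which is the prefix/suffix ambiguity described above.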
