Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(cpp) Fix highlighting of unterminated raw strings #2261

Merged
merged 7 commits into from
Apr 27, 2020

Conversation

davidben
Copy link
Contributor

@davidben davidben commented Nov 8, 2019

@yyyc514, thoughts?

PR #1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them.

Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin.

See also issue #2259.

(Also, I have to say, this C++ raw string syntax is a little absurd...)

@joshgoebel joshgoebel self-requested a review November 9, 2019 01:05
@joshgoebel joshgoebel self-assigned this Nov 9, 2019
@joshgoebel
Copy link
Member

joshgoebel commented Nov 9, 2019

I'll play around with a bit. Nice to see this was so easy to accomplish. I'm going to explore going broader though:

       { // heredocs
-        begin: /<<[-~]?'?(\w+)(?:.|\n)*?\n\s*\1\b/,
-        returnBegin: true,
-        contains: [
-          { begin: /<<[-~]?'?/ },
-          { begin: /\w+/,
-            endSameAsBegin: true,
-            contains: [hljs.BACKSLASH_ESCAPE, SUBST],
-          }
-        ]
+        begin: /<<[-~]?'?(\w+)/,
+        end: /(\w+)\b/,
+        onBegin: function(m, state) { state.heredoc = m[1] },
+        onEnd: function(m, state) { return state.heredoc === m[1] },
+        contains: [hljs.BACKSLASH_ESCAPE, SUBST],
       }
     ]
   };

Though now I'm wondering if instead of more keys our keys should get more complex:

      { // heredocs
        begin: {
          re: /<<[-~]?'?(\w+)/,
          callback: function(m, state) { state.heredoc = m[1] }, 
        },
// ...
        end: hljs.withCallback(/(\w+)\b/, function(m, state) { return state.heredoc === m[1] });
        contains: [hljs.BACKSLASH_ESCAPE, SUBST],
      }

@joshgoebel joshgoebel changed the title cpp: Fix highlighting of unterminated raw strings (cpp) Fix highlighting of unterminated raw strings Nov 9, 2019
@joshgoebel joshgoebel added enhancement An enhancement or new feature parser labels Nov 9, 2019
@joshgoebel
Copy link
Member

joshgoebel commented Nov 12, 2019

      {
        begin: /(?:u8?|U|L)?R"([^()\\ ]{0,16})\(/,
        end: /\)([^()\\ ]{0,16})"/,
        onBegin: function(m, state) { state.heredoc = m[1] },
        endWhen: function(m, state) { return state.heredoc === m[1] },
      }

Very open to thoughts on naming. The PR isn't quite right, we'd need lots of thought and additional regarding how this works with endsParent, endsWithParent, etc... endWhen probably needs to happen inside endOfMode so that if the first endWhen failed it would continue up the parent chain looking for a end that might match... vs right now it would find the first end that possibly matches, and then endWhen could cancel that match.

And already you have some good questions about nesting and when both end rules are the same (other than the callback)... if the first callback returns "false" is that a blocker, or does the chain of possible ends continue to be inspected moving up the parents until it finds one that might hit?

These same type of questions would likely apply to the endFilter also... whether a false is a HARD no, or a soft no, etc...

I MUCH prefer a generic solution here though because I very much dislike the idea of adding specific solutions to the parser each time we come across some weird language nuance. For example that's what we did with endSameAsBegin. A VERY specific grammar feature for only TWO languages (pgsql and Ruby)... when that exact problem could have been solved with the more generic solutions being proposed here.

@egor-rogov Any high-level thoughts? Not looking for code critiques at this point but more of the idea conceptually... of adding callbacks to allow for this type of thing, etc...

The callback itself

The callback being proposed takes a normal [almost] regex match object (so it can look at sub matches, etc)... the second variable starts as an empty object and can be used by rules to preserve state between the begin and end rules, such as saving the magic "token" in this case.

@joshgoebel
Copy link
Member

joshgoebel commented Nov 12, 2019

that's what we did with endSameAsBegin. A VERY specific grammar feature for only TWO languages (pgsql and Ruby)... when that exact problem could have been solved with the more generic solutions being proposed here.

Also note this here is really almost EXACTLY the same problem as "endSameAsBegin" but because the solution was hard-coded it can't be applied to this use case... just another point in why I dislike one off grammar "hack" rules vs more generic solutions that can be used more widely.

@joshgoebel joshgoebel mentioned this pull request Nov 12, 2019
15 tasks
@egor-rogov
Copy link
Collaborator

I very much dislike the idea of adding specific solutions to the parser each time we come across some weird language nuance.

Absolutely. I'd gladly sacrifice endSameAsBagin in favour of something more general. I was thinking (back when it was invented) about a way to reference begin's matching groups from end regexp, but couldn't come to a good solution.

Any high-level thoughts? Not looking for code critiques at this point but more of the idea conceptually... of adding callbacks to allow for this type of thing, etc...

I like the idea! Callbacks look promising.
For now I can't see what the use cases for onBegin/onEnd may be. Single endFilter is not enough?

@joshgoebel
Copy link
Member

joshgoebel commented Dec 5, 2019

For now I can't see what the use cases for onBegin/onEnd may be. Single endFilter is not enough?

  • Not all matches result in an actual end match... the file could end before an actual ending is reached.
  • It should probably be possible to reject/ignore a match completely at begin... (hence needing a begin callback of some sort)
  • Perhaps the onBegin wants to simply change the end regex directly at the time the match starts... (not currently allowed, but I'm not opposed to the idea conceptually). But again, this would be trigger by begin (ie, how endSameAsBegin is currently implemented)
  • Or perhaps you're counting tags because you only want to highlight the first occurrence of something or highlight additional occurrences as invalid?

The idea is to build out a flexible plugin/event system. I feel like this comment you're making the same mistake as when endSameAsBegin was added... you're only imagining a single use case (filtering) instead of more generic system that people could build on top of.

@egor-rogov
Copy link
Collaborator

* Or perhaps you're counting tags because you only want to highlight the first occurrence of something or highlight additional occurrences as invalid?

Yes, this makes sense.

The idea is to build out a flexible plugin/event system. I feel like this comment you're making the same mistake as when endSameAsBegin was added... you're only imagining a single use case (filtering) instead of more generic system that people could build on top of.

I understand the intent, and I'm in no way against it. Just wanted to imagine a situation in which onBegin/onEnd can be useful.

@joshgoebel
Copy link
Member

joshgoebel commented Feb 27, 2020

My latest thoughts on this include a response object:

{
        begin: /(?:u8?|U|L)?R"([^()\\ ]{0,16})\(/,
        end: /\)([^()\\ ]{0,16})"/,
        onBegin: function(m, state, resp) { state.heredoc = m[1] },
        onEnd: function(m, state, resp) { if (state.heredoc === m[1]) resp.ignoreMatch()  },
      }

It's possible you could have resp.data and then you'd be back to two params... but then I come to a naming issue. Technically when you have a potential match (remember our callback is going to decide what to do with the match) you have two main possibilities:

  • "abort"... this is an indication of a failed match but you STILL advance the cursor, ie skipping the content in question
  • "ignore"... failed match, but the cursor is not advanced.

I think I could use help with the naming. Examples of both:

beginKeywords in the past used an "abort" style match... So if you have:

B.class = A.class

And beginKeywords: 'class'... the .class would have been matched, but the . would be detected and then the entire match ignored AND the cursor moved to right before the =... ie. any later modes you have that might have matched ".cl" or "las" would fail because the cursor had already moved past.

Meanwhile an "ignore" match would ignore class (because of the .) but your cursor would still be at the .... so .cl WOULD match... or las WOULD match.

I'm thinking perhaps we only should expose the ignore behavior, but still I could use some better names just for internal use and talking about these things.


Abort is very strange IMHO because typically rules that FAIL to match do NOT eat content... but the prior behavior of beginKeyword (and maybe future behavior) means this is something we have to think about or at least name conceptually.

There is also the word skip but our mode definitions already give it a very particular meaning, so not sure if it would be good to reuse it here.

@joshgoebel joshgoebel added this to the 10.1 milestone Feb 29, 2020
@joshgoebel
Copy link
Member

joshgoebel commented Mar 3, 2020

Example of usage:

// in Ruby
Object.assign({
  begin: /(\w+)/, end: /(\w+)/,
  contains: [hljs.BACKSLASH_ESCAPE, SUBST],
}, hljs.END_FIRST_MATCH_SAME_AS_BEGIN)

// or perhaps
hljs.END_SAME_AS_BEGIN({
  begin: /(\w+)/, end: /(\w+)/,
  contains: [hljs.BACKSLASH_ESCAPE, SUBST],
});

// in C++
Object.assign({
  begin: /(?:u8?|U|L)?R"([^()\\ ]{0,16})\(/,
  end: /\)([^()\\ ]{0,16})"/,
}, hljs.END_FIRST_MATCH_SAME_AS_BEGIN)

And the "logic" can then simply become a shared mode snippet:

export const END_FIRST_MATCH_SAME_AS_BEGIN = {
   'after:begin': (m, resp) => { resp.data.heredoc = m[1]; },
   'before:end': (m, resp) => { if (resp.data.heredoc !== m[1]) resp.ignoreMatch(); }
 };

Not quite as pretty as endSameAsBegin: true, but completely explicit, no "magic" mode flags (ala, endSameasBegin) required.

I'm open to better naming than Response. My HTTP/web background here is showing... but technically it is designed to be the response object of the callback...

src/lib/modes.js Outdated Show resolved Hide resolved
docs/reference.rst Outdated Show resolved Hide resolved
@egor-rogov
Copy link
Collaborator

I really like it. Very flexible and concise.
Response sound good, but too general. Maybe CallbackResponse or something like that?
I consider this PR mergeable, as soon as these new callbacks are described in the docs.
Btw, what do you think about deprecating endSameAsBegin?

@joshgoebel
Copy link
Member

Btw, what do you think about deprecating endSameAsBegin?

That would be the next step. Though I'm not sure what that would mean... that we could rip it out in say v11?

@joshgoebel
Copy link
Member

export const END_SAME_AS_BEGIN = function(mode) {
  return Object.assign(mode,
    {
    'after:begin': (m, resp) => { resp.data._beginMatch = m[1]; },
    'before:end': (m, resp) => { if (resp.data._beginMatch !== m[1]) resp.ignoreMatch() }
    });
};

Now I'm wondering if defaulting to the first match group is assuming too much? Hmmm...

@joshgoebel joshgoebel force-pushed the truncated-raw-string branch 3 times, most recently from d116781 to 0a410f9 Compare March 12, 2020 18:29
@joshgoebel
Copy link
Member

joshgoebel commented Mar 12, 2020

@egor-rogov Check out the docs. Now that I had to document them I'm thinking perhaps on:begin and on:end might be better names...

I'm not sure we ever need such granularity as before:begin and after:begin... and if not I'm not sure the prior naming makes good sense. Its kind of weird to remember one is before and one is after, etc... and begs the question why isn't is "before begin" (and maybe it should be, lol).

I guess I could see some crazy data transforms if we allowed that, but we could do the same thing by adding more flexibility to response... ie response.addContent() or some such...

Does the on naming seem clearer to you? Plus I think on still allows for adding after later IF it proved necessary...

@egor-rogov
Copy link
Collaborator

Btw, what do you think about deprecating endSameAsBegin?

That would be the next step. Though I'm not sure what that would mean... that we could rip it out in say v11?

Just so.

Does the on naming seem clearer to you?

Yes, I have the same feeling that on is better.

Maybe that END_SAME_AS_BEGIN snippet worth mentioning in plugin-recipes?

@joshgoebel
Copy link
Member

Maybe that END_SAME_AS_BEGIN snippet worth mentioning in plugin-recipes?

Are all the other MODE helpers documented anywhere? It'd probably be good if they were, but if you document one you have to document all...

@egor-rogov
Copy link
Collaborator

Maybe that END_SAME_AS_BEGIN snippet worth mentioning in plugin-recipes?

Are all the other MODE helpers documented anywhere? It'd probably be good if they were, but if you document one you have to document all...

They are not, but END_SAME_AS_BEGIN is the first one that uses plugins and it is a good example. Anyway, that's a separate story.

Copy link
Collaborator

@egor-rogov egor-rogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was sure that I've already approved this, but it turns out that no...

@joshgoebel
Copy link
Member

They are not, but END_SAME_AS_BEGIN is the first one that uses plugins and it is a good example. Anyway, that's a separate story.

It uses callbacks, not plugins. :-) But yes separate story.

@egor-rogov
Copy link
Collaborator

It uses callbacks, not plugins. :-)

Oh, right... 🤦 There are no callback recipes yet (:

@joshgoebel joshgoebel force-pushed the truncated-raw-string branch 2 times, most recently from 85eaf3b to 3f370b0 Compare April 27, 2020 18:29
@joshgoebel joshgoebel force-pushed the truncated-raw-string branch 2 times, most recently from 3dc9f1b to b99f275 Compare April 27, 2020 19:16
davidben and others added 7 commits April 27, 2020 15:20
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this
breaks souce files where raw strings are truncated. Like comments, it
would be preferable to highlight them.

- Add `on:begin` and `on:end` to allow more granular matching when
  then end match is dynamic and based on a part of the begin match
- This deprecates the `endSameAsBegin` attribute. That attribute was
  a very specific way to solve this problem, but now we have a much
  more general solution in these added callbacks.

Also related: highlightjs#2259.

Co-authored-by: Josh Goebel <[email protected]>
Adds a mode helper to replace the deprecated `endSameAsBegin` attribute.

The first match group from the begin regex will be compared to the first
match group from the end regex and the end regex will only match if both
strings are identical.

Note this is more advanced functionality than before since now you can
match a larger selection of text yet only use a small portion of it for
the actual "end must match begin" portion.
- even if that existing behavior is questionable
- the ending span should really close before the $$, not after

Fixing this would involve delving into the sublanguage behavior and I'm
not sure we have time to tackle that right this moment.
- I can never find this file because it's name didn't fully match.
- rename callbacks to `on:begin` and `on:end`
@joshgoebel
Copy link
Member

@davidben Sorry for the delay on this one - and it's not exactly what you originally submitted - but I still gave you the byline for the bug fix. :-) Merging now.

@joshgoebel joshgoebel merged commit d4b047b into highlightjs:master Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement or new feature parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants