-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental line-by-line highlighting breaks some languages #2259
Comments
Currently you can do this by nesting, but yes it's annoying. You might also see my work on Elixir: https://github.com/highlightjs/highlight.js/pull/2257/files It doesn't use endSameAsBegin, but same concept with first matching on the prefix and then the child matchers can focus on just the begin/end syntax, not the prefix.
I'm not sure I completely follow, I will follow your link and see if it's helpful. I've found the "outside" edge of the continuation stuff to be more robust than I expected... ie it remembers the state the parser is in re-opens those spans and continues the processing, even after handling a sublanguage in the middle... but again I'm speaking for what continuation were INTENDED for... they weren't intended for line-by-line processing. |
This looks like the behavior I've come to expect with continuations. |
Perhaps it'd be useful to show a "failing case" and what happens and what you would hope to happen, etc... |
I'd suggest you take a look at what's going on here: I think doing anything line-by-line would be done by walking the tree and dealing with \n inside text nodes however you saw fit... Doing something simple like wrapping every line in it's own span would probably be pretty trivial one that foundation. There has also has been mention of a line-by-line callback especially, but honestly I'd probably rather see this first done by a plugin... no reason plugins couldn't extend the callback system or even have their own. |
FWIW, the tracking issue in Gerrit Code Review project is here: [1]. The current repro is: [2]. [1] https://bugs.chromium.org/p/gerrit/issues/detail?id=7748 |
Commented on that thread just to clear up the fact that we do NOT have a line by line mode. In fact the original author was actually pretty against any type of "line number" feature and such things. And it's still mentioned in our docs today, etc... not that I think we're still THAT hostile to it today, but definitely sounds like a good use for a plugin. We just need better hooks to make these type of things easier to do. |
Ah, interesting. Hopefully one of the Gerrit devs will respond either here or on the Gerrit bug you commented on. (I don't actually work on Gerrit. I work on Chromium and BoringSSL, which use Gerrit, and highlight.js was easy enough to hack on, so I'd been poking at the highlighting bugs I was running into. :-) ) You mentioned in the Gerrit bug that the continuation feature is primarily for internal use. Should it be removed from the documentation? The documentation reads to me like feeding in chunks of partial data is supposed to work.
Hrm. I toyed with that a bit but it seemed there'd be issues with C++'s raw strings ending in Of course, if line-by-line isn't supposed to work anyway, that's no longer a reason to change the current C++ definition. That said, hljs currently won't match
It seems the issue may have since been clarified, but in case there's still confusion, yeah that behavior makes sense. The "issue" is the |
I'd actually be for that if @egor-rogov had no objections. It really has all sorts of caveats and really it's more of a limited thing to work with sublanguages. I was not aware we were encouraging its broader use. Either way I'll still say: Not really intended for line to line.
Continuation (in the core library) is intended to support subLanguages (pretty sure that's the only reason it's there, but I'm open to being corrected)... and personally I don't think we should REQUIRE that any languages parse in a certain way if they naturally find it beneficial to parse in larger chunks. IE, we shouldn't do things that limits peoples choices in writing grammars without VERY good reasons to do so. |
You mean when parsing tiny bits at a time only, yes? |
Nah, maybe the file is truncated or the author forgot to close the string. Usually syntax highlighters will highlight the unfinished thing. |
Ah, now that might be worth thinking about. That would be one case where doing the full match would not turn out so well... |
I'd rather not remove it ('cause it is a part of the API), but explicitly explain that this is intended for internal use and may not behave as one might expect. |
What are your thoughts on something like:
Returning a match object for the |
I'm receptive to the idea, though it'd probably be a more generic Although it's actually not that simple because the begin/end matchers are all compiled together and are not modified at run-time. Otherwise we might have to recursively recompile the regexs every time the regexes were dynamically changed... not sure what type of performance implications that would have.
Huh? Not sure I follow? Accessing the invividual capture groups isn't that hard... but I'm not sure exactly what you're suggesting with this comment. |
Ah, interesting. Okay, I think I can do something with that. I'll try my hand at this and we'll see if you like the result.
I mean that you could imagine the |
Yeah, that'd make sense... you'd have to pass it a reference to
Yes, but we keep track of the match offsets (or else we'd have know way to know which rule matched)... so if we had to "rewrite" them back into a sequential array starting at 1 that wouldn't be that hard to do. |
Not to mention the security implications of something like this. :-) Right now grammars are technically code, but are only evaluated once, and what you see is what you get - reasonably easy to audit. They can't really be altered or exploited by specific input (other than bad regexes perhaps going into infinite loops)... so something like this would add a whole other potential vector for attacks. |
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin. See also issue highlightjs#2259.
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin. See also issue highlightjs#2259.
We can continue the discussion on the PR as I'm interested in more dynamic grammars in general - and if we figure that out we can then decide where it might and might not make sense to use such dynamism in existing grammars. But I just want to be clear this grammar ( If this type of CPP string really doesn't contain any escape sequences then the simplest way of handling it is probably to just match it with a single regex and take advantage of the group matching capabilities, as was done in #1897. Adding docs clarifying that continuation is NOT intended for be using in the fashion you're using it: I simply don't think it's reasonable to force grammas to always care about WHERE line breaks are at... when it's easier for a grammar to match against line breaks, we're not opposed to them doing so. |
Closing this issue as resolved - existing behavior is expected behavior. More than happy to continue the discussion on a BETTER and more reliable way to do line by line though, just not with continuations. |
@davidben Have you followed the discussion here: Given a raw parse tree (of nodes), wouldn't writing a line-by-line outputter be a few lines of code? You merely have to track the current node "stack" (which you have to do for a normal render anyways) and each time you see a new line in text you output the closing tags... then the newline, then re-open the tags, then output the text portion that comes AFTER the new line. Am I missing something? This would be the "correct" way to do line-by-line stuff off the top of my head. |
Also, if you're target is the browser you can apply the line stuff BEFORE the rendering also... when using highlightBlock, HTML is preserved and merged seamlessly... so if you wanted to do something like:
Etc... and it should "just work". |
I think a lot of people make this type of thing WAY TOO complex though... I'd consider just having two entirely separately |
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin. See also issue highlightjs#2259.
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin. See also issue highlightjs#2259.
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. Instead, go back to using separate begin and end regexps, but introduce an endFilter feature to filter out false positive matches. This internally works similarly to endSameAsBegin. See also issue highlightjs#2259.
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. - Add `on:begin` and `on:end` to allow more granular matching when then end match is dynamic and based on a part of the begin match - This deprecates the `endSameAsBegin` attribute. That attribute was a very specific way to solve this problem, but now we have a much more general solution in these added callbacks. Also related: highlightjs#2259.
PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. - Add `on:begin` and `on:end` to allow more granular matching when then end match is dynamic and based on a part of the begin match - This deprecates the `endSameAsBegin` attribute. That attribute was a very specific way to solve this problem, but now we have a much more general solution in these added callbacks. Also related: highlightjs#2259. Co-authored-by: Josh Goebel <[email protected]>
Now that parse tree support is in 10.0 someone really should try doing it this way and see how much simpler it is. :-) |
PR #1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. - Add `on:begin` and `on:end` to allow more granular matching when then end match is dynamic and based on a part of the begin match - This deprecates the `endSameAsBegin` attribute. That attribute was a very specific way to solve this problem, but now we have a much more general solution in these added callbacks. Also related: #2259. Co-authored-by: Josh Goebel <[email protected]>
* (docs) Add Chaos to supported languages (highlightjs#2510) * fix(parser) fixes sublanguage with no rule matches (highlightjs#2506) * fix(parser) fixes sublanguage with no rule matches Resolves highlightjs#2504. * (chore) Add ESLint config and clean up the major stuff (highlightjs#2503) * (chore) eslint:recommended * (chore): eslint_standard * relax eslint rules for language grammars (to discourage rewriting them in one fell swoop; I'd rather have the blame history intact) * remove extra escaping * clean up variables * more camelcase * (docs) Add Visual Basic for Applications (VBA) to supported languages (highlightjs#2512) * (yaml) improve tag support; add verbatim tags (highlightjs#2487) * YAML parse non-word characters as part of tags * adds support for verbatim tags Co-authored-by: Josh Goebel <[email protected]> * fix(javascript/typescript): lambda with parens in parameters fails (highlightjs#2502) * fix(javascript/typescript): lambda with parens in parameters fails - Fixes both JavaScript and TypeScript grammars Fixes samples like: const bad = ((a, b) => [...a, b]); sides.every((length,width=(3+2+(4/5))) => length > 0 ); This is done by counting parens in the regex that finds arrows functions. Currently we can only handle 2 levels of nesting as shown in the second example above. * allow much richer highlighting inside params * improve highlighting inside arguments on typescript * enh(cpp): Improve highlighting of unterminated raw strings PR highlightjs#1897 switched C++ raw strings to use backreferences, however this breaks souce files where raw strings are truncated. Like comments, it would be preferable to highlight them. - Add `on:begin` and `on:end` to allow more granular matching when then end match is dynamic and based on a part of the begin match - This deprecates the `endSameAsBegin` attribute. That attribute was a very specific way to solve this problem, but now we have a much more general solution in these added callbacks. Also related: highlightjs#2259. Co-authored-by: Josh Goebel <[email protected]> * (chore) C-like uses the new END_SAME_AS_BEGIN mode * (chore) Ruby uses END_SAME_AS_BEGIN mode/rule * (parser) make END_SAME_AS_BEGIN a function helper Adds a mode helper to replace the deprecated `endSameAsBegin` attribute. The first match group from the begin regex will be compared to the first match group from the end regex and the end regex will only match if both strings are identical. Note this is more advanced functionality than before since now you can match a larger selection of text yet only use a small portion of it for the actual "end must match begin" portion. * (pgsql) add test for $$ quoting existing behavior - even if that existing behavior is questionable - the ending span should really close before the $$, not after Fixing this would involve delving into the sublanguage behavior and I'm not sure we have time to tackle that right this moment. * (chore) pgsql uses END_SAME_AS_BEGIN mode/rule now also * (docs) rename to `mode_reference`; docs for callbacks - I can never find this file because it's name didn't fully match. - rename callbacks to `on:begin` and `on:end` * prevented setter keyword conflicting with setTimeout|setInterval and highlighted them (highlightjs#2514) (highlightjs#2515) * fix(javascript) prevent setter keyword 'set' conflicting with setTimeout|setInterval (highlightjs#2514) * enh(javascript) setTimeout|setInterval now highlighted (highlightjs#2514) * enh (javascript) clearInterval and clearTimeout now highlighted * add keywords to TypeScript also * (docs) add TLDR instructions for building and testing * (dev) improve developer tool UI * (parser) Build common EMCAscript foundation Builds a common keyword foundation for any grammar that is building on top of JavaScript: - LiveScript - CoffeeScript - TypeScript Also uses this common foundation for JS itself. * (parser) Adds SHEBANG mode * (yaml) Add support for inline sequences and mappings (highlightjs#2513) * Use containers to match inline sequences and mappings * Add string type for inside inline elements * Handle nested inline sequences and mappings * Disallow all braces brackets and commas from strings inside inline mappings or sequences * clean up implementation * feed the linter Co-authored-by: Josh Goebel <[email protected]> * [enh] Add `OPTIMIZE:` and `HACK:` to comment doctags * (build) browser build is CommonJS and IIFE, no more AMD (highlightjs#2511) * (build) browser build is CommonJS and IIFE (global) now * (build) dropping support for AMD, which we never truly supported properly in the first place * (build) add test to make sure browser build works as commonJS module Resolves highlightjs#2505 * fix(parser) Fix freezing issue with illegal 0 width matches (highlightjs#2524) * fix[parser] add edge case handle for illegal 0 width matches * add last ditch catch all that tries to detect other uncaught freezes * (docs) added unicorn-rails-log as an 3rd-party language (highlightjs#2528) - (docs) Add syntax highlighting for Rails Unicorn logging to supported languages. * (docs) fix supported languages link: it moved again! (highlightjs#2533) * fix(ts/js) use identifier to match potential keywords (highlightjs#2519) - (parser) Adds `keywords.$pattern` key to grammar definitions - `lexemes` is now deprecated in favor of `keywords.$pattern` key - enh(typescript) use identifier to match potential keywords, preventing false positives - enh(javascript) use identifier to match potential keywords, preventing false positives * fix(javascript) fix regex inside parens after a non-regex (highlightjs#2531) * make the object attr container smarter * deal with multi-line comments also * comments in any order, spanning multiple lines Essentially makes the object attr container much more sensitive by allowing it to look-ahead thru comments to find object keys - and therefore prevent them from being incorrectly matched by the "value container" rule. * (parser) Add hljs.registerAlias() public API (highlightjs#2540) * Add hljs.registerAlias(alias, languageName) public API * Add .registerAlias() test * enh(cpp) add `pair`, `make_pair`, `priority_queue` as built-ins (highlightjs#2538) * (fix) `fixMarkup` would rarely destroy markup when `useBR` was enabled (highlightjs#2532) * enh(cpp) Recognize `priority_queue`, `pair` as containers (highlightjs#2541) * chore: rename `registerAlias` to `registerAliases` Plural form is clearly better as it's not surprising to the reader to see it being passed an array - where as the singular form might have been. Meanwhile it's also easy to assume that it also supports arrays of any size - including an array with a singular alias. The fact that it can magically accept a string as the first argument might not be obvious, but we document it and even if one didn't know this they could still use the array form of the API without any issue by passing a one item array. * (swift) @objcMembers not completely highlighted (highlightjs#2543) * Fixed @objcMembers in Swift Would match `@objc` first, and the `Members` part would be unhighlighted * Update CHANGES.md * Update swift.js * (docs) add OCL to list of supported languages (highlightjs#2547) * (docs) Add Svelte to list of supported languages (highlightjs#2549) * enh(dart) Add `late` and `required` keywords, and `Never` built-in type (Dart 2.9) (highlightjs#2551) * Add new Dart 2.9 keywords for Null Safety language feature * enh(erlang) add support for underscore separators in numeric literals (highlightjs#2554) * (erlang) add support for underscore separators in numeric literals * (erlang) add tests * (docs) add Jolie to Supported Languages (highlightjs#2556) * (parser/docs) Add jsdoc annotations and TypeScript type file (highlightjs#2517) Adds JSDoc annotations and a .tsconfig that allows TypeScript to be run in it's "allowJS" mode and apply type and sanity checking to JavaScript code also. See Type Checking JavaScript Files. I've been using TypeScript a lot lately and finding it very beneficial and wanted to get those same benefits here but without converting the whole project to TypeScript. It was rough at the beginning but now that this is finished I think it's about 80%-90% of the benefits without any of the TS compilation pipeline. The big difference in being JSDoc for adding typing information vs inline types with TypeScript. Should be super helpful for maintainers using an editor with tight TypeScript integration and the improved docs/comments should help everyone else. - Adds types/index.d.ts to NPM build (should be useful for TypeScript peeps) - Improves documentation of many functions - Adds JSDoc annotations to almost all functions - Adds JSDoc type annotations to variables that can't be inferred - Refactors a few smaller things to allow the TypeScript compiler to better infer what is happening (and usually also made the code clearer) * (parser) highlightBlock result key `re` => `relevance` (highlightjs#2553) * enh(handlebars) Support for sub-expressions, path-expressions, hashes, block-parameters and literals (highlightjs#2344) - `htmlbars` grammar is now deprecated. Use `handlebars` instead. A stub is included so that anyone literally referencing the old `htmlbars` file (say manually requiring it in Node.js, etc) is still covered, but everyone should transition to `handlebars` now. * fix(typescript) Add missing `readonly` keyword (highlightjs#2562) * (docs) Mention `c` is a possible class for C (highlightjs#2577) * fix(groovy) strings are not allowed inside ternary clauses (highlightjs#2565) * fix(groovy) strings are not allowed inside ternary clauses * whitespace can also include tabs * Update @typescript-eslint/parser to the latest version 🚀 (highlightjs#2575) * chore(package): update @typescript-eslint/parser to version 3.0.0 * chore(package): update lockfile package-lock.json Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com> Co-authored-by: Josh Goebel <[email protected]> * Update @typescript-eslint/eslint-plugin to the latest version 🚀 (highlightjs#2576) * chore(package): update @typescript-eslint/eslint-plugin to version 3.0.0 * chore(package): update lockfile package-lock.json Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com> Co-authored-by: Josh Goebel <[email protected]> * (parser) properly escape ' and " in HTML output (highlightjs#2564) * escape quotes also in final HTML output * [style] update test coding style * update markup tests with new escaping This shouldn't be a security issue -- we've always escaped double quotes inside of HTML attribute values (where they could be used to break out of context) - and we've always used double quotes for enclosing attribute values. This just goes all the way and now properly escapes quotes everywhere. Better safe than sorry. * (docs) add changelog entry for last PR * add nnfx theme (highlightjs#2571) * (themes) Add new lioshi theme (highlightjs#2581) * Added Cisco Command Line to SUPPORTED_LANGUAGES.md (highlightjs#2583) * (themes) add `nnfx-dark` theme (highlightjs#2584) * enh(protobuf) Support multiline comments (highlightjs#2597) * enh(java) added support for hexadecimal floating point literals (highlightjs#2509) - Added support for many additional types of floating point literals - Added related tests There still may be a few gaps, but this is a pretty large improvement. Co-authored-by: Josh Goebel <[email protected]> * (chore) Update issue templates (highlightjs#2574) Co-authored-by: Vladimir Jimenez <[email protected]> * enh(toml)(ini) Improve parsing of complex keys (highlightjs#2595) Fixes: highlightjs#2594 * (chore) add `.js` extension to import statements (highlightjs#2601) Adds file extensions to all import specifiers in ./src/ files. This is useful to run the files straight from source with a web browser , Node.js ESM or Deno. - Also add eslint rules regarding extensions for imports * enh(dart) highlight built-in nullable types (highlightjs#2598) * Dart: allow built-in nullable types with trailing ? to be highlighted * enh(csharp) highlight generics in more cases (highlightjs#2599) * (chore) fix tiny style issues, add linting npm task - fixes tiny style issues - adds `npm run lint` for linting the main library source (not languages which are still much messier) * (chore) bump dev dependencies * (chore) upgrade some dev stuff to newer versions * bump v10.1.0 * (chore) bump copyright * (chore) more import below metadata comment Co-authored-by: M. Mert Yıldıran <[email protected]> Co-authored-by: Josh Goebel <[email protected]> Co-authored-by: Hugo Leblanc <[email protected]> Co-authored-by: Peter Massey-Plantinga <[email protected]> Co-authored-by: David Benjamin <[email protected]> Co-authored-by: Vania Kucher <[email protected]> Co-authored-by: SweetPPro <[email protected]> Co-authored-by: Alexandre ZANNI <[email protected]> Co-authored-by: Taufik Nurrohman <[email protected]> Co-authored-by: Lin <[email protected]> Co-authored-by: nicked <[email protected]> Co-authored-by: Nicolas Homble <[email protected]> Co-authored-by: Ryandi Tjia <[email protected]> Co-authored-by: Sam Rawlins <[email protected]> Co-authored-by: Sergey Prokhorov <[email protected]> Co-authored-by: Brian Alberg <[email protected]> Co-authored-by: Nils Knappmeier <[email protected]> Co-authored-by: Martin <[email protected]> Co-authored-by: Derek Lewis <[email protected]> Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com> Co-authored-by: Jim Mason <[email protected]> Co-authored-by: lioshi <[email protected]> Co-authored-by: BMatheas <[email protected]> Co-authored-by: Pavel Evstigneev <[email protected]> Co-authored-by: Vladimir Jimenez <[email protected]> Co-authored-by: Antoine du Hamel <[email protected]> Co-authored-by: TupikovVladimir <[email protected]>
highlight.js supports highlighting text incrementally via the
continuation
parameter to thehighlight
function. Gerrit uses this to highlight line-by-line. However, this causes some features, notably C++ raw string matching, to break. See the discussion here.(Ironically, raw string matching partially worked in #1825 and then I broke it again for us when I tried to make it more featureful in #1897.)
I changed the tests to feed all the markup samples line-by-line to demonstrate the issue. Things mostly work, but when a begin or end element crosses a line boundary, things break.
master...davidben:line-by-line-tests
This suggests that the strategy in #1897 was wrong and probably C++ raw strings should have separate
begin
andend
units. ButendSameAsBegin
doesn't seem quite flexible enough, unless there's some trick I'm missing. One thought isbegin
matchesR"<blah>(
and then we have aendFromBegin
attribute which takes a callback and will compute)<blah>"
for theend
pattern given abegin
.But it's not entirely clear what the general API contract here is. Supposing incremental highlight is a thing highlight.js wants to support (I assume yes, given the
continuation
parameter), it seems there's basically no hope in allowing abegin
orend
unit to span an input boundary. The API needs to return something for the text that was passed in, so you can't lookahead across chunks. That means highlight.js needs to have opinions about what input boundaries need to be supported by languages (One call per character obviously cannot do anything useful. Line-by-line seems reasonable?), or we change the incremental highlight API altogether.Thoughts?
The text was updated successfully, but these errors were encountered: