Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

alythobani · 2024-08-04T15:53:12Z

Finally got around to this :) thanks to @langshangyuan for filing #231! (And to @lebigot for noticing as well and commenting on #222)

Summary

This PR does two things:

Fixes jumpToHeading to not match within code blocks (e.g. in the case of Python comments within a codeblock)
Improves jumpToLink to also be able to jump to standalone hyperlinks

Implementation details

jumpToHeading

Initial approach - single regex with lookarounds

Initially I explored adjusting HEADING_REGEX to use lookarounds to ensure that a given match isn't inside a codeblock, as @langshangyuan suggested (it was a good idea). I did get something working, although it was a little complex:

const HEADING_REGEX = /(?<=(?<!.)(?:(?:(?:(?!^```).)*?^```){2})*(?:(?!^```).)*)^#{1,6} /gms;

The above regex matches on heading patterns (^#{1,6} ) only if the lookbehind assertion passes. The lookbehind asserts that from the top of the string (note that in multiline mode ^ means "start of line", so we use (?<!.) to assert "start of string") up until the potential heading, there are an even number of codeblock markers.

I realized two issues though with this approach:

It fails in the case that the code block starts with more than 3 backticks. In Markdown, fenced codeblocks are allowed to start/end with more than 3 backticks. E.g. this GitHub article notes you can use quadruple backticks to make a code block, and triple backticks within the codeblock will just be displayed rather than ending the codeblock. This Python-Markdown article also notes that a fenced codeblock can start with any number >=3 of backticks, and must end with that many as well. Experimenting in Obsidian, you can see that a codeblock ends as long as its end marker has at least as many backticks as its start marker (otherwise the codeblock extends to the end of the document).
Performance: at each heading match, the lookbehind needs to traverse all the way from the top of the string until the match, to check that there are an even number of preceding code block markers.

Actual implementation - two regexes (one for codeblocks, one for headings)

This approach helps with both issues above:

More than three backticks: While adjusting the lookbehind to account for >=3 backticks might have been possible, it would have made it even more complex/unreadable, which isn't great for codebase maintenance. With the two-regex approach, the codeblock regex on its own isn't too complex, while still accurate (ensuring each codeblock ends with at least as many backticks as it started with, or extends to the end of the document otherwise).

Performance: with this two-regex approach, we need two linear iterations (one for each regex) to get all codeblock matches and all "naive" heading matches. Then for each "naive" heading match, we check against all codeblock matches to make sure it's not in a codeblock. Technically this filter could still be non-linear ($O(N^2)$) time: say there are C codeblocks and H headers and the document is N lines long, then this filter takes $O(C*H)$ time. In theory, if the document has a LOT of codeblocks and headers, we could have e.g. $C = H = N/3$ meaning $O(N^2/9)$ time. But in practice, most documents will have a more reasonable (closer to constant-level) number of codeblocks in them, so the $O(C*H)$ filter should be much more efficient than a lookbehind that traverses from the start of the string until each naive heading match, for every naive heading match.

jumpToLink

I'd been meaning to get around to this, so this was a good opportunity; I've now added a URL_REGEX_STRING subpattern to LINK_REGEX so that jumpToLink can also jump to standalone URLs. I used a simple \w+://\S+ regex pattern, so this will match URLs like chrome://extensions or file://<some_path> or https://google.com.

Screen recordings

jumpToHeading

Screen.Recording.2024-08-04.at.8.48.12.AM.mov

jumpToLink

Screen.Recording.2024-08-04.at.8.51.44.AM.mov

Thanks again for noticing this everyone! Let me know if anyone has any questions/comments/concerns/suggestions.

jumpToPattern now accepts an optional filterMatch param to make this happen

alythobani · 2024-08-04T16:03:51Z

motions/jumpToLink.ts

-const WIKILINK_REGEX_STRING = "\\[\\[[^\\]\\]]+?\\]\\]";
-const MARKDOWN_LINK_REGEX_STRING = "\\[[^\\]]+?\\]\\([^)]+?\\)";
-const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`;
+const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]";


Also went ahead and improved the regexes here a little to be more concise and more accurate (e.g. I don't think the original [^\\]\\]] was actually doing what I wanted it to do in WIKILINK_REGEX_STRING; now we just use a lazy .*? to catch all content up to the closing ]])

alythobani · 2024-08-04T16:07:27Z

motions/jumpToHeading.ts

+ * start of a line. It either ends on the nearest future line that starts with at least as many
+ * backticks (\1 back-reference), or extends to the end of the string if no such future line exists.
+ */
+const FENCED_CODEBLOCK_REGEX = /(^```+)(.*?^\1|.*)/gms;


Small note that the s flag is needed in order for . to match newlines as well

alythobani · 2024-08-04T16:08:14Z

motions/jumpToHeading.ts

+/** Naive Regex for a Markdown heading (H1 through H6). "Naive" because it does not account for
+ * whether the match is within a codeblock (e.g. it could be a Python comment, not a heading).
+ */
+const NAIVE_HEADING_REGEX = /^#{1,6} /gm;


Also went ahead and adjusted the heading regex to only match H1-H6, since H7/above don't exist in Markdown

alythobani · 2024-08-04T16:10:42Z

motions/jumpToLink.ts

-const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`;
+const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]";
+const MARKDOWN_LINK_REGEX_STRING = "\\[.*?\\]\\(.*?\\)";
+const URL_REGEX_STRING = "\\w+://\\S+";


Kept this at a simple URL regex - technically URL validation is way more complex than this, but I think that could affect both code readability and potentially regex performace. This won't match urls without protocols (e.g. won't match google.com), but will match e.g. https://a. Which I think is good enough for a vim motion like this, but let me know if anyone has other thoughts.

alythobani added 2 commits August 4, 2024 07:52

fix: jumpToHeading should not jump to "headings" within codeblocks

557cc60

jumpToPattern now accepts an optional filterMatch param to make this happen

jumpToLink can now jump to standalone hyperlinks

20eb7a0

alythobani commented Aug 4, 2024

View reviewed changes

Adjust/clarify docstring for jumpToNextLink

43d708b

esm7 approved these changes Aug 4, 2024

View reviewed changes

esm7 merged commit 2795bb6 into esm7:master Aug 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

alythobani commented Aug 4, 2024 •

edited

Loading

alythobani Aug 4, 2024 •

edited

Loading

alythobani Aug 4, 2024

alythobani Aug 4, 2024

alythobani Aug 4, 2024

Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

Conversation

alythobani commented Aug 4, 2024 • edited Loading

Summary

Implementation details

jumpToHeading

Initial approach - single regex with lookarounds

Actual implementation - two regexes (one for codeblocks, one for headings)

jumpToLink

Screen recordings

jumpToHeading

jumpToLink

alythobani Aug 4, 2024 • edited Loading

Choose a reason for hiding this comment

alythobani Aug 4, 2024

Choose a reason for hiding this comment

alythobani Aug 4, 2024

Choose a reason for hiding this comment

alythobani Aug 4, 2024

Choose a reason for hiding this comment

alythobani commented Aug 4, 2024 •

edited

Loading

alythobani Aug 4, 2024 •

edited

Loading