Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233

Merged
merged 3 commits into from
Aug 4, 2024

Conversation

alythobani
Copy link
Contributor

@alythobani alythobani commented Aug 4, 2024

Finally got around to this :) thanks to @langshangyuan for filing #231! (And to @lebigot for noticing as well and commenting on #222)

Summary

This PR does two things:

  1. Fixes jumpToHeading to not match within code blocks (e.g. in the case of Python comments within a codeblock)
  2. Improves jumpToLink to also be able to jump to standalone hyperlinks

Implementation details

jumpToHeading

Initial approach - single regex with lookarounds

Initially I explored adjusting HEADING_REGEX to use lookarounds to ensure that a given match isn't inside a codeblock, as @langshangyuan suggested (it was a good idea). I did get something working, although it was a little complex:

const HEADING_REGEX = /(?<=(?<!.)(?:(?:(?:(?!^```).)*?^```){2})*(?:(?!^```).)*)^#{1,6} /gms;

The above regex matches on heading patterns (^#{1,6} ) only if the lookbehind assertion passes. The lookbehind asserts that from the top of the string (note that in multiline mode ^ means "start of line", so we use (?<!.) to assert "start of string") up until the potential heading, there are an even number of codeblock markers.

I realized two issues though with this approach:

  1. It fails in the case that the code block starts with more than 3 backticks. In Markdown, fenced codeblocks are allowed to start/end with more than 3 backticks. E.g. this GitHub article notes you can use quadruple backticks to make a code block, and triple backticks within the codeblock will just be displayed rather than ending the codeblock. This Python-Markdown article also notes that a fenced codeblock can start with any number >=3 of backticks, and must end with that many as well. Experimenting in Obsidian, you can see that a codeblock ends as long as its end marker has at least as many backticks as its start marker (otherwise the codeblock extends to the end of the document).

  2. Performance: at each heading match, the lookbehind needs to traverse all the way from the top of the string until the match, to check that there are an even number of preceding code block markers.

Actual implementation - two regexes (one for codeblocks, one for headings)

This approach helps with both issues above:

More than three backticks: While adjusting the lookbehind to account for >=3 backticks might have been possible, it would have made it even more complex/unreadable, which isn't great for codebase maintenance. With the two-regex approach, the codeblock regex on its own isn't too complex, while still accurate (ensuring each codeblock ends with at least as many backticks as it started with, or extends to the end of the document otherwise).

Performance: with this two-regex approach, we need two linear iterations (one for each regex) to get all codeblock matches and all "naive" heading matches. Then for each "naive" heading match, we check against all codeblock matches to make sure it's not in a codeblock. Technically this filter could still be non-linear ($O(N^2)$) time: say there are C codeblocks and H headers and the document is N lines long, then this filter takes $O(C*H)$ time. In theory, if the document has a LOT of codeblocks and headers, we could have e.g. $C = H = N/3$ meaning $O(N^2/9)$ time. But in practice, most documents will have a more reasonable (closer to constant-level) number of codeblocks in them, so the $O(C*H)$ filter should be much more efficient than a lookbehind that traverses from the start of the string until each naive heading match, for every naive heading match.

jumpToLink

I'd been meaning to get around to this, so this was a good opportunity; I've now added a URL_REGEX_STRING subpattern to LINK_REGEX so that jumpToLink can also jump to standalone URLs. I used a simple \w+://\S+ regex pattern, so this will match URLs like chrome://extensions or file://<some_path> or https://google.com.

Screen recordings

jumpToHeading

Screen.Recording.2024-08-04.at.8.48.12.AM.mov

jumpToLink

Screen.Recording.2024-08-04.at.8.51.44.AM.mov

Thanks again for noticing this everyone! Let me know if anyone has any questions/comments/concerns/suggestions.

const WIKILINK_REGEX_STRING = "\\[\\[[^\\]\\]]+?\\]\\]";
const MARKDOWN_LINK_REGEX_STRING = "\\[[^\\]]+?\\]\\([^)]+?\\)";
const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`;
const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]";
Copy link
Contributor Author

@alythobani alythobani Aug 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also went ahead and improved the regexes here a little to be more concise and more accurate (e.g. I don't think the original [^\\]\\]] was actually doing what I wanted it to do in WIKILINK_REGEX_STRING; now we just use a lazy .*? to catch all content up to the closing ]])

* start of a line. It either ends on the nearest future line that starts with at least as many
* backticks (\1 back-reference), or extends to the end of the string if no such future line exists.
*/
const FENCED_CODEBLOCK_REGEX = /(^```+)(.*?^\1|.*)/gms;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small note that the s flag is needed in order for . to match newlines as well

/** Naive Regex for a Markdown heading (H1 through H6). "Naive" because it does not account for
* whether the match is within a codeblock (e.g. it could be a Python comment, not a heading).
*/
const NAIVE_HEADING_REGEX = /^#{1,6} /gm;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also went ahead and adjusted the heading regex to only match H1-H6, since H7/above don't exist in Markdown

const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`;
const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]";
const MARKDOWN_LINK_REGEX_STRING = "\\[.*?\\]\\(.*?\\)";
const URL_REGEX_STRING = "\\w+://\\S+";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept this at a simple URL regex - technically URL validation is way more complex than this, but I think that could affect both code readability and potentially regex performace. This won't match urls without protocols (e.g. won't match google.com), but will match e.g. https://a. Which I think is good enough for a vim motion like this, but let me know if anyone has other thoughts.

@esm7 esm7 merged commit 2795bb6 into esm7:master Aug 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants