-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233
Fix: jumpToHeading shouldn't match within codeblocks (also improve jumpToLink a bit) #233
Conversation
jumpToPattern now accepts an optional filterMatch param to make this happen
const WIKILINK_REGEX_STRING = "\\[\\[[^\\]\\]]+?\\]\\]"; | ||
const MARKDOWN_LINK_REGEX_STRING = "\\[[^\\]]+?\\]\\([^)]+?\\)"; | ||
const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`; | ||
const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also went ahead and improved the regexes here a little to be more concise and more accurate (e.g. I don't think the original [^\\]\\]]
was actually doing what I wanted it to do in WIKILINK_REGEX_STRING
; now we just use a lazy .*?
to catch all content up to the closing ]]
)
* start of a line. It either ends on the nearest future line that starts with at least as many | ||
* backticks (\1 back-reference), or extends to the end of the string if no such future line exists. | ||
*/ | ||
const FENCED_CODEBLOCK_REGEX = /(^```+)(.*?^\1|.*)/gms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note that the s
flag is needed in order for .
to match newlines as well
/** Naive Regex for a Markdown heading (H1 through H6). "Naive" because it does not account for | ||
* whether the match is within a codeblock (e.g. it could be a Python comment, not a heading). | ||
*/ | ||
const NAIVE_HEADING_REGEX = /^#{1,6} /gm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also went ahead and adjusted the heading regex to only match H1-H6, since H7/above don't exist in Markdown
const LINK_REGEX_STRING = `${WIKILINK_REGEX_STRING}|${MARKDOWN_LINK_REGEX_STRING}`; | ||
const WIKILINK_REGEX_STRING = "\\[\\[.*?\\]\\]"; | ||
const MARKDOWN_LINK_REGEX_STRING = "\\[.*?\\]\\(.*?\\)"; | ||
const URL_REGEX_STRING = "\\w+://\\S+"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept this at a simple URL regex - technically URL validation is way more complex than this, but I think that could affect both code readability and potentially regex performace. This won't match urls without protocols (e.g. won't match google.com
), but will match e.g. https://a
. Which I think is good enough for a vim motion like this, but let me know if anyone has other thoughts.
Finally got around to this :) thanks to @langshangyuan for filing #231! (And to @lebigot for noticing as well and commenting on #222)
Summary
This PR does two things:
jumpToHeading
to not match within code blocks (e.g. in the case of Python comments within a codeblock)jumpToLink
to also be able to jump to standalone hyperlinksImplementation details
jumpToHeading
Initial approach - single regex with lookarounds
Initially I explored adjusting
HEADING_REGEX
to use lookarounds to ensure that a given match isn't inside a codeblock, as @langshangyuan suggested (it was a good idea). I did get something working, although it was a little complex:const HEADING_REGEX = /(?<=(?<!.)(?:(?:(?:(?!^```).)*?^```){2})*(?:(?!^```).)*)^#{1,6} /gms;
The above regex matches on heading patterns (
^#{1,6}
) only if the lookbehind assertion passes. The lookbehind asserts that from the top of the string (note that in multiline mode^
means "start of line", so we use(?<!.)
to assert "start of string") up until the potential heading, there are an even number of codeblock markers.I realized two issues though with this approach:
It fails in the case that the code block starts with more than 3 backticks. In Markdown, fenced codeblocks are allowed to start/end with more than 3 backticks. E.g. this GitHub article notes you can use quadruple backticks to make a code block, and triple backticks within the codeblock will just be displayed rather than ending the codeblock. This Python-Markdown article also notes that a fenced codeblock can start with any number >=3 of backticks, and must end with that many as well. Experimenting in Obsidian, you can see that a codeblock ends as long as its end marker has at least as many backticks as its start marker (otherwise the codeblock extends to the end of the document).
Performance: at each heading match, the lookbehind needs to traverse all the way from the top of the string until the match, to check that there are an even number of preceding code block markers.
Actual implementation - two regexes (one for codeblocks, one for headings)
This approach helps with both issues above:
More than three backticks: While adjusting the lookbehind to account for >=3 backticks might have been possible, it would have made it even more complex/unreadable, which isn't great for codebase maintenance. With the two-regex approach, the codeblock regex on its own isn't too complex, while still accurate (ensuring each codeblock ends with at least as many backticks as it started with, or extends to the end of the document otherwise).
Performance: with this two-regex approach, we need two linear iterations (one for each regex) to get all codeblock matches and all "naive" heading matches. Then for each "naive" heading match, we check against all codeblock matches to make sure it's not in a codeblock. Technically this filter could still be non-linear ($O(N^2)$) time: say there are C codeblocks and H headers and the document is$O(C*H)$ time. In theory, if the document has a LOT of codeblocks and headers, we could have e.g. $C = H = N/3$ meaning $O(N^2/9)$ time. But in practice, most documents will have a more reasonable (closer to constant-level) number of codeblocks in them, so the $O(C*H)$ filter should be much more efficient than a lookbehind that traverses from the start of the string until each naive heading match, for every naive heading match.
N
lines long, then this filter takesjumpToLink
I'd been meaning to get around to this, so this was a good opportunity; I've now added a
URL_REGEX_STRING
subpattern toLINK_REGEX
so thatjumpToLink
can also jump to standalone URLs. I used a simple\w+://\S+
regex pattern, so this will match URLs likechrome://extensions
orfile://<some_path>
orhttps://google.com
.Screen recordings
jumpToHeading
Screen.Recording.2024-08-04.at.8.48.12.AM.mov
jumpToLink
Screen.Recording.2024-08-04.at.8.51.44.AM.mov
Thanks again for noticing this everyone! Let me know if anyone has any questions/comments/concerns/suggestions.