Backslash escape not unescaped in ID attributes #864

Closed
SHxKM opened this issue Oct 6, 2019 · 5 comments · Fixed by #881
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.

Comments

@SHxKM

SHxKM commented Oct 6, 2019

I'm using the TocExtension to generate a table of contents for my `.md` files (which may or may not be relevant to the issue below).

In my original text, I'm escaping the single underscore _ with a backslash:

### select\_related

After running the text through markdown, this is the HTML result:

<h3 id="select95related">select_related</h3>

And the generated TOC points to the ID above as well. I'm wondering if this is the expected behaviour? I would expect markdown to simply render:

<h3 id="select_related">select_related</h3>

At least according to one online converter, the result should be: <...id="select_related"...>

Am I missing something?

@SHxKM SHxKM changed the title from `backslash before underscore parsed as 95` to `backslash before underscore parsed as "95"` Oct 6, 2019
@waylan
Member

waylan commented Oct 6, 2019

Yep, that's a bug. Thanks for the report.

To handle escaped characters, we convert the escaped character to its Unicode code point surrounded by the "START OF TEXT" (STX) and "END OF TEXT" (ETX) Unicode characters. STX (U+0002) and ETX (U+0003) are both non-printing control characters so, while they are present, you can't see them. In any event, after all Markdown parsing is completed, any escape sequences found in the output are replaced with the character which corresponds to the code point. In this case, the string STX-95-ETX would get replaced with an underscore.

Apparently we aren't doing those replacements for the content of id attributes. That makes sense as we would only do such replacements for the text of the HTML, not the tags themselves. To fix this, we need to add a call to the unescape function on the text inserted into any id attributes.
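
For readers following along, the placeholder scheme described above can be sketched roughly as below. This is an illustration only, not the library's code: `escape_backslashes`, `unescape`, and the `ESCAPABLE` set are hypothetical names chosen for this example.

```python
import re

STX = "\u0002"  # START OF TEXT control character
ETX = "\u0003"  # END OF TEXT control character

# Assumption: an illustrative set of characters that may be backslash-escaped.
ESCAPABLE = "\\`*_{}[]()>#+-.!"


def escape_backslashes(text: str) -> str:
    """Replace each backslash-escaped character with STX + code point + ETX."""
    pattern = re.compile(r"\\([%s])" % re.escape(ESCAPABLE))
    return pattern.sub(lambda m: f"{STX}{ord(m.group(1))}{ETX}", text)


def unescape(text: str) -> str:
    """Turn every STX + digits + ETX placeholder back into its character."""
    pattern = re.compile(f"{STX}(\\d+){ETX}")
    return pattern.sub(lambda m: chr(int(m.group(1))), text)


escaped = escape_backslashes("select\\_related")
print(repr(escaped))      # the underscore is now the placeholder STX-95-ETX
print(unescape(escaped))  # select_related
```

The code point of `_` is 95, which is where the stray `95` in the reported ID comes from once the surrounding control characters are lost.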

@waylan waylan changed the title backslash before underscore parsed as "95" Backslash escape not unescaped in ID attributes Oct 6, 2019
@waylan waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions. labels Oct 6, 2019
@SHxKM
Author

SHxKM commented Oct 6, 2019

@waylan Thanks for taking a look, and your hard work on this!

@SHxKM
Author

SHxKM commented Oct 8, 2019

@waylan - just to make sure my issue report is complete: at least the way markdown is reading the text, both the generated toc and the parsed HTML content have the unescaped version of the ID. I only discovered this because I'm using the toc part of the content in a bigger workflow, and my toc links were broken because one side used the ID select95related and the other the original select_related.

@SHxKM
Author

SHxKM commented Oct 16, 2019

@waylan - anywhere specific I should look to start poking in the code and expedite this process?

@waylan
Member

waylan commented Oct 17, 2019

The text used for the ID attribute is extracted at markdown/extensions/toc.py#L244. I would do the unescaping there before the text is processed any further. Normally, the unescaping is done by the markdown.postprocessors.UnescapePostprocessor class.

Looking at this, I was wondering why the postprocessor wasn't addressing the issue. After all, postprocessors run on the HTML as text, so there is no distinction between attributes and any other text of the HTML. Then I realized that the STX and ETX are being removed by the slugify function, as they are control characters rather than word characters. Therefore, they do not exist in the ID, and the postprocessor has no placeholder left to match. That being the case, the only way to address this is at markdown/extensions/toc.py#L244.
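
A small sketch of why the order matters, under stated assumptions: `simple_slugify` is a simplified stand-in written for this example (keep word characters, whitespace, and hyphens), not the extension's actual slugify, and `unescape` is a hypothetical helper that mimics what the unescape step does.

```python
import re
import unicodedata

STX, ETX = "\u0002", "\u0003"


def simple_slugify(value: str, separator: str = "-") -> str:
    """Simplified stand-in: drop everything except word chars, whitespace, hyphens."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[%s\s]+" % re.escape(separator), separator, value)


def unescape(text: str) -> str:
    """Hypothetical helper: replace STX + digits + ETX with the matching character."""
    return re.sub(f"{STX}(\\d+){ETX}", lambda m: chr(int(m.group(1))), text)


# Heading text as it looks after inline parsing of "select\_related".
heading_text = f"select{STX}95{ETX}related"

# Slugifying first strips STX/ETX, leaving a bare "95" that can no longer be unescaped.
print(simple_slugify(heading_text))            # select95related

# Unescaping before slugifying keeps the underscore.
print(simple_slugify(unescape(heading_text)))  # select_related
```

Once the control characters are gone, a later pass has no way to tell the leftover `95` from digits the author actually typed, which is why the unescaping has to happen before the text reaches slugify.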

@Python-Markdown Python-Markdown deleted a comment from SHxKM Nov 25, 2019
@Python-Markdown Python-Markdown deleted a comment from wayland Nov 25, 2019
waylan added a commit to waylan/markdown that referenced this issue Nov 25, 2019
The slugify function will strip the STX and ETX characters from
placeholders for backslash escaped characters. Therefore, we need
to unescape any text before passing it to slugify. Fixes Python-Markdown#864.
waylan added a commit that referenced this issue Nov 25, 2019
The slugify function will strip the STX and ETX characters from
placeholders for backslash escaped characters. Therefore, we need
to unescape any text before passing it to slugify. Fixes #864.