Backslash escape not unescaped in ID attributes #864

SHxKM · 2019-10-06T17:48:44Z

I'm using the TocExtension to generate a table of contents for my `.md` files (which may or may not be relevant to the issue below).

In my original text, I'm escaping the single underscore _ with a backslash:

### select\_related

After running the text through markdown, this is the HTML result:

<h3 id="select95related">select_related</h3>

And the generated TOC points to the ID above as well. I'm wondering if this is the expected behaviour? I would expect markdown to simply render:

<h3 id="select_related">select_related</h3>

At least according to one online converter, the result should be: <...id="select_related"...>

Am I missing something?

waylan · 2019-10-06T22:07:06Z

Yep, that's a bug. Thanks for the report.

To handle escaped characters, we convert the escaped character to its Unicode code point surrounded by "START OF TEXT" (STX) and "END OF TEXT" (ETX) Unicode characters. STX (U+0002) and ETX (U+0003) are both zero length characters so, while they are present, you can't see them. In any event, after any Markdown parsing is completed, any escape sequences found in the output are replaced with the character which corresponds with the code point. In this case, the string STX-95-ETX would get replaced with an underscore.
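The placeholder scheme described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the names `escape_char`, `unescape`, and `PLACEHOLDER_RE` are hypothetical and chosen only for this example.

```python
import re

STX = "\u0002"  # START OF TEXT control character
ETX = "\u0003"  # END OF TEXT control character

# Hypothetical helper: wrap the code point of an escaped character
# in STX/ETX so later parsing steps do not treat it as Markdown syntax.
def escape_char(ch: str) -> str:
    return f"{STX}{ord(ch)}{ETX}"

# Matches an STX, one or more digits (the code point), then an ETX.
PLACEHOLDER_RE = re.compile(f"{STX}(\\d+){ETX}")

# Hypothetical helper: turn every placeholder back into its character.
def unescape(text: str) -> str:
    return PLACEHOLDER_RE.sub(lambda m: chr(int(m.group(1))), text)

placeholder = "select" + escape_char("_") + "related"
print(repr(placeholder))      # 'select\x0295\x03related'
print(unescape(placeholder))  # select_related
```

The code point of `_` is 95, which is where the stray `95` in the generated ID comes from.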

Apparently we aren't doing those replacements for the content of id attributes. That makes sense as we would only do such replacements for the text of the HTML, not the tags themselves. To fix this, we need to add a call to the unescape function on the text inserted into any id attributes.

SHxKM · 2019-10-06T22:10:35Z

@waylan Thanks for taking a look, and your hard work on this!

SHxKM · 2019-10-08T04:42:51Z

@waylan - just to make sure my issue report is complete: at least the way markdown is reading the text, both the generated toc and the (parsed) HTML content have the unescaped version of the ID. I only discovered this because I'm using the toc part of the content in a bigger workflow (and my toc links were broken because one had the ID select95related and the original had select_related).

SHxKM · 2019-10-16T07:19:38Z

@waylan - anywhere specific I should look to start poking in the code and expedite this process?

waylan · 2019-10-17T18:18:55Z

The text used for the ID attribute is extracted at markdown/extensions/toc.py#L244. I would do the unescaping there before the text is processed any further. Normally, the unescaping is done by the markdown.postprocessors.UnescapePostprocessor class.

Looking at this I was wondering why the postprocessor wasn't addressing the issue. After all, postprocessors run on the HTML as text, so there is no distinction between attributes and any other text of the HTML. Then I realized that the STX and ETX are being removed by the slugify function, as they are control characters rather than word characters. Therefore, they do not exist in the ID, and the postprocessor has no placeholder left to match. That being the case, the only way to address this is at markdown/extensions/toc.py#L244.
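To illustrate why the order of operations matters, here is a rough sketch. Both `slugify_sketch` and `unescape_sketch` are hypothetical, simplified stand-ins written for this example, not the functions in `toc.py` or `markdown.postprocessors`; they only assume that a slugify step keeps word characters, whitespace, and hyphens.

```python
import re
import unicodedata

STX, ETX = "\u0002", "\u0003"

# Hypothetical, simplified slugify: drop everything that is not a word
# character, whitespace, or hyphen, then collapse runs into the separator.
def slugify_sketch(value: str, separator: str = "-") -> str:
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[{}\s]+".format(re.escape(separator)), separator, value)

# Hypothetical unescape: replace STX<code point>ETX with the character.
def unescape_sketch(text: str) -> str:
    return re.sub(f"{STX}(\\d+){ETX}", lambda m: chr(int(m.group(1))), text)

heading = f"select{STX}95{ETX}related"

# Slugifying first strips STX/ETX and leaves the bare code point behind.
print(slugify_sketch(heading))                   # select95related

# Unescaping first restores the underscore before slugify sees the text.
print(slugify_sketch(unescape_sketch(heading)))  # select_related
```

Once the control characters are gone, nothing downstream can recognize `95` as a placeholder, which is why the unescaping has to happen before the text is slugified.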

The slugify function will strip the STX and ETX characters from placeholders for backslash escaped characters. Therefore, we need to unescape any text before passing it to slugify. Fixes Python-Markdown#864.

SHxKM changed the title ~~backslash before underscore parsed as 95~~ backslash before underscore parsed as "95" Oct 6, 2019

waylan changed the title ~~backslash before underscore parsed as "95"~~ Backslash escape not unescaped in ID attributes Oct 6, 2019

waylan added bug confirmed extension labels Oct 6, 2019

Python-Markdown deleted a comment from SHxKM Nov 25, 2019

Python-Markdown deleted a comment from wayland Nov 25, 2019

waylan mentioned this issue Nov 25, 2019

Unescape IDs in TOC. #881

Merged

waylan closed this as completed in #881 Nov 25, 2019

waylan added a commit that referenced this issue Nov 25, 2019

Unescape IDs in TOC.

15cbaef

The slugify function will strip the STX and ETX characters from placeholders for backslash escaped characters. Therefore, we need to unescape any text before passing it to slugify. Fixes #864.

pawamoy mentioned this issue Dec 10, 2024

toc: Unusual characters in heading ids not well supported #1493

Closed
