toc: Unusual characters in heading ids not well supported #1493

pawamoy · 2024-12-10T21:28:23Z

I noticed that toc encodes characters like * as \x0242\x03, 42 being the index of * in the ASCII table. This causes a discrepancy between the permalink of a heading and the link in the table of contents.

mkdir /tmp/toc
cd /tmp/toc
python -m venv .venv
. .venv/bin/activate
python -m pip install mkdocs
python -m mkdocs new .

index file:

# Welcome

Demonstrating an issue with HTML ids and `toc`.

## `*Foo*` { id="\*Foo\*" }

- Click on `*Foo*`'s permalink: `#*Foo*` in the URL.
- Click on `*Foo*` in the table of contents: `#%0242%03Foo%0242%03` in the URL.

mkdocs config:

site_name: My Docs

markdown_extensions:
- attr_list
- toc:
    permalink: true

Serve and observe the behavior described in the index page.

I'm not saying this is a bug. I'm just curious if this is expected, and whether there would be a way improve support for headings with such "unusual" ids. This would help for the work I'm doing with mkdocstrings, where we try to expand our languages support, and some languages might use uncommon characters in object identifiers. Not only toc would have to work, but also mkdocs-autorefs, which picks up ids from the table of contents when registering URLs and anchors to objects.

I believe HTML5 supports any kind of characters in ids. Some of them just cause a bit of pain, like . or #, because they then need to be escaped in CSS selectors.

The text was updated successfully, but these errors were encountered:

pawamoy · 2024-12-10T21:35:51Z

Might be related to #864.

waylan · 2024-12-12T17:43:38Z

I'm not sure I follow. Could you please provide sample input, actual output and expected output as 3 separate code blocks?

Might be related to #864.

That is possible. As explained in that discussion, the various parts of the parser rely on each other and require that the parts run in a certain order. I know you are often recalling parts of the parser in an unusual order, which could result in some unescaping being missed or some similar issue. That's why I ask for input, actual output and expected output.

pawamoy · 2024-12-12T18:29:13Z

Of course, here are some sample inputs/outputs 🙂

Input
```
## `*Foo*` { id="\*Foo\*" }
```

Actual heading output

<h2 id="*Foo*"><code>*Foo*</code><a class="headerlink" href="#*Foo*" title="Permanent link">¶</a></h2>

Actual table of contents output

<a href="#�42�Foo�42�" class="nav-link">*Foo*</a>

Expected table of contents output (only a supposition)
```
<a href="#*Foo*" class="nav-link">*Foo*</a>
```

pawamoy · 2024-12-12T18:37:38Z

The reason why I'm escaping * is that otherwise attr_list fails to parse it, outputting this HTML:

<h2 id="foo-idfoo"><code>*Foo*</code> { id="<em>Foo</em>" }<a class="headerlink" href="#foo-idfoo" title="Permanent link">¶</a></h2>

pawamoy · 2024-12-12T18:42:09Z

This issue might actually not arise from inside mkdocstrings, since we output HTML directly (so attr_list isn't involved). I'll check and report back.

waylan · 2024-12-12T18:45:45Z

I can confirm that this is indeed a bug.

>>> import markdown
>>> md = markdown.Markdown(extensions=['toc', 'attr_list'])
>>> md.convert('## `*Foo*` { id="\*Foo\*" }')
'<h2 id="*Foo*"><code>*Foo*</code></h2>'
>>> md.toc_tokens
[{'level': 2, 'id': '\x0242\x03Foo\x0242\x03', 'name': '*Foo*', 'html': '<code>*Foo*</code>', 'data-toc-label': '', 'children': []}]
>>> md.toc
'<div class="toc">\n<ul>\n<li><a href="#\x0242\x03Foo\x0242\x03">*Foo*</a></li>\n</ul>\n</div>\n'

Presumably, this requires a similar fix to #864 when building the toc_tokens.

pawamoy · 2024-12-12T18:49:21Z

So, here's how we render HTML headings within mkdocstrings:

{# just an example #}
{% filter heading(
    heading_level + 1,
    role="module",
    id="*foo*",
    class="doc doc-heading",
    toc_label="*foo*",
  ) %}
  Hello from Foo!
{% endfilter %}

And here's the final output (heading then TOC):

<h2 id="*foo*" class="doc doc-heading">  Hello from Foo!
<a href="#*foo*" class="headerlink" title="Permanent link">¤</a></h2>

<a href="#*foo*" class="md-nav__link">
    <span class="md-ellipsis">
        *foo*   
    </span>
</a>

(TOC styled by Material for MkDocs)

As we can see mkdocstrings doesn't actually suffer from this, since we bypass attr_list probably, or override some parts of toc.

Anyway, glad you were able to reproduce the issue!

Fixing it will at least make it much easier for mkdocs-autorefs to test its upcoming features 👍

waylan · 2024-12-12T19:19:46Z

So, it appears this is an issue with attr_lists. If the id is defined outside of an attr_list, then it is fine.

>>> md.convert('## \*Foo\*)')
'<h2 id="foo">*Foo*)</h2>'
>>> md.toc_tokens
[{'level': 2, 'id': 'foo', 'name': '*Foo*)', 'html': '*Foo*)', 'data-toc-label': '', 'children': []}]

In fact, we already have a test for that.

There are 2 ways we could fix this.

Call unescape on the value in attr_list before returning it. (attr_list.py#L173L-183)
Call unescape on the value assigned to the toc_token['id'] (toc.py#L394).

Option 1 fixes the instance issue while option 2 ensures any future issues are addressed as well. However, option 2 would be redundant in the exiting test. My inclination is to go with option 1 and expect each individual extension to properly unescape its own output.

pawamoy · 2024-12-12T19:44:06Z

Option 1 sounds reasonable. I don't expect a lot of extensions to be affected by this.

Fixes Python-Markdown#1493.

waylan · 2024-12-12T20:01:44Z

Actually, I couldn't create a test that failed on the attr_list extension alone. So I went with option 2.

waylan added the more-info-needed More information needs to be provided. label Dec 12, 2024

waylan removed the more-info-needed More information needs to be provided. label Dec 12, 2024

waylan added bug Bug report. extension Related to one or more of the included extensions. confirmed Confirmed bug report or approved feature request. labels Dec 12, 2024

waylan added a commit to waylan/markdown that referenced this issue Dec 12, 2024

Backslash Unescape IDs set via attr_list on toc

178852a

Fixes Python-Markdown#1493.

waylan mentioned this issue Dec 12, 2024

Backslash Unescape IDs set via attr_list on toc #1494

Merged

waylan closed this as completed in 7aace29 Dec 13, 2024

waylan closed this as completed in #1494 Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toc: Unusual characters in heading ids not well supported #1493

toc: Unusual characters in heading ids not well supported #1493

pawamoy commented Dec 10, 2024 •

edited

Loading

pawamoy commented Dec 10, 2024

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024 •

edited

Loading

pawamoy commented Dec 12, 2024

pawamoy commented Dec 12, 2024

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024 •

edited

Loading

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024

waylan commented Dec 12, 2024

toc: Unusual characters in heading ids not well supported #1493

toc: Unusual characters in heading ids not well supported #1493

Comments

pawamoy commented Dec 10, 2024 • edited Loading

pawamoy commented Dec 10, 2024

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024 • edited Loading

pawamoy commented Dec 12, 2024

pawamoy commented Dec 12, 2024

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024 • edited Loading

waylan commented Dec 12, 2024

pawamoy commented Dec 12, 2024

waylan commented Dec 12, 2024

pawamoy commented Dec 10, 2024 •

edited

Loading

pawamoy commented Dec 12, 2024 •

edited

Loading

pawamoy commented Dec 12, 2024 •

edited

Loading