Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toc: Unusual characters in heading ids not well supported #1493

Closed
pawamoy opened this issue Dec 10, 2024 · 10 comments · Fixed by #1494
Closed

toc: Unusual characters in heading ids not well supported #1493

pawamoy opened this issue Dec 10, 2024 · 10 comments · Fixed by #1494
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.

Comments

@pawamoy
Copy link
Contributor

pawamoy commented Dec 10, 2024

I noticed that toc encodes characters like * as \x0242\x03, 42 being the index of * in the ASCII table. This causes a discrepancy between the permalink of a heading and the link in the table of contents.

mkdir /tmp/toc
cd /tmp/toc
python -m venv .venv
. .venv/bin/activate
python -m pip install mkdocs
python -m mkdocs new .

index file:

# Welcome

Demonstrating an issue with HTML ids and `toc`.

## `*Foo*` { id="\*Foo\*" }

- Click on `*Foo*`'s permalink: `#*Foo*` in the URL.
- Click on `*Foo*` in the table of contents: `#%0242%03Foo%0242%03` in the URL.

mkdocs config:

site_name: My Docs

markdown_extensions:
- attr_list
- toc:
    permalink: true

Serve and observe the behavior described in the index page.

I'm not saying this is a bug. I'm just curious if this is expected, and whether there would be a way improve support for headings with such "unusual" ids. This would help for the work I'm doing with mkdocstrings, where we try to expand our languages support, and some languages might use uncommon characters in object identifiers. Not only toc would have to work, but also mkdocs-autorefs, which picks up ids from the table of contents when registering URLs and anchors to objects.

I believe HTML5 supports any kind of characters in ids. Some of them just cause a bit of pain, like . or #, because they then need to be escaped in CSS selectors.

@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 10, 2024

Might be related to #864.

@waylan
Copy link
Member

waylan commented Dec 12, 2024

I'm not sure I follow. Could you please provide sample input, actual output and expected output as 3 separate code blocks?

Might be related to #864.

That is possible. As explained in that discussion, the various parts of the parser rely on each other and require that the parts run in a certain order. I know you are often recalling parts of the parser in an unusual order, which could result in some unescaping being missed or some similar issue. That's why I ask for input, actual output and expected output.

@waylan waylan added the more-info-needed More information needs to be provided. label Dec 12, 2024
@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 12, 2024

Of course, here are some sample inputs/outputs 🙂

  • Input

    ## `*Foo*` { id="\*Foo\*" }
  • Actual heading output

    <h2 id="*Foo*"><code>*Foo*</code><a class="headerlink" href="#*Foo*" title="Permanent link">¶</a></h2>
  • Actual table of contents output

    <a href="#�42�Foo�42�" class="nav-link">*Foo*</a>
  • Expected table of contents output (only a supposition)

    <a href="#*Foo*" class="nav-link">*Foo*</a>

@waylan waylan removed the more-info-needed More information needs to be provided. label Dec 12, 2024
@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 12, 2024

The reason why I'm escaping * is that otherwise attr_list fails to parse it, outputting this HTML:

<h2 id="foo-idfoo"><code>*Foo*</code> { id="<em>Foo</em>" }<a class="headerlink" href="#foo-idfoo" title="Permanent link"></a></h2>

@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 12, 2024

This issue might actually not arise from inside mkdocstrings, since we output HTML directly (so attr_list isn't involved). I'll check and report back.

@waylan
Copy link
Member

waylan commented Dec 12, 2024

I can confirm that this is indeed a bug.

>>> import markdown
>>> md = markdown.Markdown(extensions=['toc', 'attr_list'])
>>> md.convert('## `*Foo*` { id="\*Foo\*" }')
'<h2 id="*Foo*"><code>*Foo*</code></h2>'
>>> md.toc_tokens
[{'level': 2, 'id': '\x0242\x03Foo\x0242\x03', 'name': '*Foo*', 'html': '<code>*Foo*</code>', 'data-toc-label': '', 'children': []}]
>>> md.toc
'<div class="toc">\n<ul>\n<li><a href="#\x0242\x03Foo\x0242\x03">*Foo*</a></li>\n</ul>\n</div>\n'

Presumably, this requires a similar fix to #864 when building the toc_tokens.

@waylan waylan added bug Bug report. extension Related to one or more of the included extensions. confirmed Confirmed bug report or approved feature request. labels Dec 12, 2024
@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 12, 2024

So, here's how we render HTML headings within mkdocstrings:

{# just an example #}
{% filter heading(
    heading_level + 1,
    role="module",
    id="*foo*",
    class="doc doc-heading",
    toc_label="*foo*",
  ) %}
  Hello from Foo!
{% endfilter %}

And here's the final output (heading then TOC):

<h2 id="*foo*" class="doc doc-heading">  Hello from Foo!
<a href="#*foo*" class="headerlink" title="Permanent link">¤</a></h2>
<a href="#*foo*" class="md-nav__link">
    <span class="md-ellipsis">
        *foo*   
    </span>
</a>

(TOC styled by Material for MkDocs)

As we can see mkdocstrings doesn't actually suffer from this, since we bypass attr_list probably, or override some parts of toc.

Anyway, glad you were able to reproduce the issue!

Fixing it will at least make it much easier for mkdocs-autorefs to test its upcoming features 👍

@waylan
Copy link
Member

waylan commented Dec 12, 2024

So, it appears this is an issue with attr_lists. If the id is defined outside of an attr_list, then it is fine.

>>> md.convert('## \*Foo\*)')
'<h2 id="foo">*Foo*)</h2>'
>>> md.toc_tokens
[{'level': 2, 'id': 'foo', 'name': '*Foo*)', 'html': '*Foo*)', 'data-toc-label': '', 'children': []}]

In fact, we already have a test for that.

There are 2 ways we could fix this.

  1. Call unescape on the value in attr_list before returning it. (attr_list.py#L173L-183)
  2. Call unescape on the value assigned to the toc_token['id'] (toc.py#L394).

Option 1 fixes the instance issue while option 2 ensures any future issues are addressed as well. However, option 2 would be redundant in the exiting test. My inclination is to go with option 1 and expect each individual extension to properly unescape its own output.

@pawamoy
Copy link
Contributor Author

pawamoy commented Dec 12, 2024

Option 1 sounds reasonable. I don't expect a lot of extensions to be affected by this.

waylan added a commit to waylan/markdown that referenced this issue Dec 12, 2024
@waylan
Copy link
Member

waylan commented Dec 12, 2024

Actually, I couldn't create a test that failed on the attr_list extension alone. So I went with option 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants