commonmark/gfm: -raw_html turns tables with <h3> tag into [TABLE]. #10407

Closed
Atemu opened this issue Nov 23, 2024 · 5 comments

Atemu commented Nov 23, 2024

Explain the problem.
Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.

I tried to convert an HTML blog post to markdown, as I usually do for Lemmy posts, but this time the table turned into [TABLE].

I narrowed it down to a rather minimal reproducer:

https://pandoc.org/try/?params=%7B%22text%22%3A%22%3Ctable%3E%5Cn%3Ctbody%3E%5Cn%3Ctr%3E%5Cn%3Ctd%3Efoo%3C%2Ftd%3E%5Cn%3Ctd%3E%5Cn%3Ch3%3Eanything%3C%2Fh3%3E%5Cn%3C%2Ftd%3E%5Cn%3Ctd%3Ebar%3C%2Ftd%3E%5Cn%3Ctd%3Ebaz%3C%2Ftd%3E%5Cn%3C%2Ftr%3E%5Cn%3C%2Ftbody%3E%5Cn%3C%2Ftable%3E%5Cn%22%2C%22to%22%3A%22gfm-raw_html%22%2C%22from%22%3A%22html%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D

The cause is the <h3> tag. If you remove it, it works as expected.

markdown-raw_html also converts it fine. markdown_strict-raw_html does not.
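
To make this comparison easy to rerun, here is a minimal sketch in Python. The HTML is the input decoded from the `text` parameter of the pandoc.org/try link; the helper names (`pandoc_args`, `convert`) are illustrative, and the script assumes a `pandoc` executable on `PATH`. No output is shown because it depends on the installed pandoc version.

```python
import shutil
import subprocess

# Reproducer input, decoded from the "text" parameter of the try link.
REPRODUCER = """<table>
<tbody>
<tr>
<td>foo</td>
<td>
<h3>anything</h3>
</td>
<td>bar</td>
<td>baz</td>
</tr>
</tbody>
</table>
"""

# Output formats compared in this report, each with raw_html disabled.
FORMATS = ["gfm-raw_html", "markdown-raw_html", "markdown_strict-raw_html"]


def pandoc_args(to_format):
    """Build the pandoc command line for one output format."""
    return ["pandoc", "--from", "html", "--to", to_format]


def convert(html, to_format):
    """Run pandoc on `html` and return its stdout (requires pandoc on PATH)."""
    result = subprocess.run(
        pandoc_args(to_format),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("pandoc") is None:
        print("pandoc not found on PATH; nothing to run")
    else:
        for fmt in FORMATS:
            print(f"--- {fmt} ---")
            print(convert(REPRODUCER, fmt))
```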

That shouldn't happen, but what I found even more confusing is that pandoc didn't print a warning while discarding a large amount of textual content. Loss of layout information is expected when converting between formats, of course, but the content should never be changed or removed without a warning.

Pandoc version?
What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue. Note that many linux distributions have old versions of pandoc in their repositories.)

It's pandoc 3.1.11.1 via Nix (via stackage LTS), but it also reproduces on https://pandoc.org/try/ (3.5) if you hack in -raw_html.

Atemu added the bug label Nov 23, 2024
jgm (Owner) commented Nov 23, 2024

The reason for this is that the markdown table syntax available in gfm/commonmark doesn't allow block-level elements like headings in cells.

Normally we'd fall back to raw HTML for such a table, but you told pandoc not to use raw HTML, so we fall back to [TABLE].
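
The fallback chain described in this comment can be summarized in a small sketch. This is an illustrative Python model, not pandoc's actual (Haskell) implementation; the function name and its two boolean parameters are hypothetical, and real writers consider more than these two conditions.

```python
def table_fallback(cells_have_blocks: bool, raw_html_enabled: bool) -> str:
    """Model the writer's choice for a table, as described in this comment.

    Hypothetical simplification: only the two conditions named in the
    thread are considered.
    """
    if not cells_have_blocks:
        # Every cell holds only inline content, so a pipe table can express it.
        return "pipe table"
    if raw_html_enabled:
        # Block-level content (e.g. a heading) in a cell: emit a raw HTML table.
        return "raw HTML table"
    # No table syntax fits and raw HTML is disabled: emit the placeholder.
    return "[TABLE]"


# The reproducer: an <h3> inside a cell, converted with gfm-raw_html.
print(table_fallback(cells_have_blocks=True, raw_html_enabled=False))  # [TABLE]
```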

jgm commented Nov 23, 2024

I will add some code that emits a warning about the non-rendered block (though this will only appear with --verbose).

jgm closed this as completed Nov 23, 2024
jgm added a commit that referenced this issue Nov 23, 2024
...e.g., when `raw_html` is disabled and the table can't be
fit into a supported markdown table format.

Closes #10407.

Atemu (Author) commented Nov 24, 2024

Thanks!

Would it be possible to make it fall back to regular tables and strip the unsupported heading formatting while keeping the content?

As mentioned, you can't expect pandoc to keep all layout properties but it should at least retain the content of the heading as plain text or perhaps even distinguish it via e.g. emboldening.

jgm commented Nov 24, 2024

In some cases it might be possible to do something like that, but in general it's not going to be possible to change arbitrary block-level content to inline content without major mutilation. Better to indicate a problem and let people fix the source.

Atemu commented Nov 24, 2024

What do you mean by "major mutilation"? W.r.t. textual content or formatting? As mentioned, the latter would be fine IMV.

Fixing the source isn't always easy to do. My use-case is converting website blog post contents into pure markdown. Going through auto-generated HTML for tables and such and then manually stripping all tags that cause trouble for pandoc isn't the easiest thing in the world, especially when those tables grow larger.

I have absolutely no insight into how pandoc works internally, so I don't know how feasible this would be. But couldn't table cell contents be their own format, a subset of markdown? You'd then treat each cell's contents as its own document and use the regular conversion machinery to map each formatting construct to its closest analog. Headings wouldn't exist in this format, so any heading would be mapped to, e.g., an emboldened paragraph.

This is of course a lossy operation w.r.t. formatting but as long as it gets the intention across, that's good enough honestly. IIUC this is how the plain format works as it doesn't have any formal formatting features. It generates text that resembles the desired format instead and the output is clear enough there IMHO.
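
A rough sketch of this proposal, in Python: treat a cell as a list of simple blocks and flatten them to inline markdown, mapping headings to bold text. The `(kind, text)` block model and both function names are hypothetical and far simpler than pandoc's document model. As noted earlier in the thread, arbitrary block content (lists, code blocks, nested tables) does not flatten this cleanly, so the sketch rejects anything it cannot map instead of dropping it silently.

```python
def flatten_cell(blocks):
    """Flatten a list of (kind, text) blocks into one line of inline markdown.

    Hypothetical model: kind is "heading" or "para". Anything else is
    rejected rather than silently dropped.
    """
    parts = []
    for kind, text in blocks:
        # A literal pipe would end the cell early, so escape it.
        text = text.replace("|", "\\|")
        if kind == "heading":
            parts.append(f"**{text}**")  # closest inline analog of a heading
        elif kind == "para":
            parts.append(text)
        else:
            raise ValueError(f"cannot flatten block of kind {kind!r}")
    return " ".join(parts)


def pipe_row(cells):
    """Render one pipe-table row from a list of cells (each a list of blocks)."""
    return "| " + " | ".join(flatten_cell(cell) for cell in cells) + " |"


# The row from the reproducer: foo, <h3>anything</h3>, bar, baz.
row = [
    [("para", "foo")],
    [("heading", "anything")],
    [("para", "bar")],
    [("para", "baz")],
]
print(pipe_row(row))  # | foo | **anything** | bar | baz |
```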
