commonmark/gfm: `-raw_html` turns tables with `<h3>` tag into `[TABLE]` #10407
Comments
The reason for this is that the markdown table syntax available in gfm/commonmark doesn't allow block-level elements like headings in cells. So, normally we'd fall back to HTML for this, but you told it not to use raw HTML. So we fall back to |
I will add some code that emits a warning about the non-rendered block (though this will only appear with |
...e.g., when `raw_html` is disabled and the table can't be fit into a supported markdown table format. Closes #10407.
Thanks! Would it be possible to make it fall back to regular tables and strip the unsupported heading formatting while keeping the content? As mentioned, you can't expect pandoc to keep all layout properties, but it should at least retain the content of the heading as plain text, or perhaps even distinguish it, e.g. by emboldening it.
In some cases it might be possible to do something like that, but in general it's not going to be possible to change arbitrary block-level content to inline content without major mutilation. Better to indicate a problem and let people fix the source.
What do you mean by "major mutilation"? W.r.t. textual content or formatting? As mentioned, the latter would be fine IMV.
Fixing the source isn't always easy to do. My use-case is converting website blog post contents into pure markdown. Going through auto-generated HTML for tables and such and then manually stripping all tags that cause trouble for pandoc isn't the easiest thing in the world, especially when those tables grow larger.
I have absolutely no insight into how pandoc works internally, so I don't know how feasible this would be, but couldn't table cell contents be their own format that is a subset of markdown? You'd then work with table contents as if they were their own documents, using the regular conversion machinery to map the formatting constructs into the closest analog. Headings wouldn't exist in this format, so any heading would be mapped to e.g. an emboldened paragraph.
This is of course a lossy operation w.r.t. formatting, but as long as it gets the intention across, that's good enough honestly. IIUC this is how the plain format works, as it doesn't have any formal formatting features. It generates text that resembles the desired format instead, and the output is clear enough there IMHO.
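As a rough illustration of the "map headings in cells to emboldened text" idea, here is a minimal Python sketch of a pre-processing step applied to the HTML before it is handed to pandoc. It is not part of pandoc and not how pandoc works internally; the names `CellHeadingFlattener` and `flatten_cell_headings` are hypothetical, and the sketch only handles `<h1>` to `<h6>` inside `<td>`/`<th>` cells. Other block-level content in cells is left untouched and may still trigger the fallback.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h[1-6]")


class CellHeadingFlattener(HTMLParser):
    """Hypothetical helper: rewrites headings inside table cells to <strong>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.cell_depth = 0  # > 0 while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cell_depth += 1
        if self.cell_depth and HEADING.fullmatch(tag):
            self.out.append("<strong>")
        else:
            # Re-emit the tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.cell_depth and HEADING.fullmatch(tag):
            self.out.append("</strong>")
        else:
            self.out.append(f"</{tag}>")
        if tag in ("td", "th"):
            self.cell_depth = max(0, self.cell_depth - 1)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def flatten_cell_headings(html: str) -> str:
    """Return html with headings inside table cells replaced by <strong>."""
    parser = CellHeadingFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(flatten_cell_headings("<table><tr><td><h3>anything</h3></td></tr></table>"))
```

The result contains only inline markup in the affected cells, so it should no longer hit the block-level restriction described above. Headings outside tables are deliberately left alone.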
Explain the problem.
Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.
I tried to convert an HTML blog post to markdown as I usually do for Lemmy posts, but this time the table turned into `[TABLE]`.
I narrowed it down to a rather minimal reproducer:
https://pandoc.org/try/?params=%7B%22text%22%3A%22%3Ctable%3E%5Cn%3Ctbody%3E%5Cn%3Ctr%3E%5Cn%3Ctd%3Efoo%3C%2Ftd%3E%5Cn%3Ctd%3E%5Cn%3Ch3%3Eanything%3C%2Fh3%3E%5Cn%3C%2Ftd%3E%5Cn%3Ctd%3Ebar%3C%2Ftd%3E%5Cn%3Ctd%3Ebaz%3C%2Ftd%3E%5Cn%3C%2Ftr%3E%5Cn%3C%2Ftbody%3E%5Cn%3C%2Ftable%3E%5Cn%22%2C%22to%22%3A%22gfm-raw_html%22%2C%22from%22%3A%22html%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D
The cause is the `<h3>` tag. If you remove it, it works as expected. `markdown-raw_html` also converts it fine. `markdown_strict-raw_html` does not.
That shouldn't happen, but what I found even more confusing is that pandoc didn't even print a warning while discarding a large amount of textual content. Loss of layout information is expected when converting between different formats, of course, but the content should never change or be removed without warning.
Pandoc version?
What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue. Note that many linux distributions have old versions of pandoc in their repositories.)
It's `pandoc 3.1.11.1` via Nix (via stackage LTS) but reproduces on https://pandoc.org/try/ (3.5) if you hack in `-raw_html`.