HTML-like tags in PDFs should be escaped in Markdown (and HTML!) output #764

dhdaines · 2025-01-16T22:04:48Z

When the text of a PDF contains <things> like <this>, they get passed through unescaped in Markdown output ... and also in HTML output! If they happen to be actual HTML tags, well, they get interpreted as HTML tags, which might not be what you want. If they aren't, well, you get... something.

This can also cause weird issues in some corner cases like the one in the attached document, where <snip> (not an HTML tag) gets split across a line break and thus becomes <s nip>. A browser parses that as an <s> (strikethrough) element with a stray attribute, so the rest of the document is rendered in strikethrough. The attached example is kind of contrived, but I have a real document that does this.

testpdf.pdf

To reproduce, run:

docling testpdf.pdf
docling --to html testpdf.pdf
open testpdf.html

You will see:

I would expect the tags to come through as they do in the original document since it was not HTML... and of course no strikethrough :)
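
For illustration, here is a minimal sketch of the kind of escaping I'd expect to be applied to text extracted from the PDF before it is written out. It uses only Python's standard `html.escape`; the helper name `escape_pdf_text` is hypothetical and is not part of Docling, and I'm not claiming this is where or how the fix should be made.

```python
import html


def escape_pdf_text(text: str) -> str:
    """Hypothetical helper (not a Docling API): escape &, < and > in plain text.

    quote=False leaves single and double quotes alone, since the text is
    assumed to land in element content rather than in an attribute value.
    """
    return html.escape(text, quote=False)


# The problematic case from the report: <snip> split into <s nip>.
sample = "contains <things> like <s nip> here"
print(escape_pdf_text(sample))
```

With the angle brackets turned into `&lt;` and `&gt;`, a browser shows them as literal text instead of opening an `<s>` element. The same entities should also render as literal brackets in most Markdown renderers outside of code spans and code blocks, though that is an assumption about the renderer rather than something I tested here.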

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0

Python version

3.10.12

dhdaines added the bug label Jan 16, 2025
