HTML-like tags in PDFs should be escaped in Markdown (and HTML!) output #764

Open
dhdaines opened this issue Jan 16, 2025 · 0 comments
Labels
bug (Something isn't working)

Comments

@dhdaines
When the text of a PDF contains <things> like <this>, they get passed through unescaped in the Markdown output ... and also in the HTML output! If they happen to be actual HTML tags, well, you get the actual HTML tags, which might not be what you want. If they are not, well, you get... something.

This can also cause weird issues in some corner cases like the one in the attached document, where <snip> (not an HTML tag) gets split across a line break (here it's kind of contrived, but I have a real document that does this) and thus becomes <s nip>. That is parsed as an opening <s> tag with a stray attribute, and since it is never closed, the rest of the document is rendered in strikethrough.
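
To illustrate why the split matters, here is a minimal sketch using only Python's standard-library `html.parser`. It is not Docling code; `TagCollector` and `start_tags` are hypothetical names made up for this example. It shows how an HTML parser sees the intact and the split versions of the text:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag the parser recognizes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def start_tags(fragment):
    """Return the (tag, attrs) pairs an HTML parser finds in `fragment`."""
    collector = TagCollector()
    collector.feed(fragment)
    collector.close()
    return collector.tags


# Intact: an unknown element named "snip", which browsers render as nothing special.
print(start_tags("text <snip> more text"))   # [('snip', [])]

# Split across a line break: now it is the real <s> element with an attribute "nip".
print(start_tags("text <s nip> more text"))  # [('s', [('nip', None)])]
```

An unknown tag such as `<snip>` is mostly harmless in a browser, whereas `<s>` is a real element, which is why the split version changes how everything after it is displayed.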

testpdf.pdf

To reproduce, run:

docling testpdf.pdf
docling --to html testpdf.pdf
open testpdf.html

You will see:

[Image: screenshot attached to the issue]

I would expect the tags to come through as they appear in the original document, since it was not HTML... and of course no strikethrough :)
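
One way to get that behaviour would be to escape the extracted text before writing it out. This is a minimal sketch, assuming the exporter has access to the plain text of each item. `escape_pdf_text` is a hypothetical helper, not part of Docling's API, and it relies only on the standard-library `html.escape`:

```python
import html


def escape_pdf_text(text: str) -> str:
    """Escape &, < and > so PDF text cannot be read as markup.

    Hypothetical helper for illustration; not an existing Docling function.
    quote=False leaves quotation marks alone, which is enough for text
    placed in element content (as opposed to attribute values).
    """
    return html.escape(text, quote=False)


print(escape_pdf_text("see <snip> and <s nip>"))
# see &lt;snip&gt; and &lt;s nip&gt;
```

The same entities should also work for Markdown output, since CommonMark renders entity references such as `&lt;` as literal characters outside of code spans and code blocks. Text that ends up inside a code span or code block would have to be left unescaped.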

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0

Python version

3.10.12

dhdaines added the bug label on Jan 16, 2025