When the text of a PDF contains <things> like <this>, they get passed through unescaped in the Markdown output ... and also in the HTML output! If they happen to be actual HTML tags, well, you get the actual HTML tags, which might not be what you want. If not, well, you get... something.
This can also cause weird issues in corner cases like the one in the attached document, where <snip> (not an HTML tag) gets split across a line break and thus becomes <s nip>, causing the rest of the document to be rendered in strikethrough. (The example here is kind of contrived, but I have a real document that does this.)
testpdf.pdf
To reproduce, run:
You will see:
I would expect the tags to come through as they do in the original document, since it was not HTML... and of course no strikethrough :)
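To make the expected behavior concrete, here is a minimal sketch of the kind of escaping I have in mind. It is not Docling's API: `escape_pdf_text` is a hypothetical helper, and it assumes the extracted text is available as a plain Python string before it is written to Markdown or HTML. It only uses `html.escape` from the standard library.

```python
import html


def escape_pdf_text(text: str) -> str:
    """Hypothetical helper (not part of Docling): escape &, < and >
    so that text extracted from a PDF is shown literally instead of
    being interpreted as HTML tags."""
    # quote=False leaves " and ' alone; only &, < and > are replaced.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    # "<snip>" broken across a line becomes "<s nip>", which a renderer
    # would otherwise treat as an opening strikethrough tag.
    sample = "before <s nip> after"
    print(escape_pdf_text(sample))
```

With this, `<s nip>` is emitted as `&lt;s nip&gt;`, which both HTML and Markdown renderers display as the literal text rather than as an opening `<s>` tag.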
Docling version
Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0
Python version
3.10.12