The program pdftohtml (part of Debian package poppler-utils) can extract images and text from a PDF file and write an XML file with the extracted information:
pdftohtml -c -hidden -xml INPUT_PDF OUTPUT_XML
Libraries like pdftabextract use that XML file for further processing.
Transforming ALTO or hOCR XML directly to the pdf2xml XML format would avoid the time-consuming step of generating an intermediate PDF.
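To make the idea concrete, here is a minimal Python sketch of such a direct transformation for hOCR input. It is only an illustration under several assumptions, not a proposed implementation: the function name `hocr_to_pdf2xml` is hypothetical, the hOCR is assumed to be well-formed XHTML with `ocr_page` and `ocrx_word` elements carrying `bbox` values in their `title` attributes, and the output element and attribute names (`pdf2xml`, `page`, `text`, `top`, `left`, `width`, `height`, `font`) follow what pdftohtml typically writes. `fontspec` elements and any rescaling from OCR pixels to PDF units are left out.

```python
import re
import xml.etree.ElementTree as ET

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 20 110 50; x_wconf 95"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def _bbox(title):
    """Return (x0, y0, x1, y1) parsed from an hOCR title attribute, or None."""
    match = BBOX.search(title or "")
    return tuple(int(v) for v in match.groups()) if match else None


def hocr_to_pdf2xml(hocr_string):
    """Hypothetical helper: convert an hOCR string to a pdf2xml-like XML string.

    Assumes well-formed XHTML input. Coordinates are copied as-is (pixels),
    and font="0" is a placeholder because no fontspec elements are written.
    """
    root = ET.fromstring(hocr_string)
    out = ET.Element("pdf2xml")
    page_no = 0
    for el in root.iter():
        if el.get("class") != "ocr_page":
            continue
        page_box = _bbox(el.get("title"))
        if page_box is None:
            continue
        page_no += 1
        x0, y0, x1, y1 = page_box
        page = ET.SubElement(
            out, "page",
            number=str(page_no), position="absolute", top="0", left="0",
            height=str(y1 - y0), width=str(x1 - x0),
        )
        # Emit one <text> element per recognized word on this page.
        for word in el.iter():
            if word.get("class") != "ocrx_word":
                continue
            word_box = _bbox(word.get("title"))
            content = "".join(word.itertext()).strip()
            if word_box is None or not content:
                continue
            wx0, wy0, wx1, wy1 = word_box
            text = ET.SubElement(
                page, "text",
                top=str(wy0), left=str(wx0),
                width=str(wx1 - wx0), height=str(wy1 - wy0), font="0",
            )
            text.text = content
    return ET.tostring(out, encoding="unicode")
```

Whether downstream tools such as pdftabextract accept output without `fontspec` entries, or expect line-level rather than word-level text boxes, would need to be checked against real pdftohtml output.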
zuphilip added the "format" label (Suggestions for new ocr formats to be included) and removed the "enhancement" label (Any enhancement on the software itself (excluding new transformations)) on Dec 30, 2019.