Add support for pdf2xml format #57

stweil · 2017-07-29T08:42:36Z

The program pdftohtml (part of Debian package poppler-utils) can extract images and text from a PDF file and write an XML file with the extracted information:

pdftohtml -c -hidden -xml INPUT_PDF OUTPUT_XML

Libraries like pdftabextract use that XML file for further processing.
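
For illustration, here is a minimal sketch of reading the text boxes out of such an XML file with Python's standard library. The sample document and the attribute names (`top`, `left`, `width`, `height` on `text` elements inside `page` elements) are assumptions based on typical pdftohtml output, and `read_text_boxes` is a hypothetical helper, not part of pdftabextract.

```python
import xml.etree.ElementTree as ET

# Assumed shape of pdftohtml -xml output; real files may carry more
# elements (for example fontspec or image) and a DOCTYPE declaration.
SAMPLE = """<pdf2xml producer="poppler">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="80" width="45" height="14" font="0">Hello</text>
<text top="100" left="130" width="50" height="14" font="0"><b>world</b></text>
</page>
</pdf2xml>"""


def read_text_boxes(xml_string):
    """Return one dict per <text> element with its page, position, size and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(text.get("left")),
                "top": int(text.get("top")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
                # itertext() also collects text nested in inline markup such as <b>
                "text": "".join(text.itertext()),
            })
    return boxes


boxes = read_text_boxes(SAMPLE)
```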

By transforming ALTO or hOCR XML directly to the pdf2xml XML format, the time-consuming step of generating an intermediate PDF could be avoided.
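
A rough sketch of what such a direct transformation might emit, assuming the word boxes have already been parsed from ALTO or hOCR into `(text, x0, y0, x1, y1)` tuples. The function name `words_to_pdf2xml`, the input structure, and the `producer` value are hypothetical, and the element and attribute names are an assumption about the pdf2xml layout rather than a confirmed specification.

```python
import xml.etree.ElementTree as ET


def words_to_pdf2xml(pages):
    """Build a pdf2xml-like document from pre-parsed word boxes.

    pages: list of dicts with 'width', 'height' and 'words', where 'words'
    is a list of (text, x0, y0, x1, y1) tuples in pixel coordinates.
    """
    root = ET.Element("pdf2xml", producer="hypothetical-converter")
    for number, page in enumerate(pages, start=1):
        page_el = ET.SubElement(
            root, "page",
            number=str(number), position="absolute", top="0", left="0",
            height=str(page["height"]), width=str(page["width"]),
        )
        for text, x0, y0, x1, y1 in page["words"]:
            # Bounding boxes come as corner coordinates, while the target
            # layout is assumed to store top/left plus width/height.
            text_el = ET.SubElement(
                page_el, "text",
                top=str(y0), left=str(x0),
                width=str(x1 - x0), height=str(y1 - y0), font="0",
            )
            text_el.text = text
    return ET.tostring(root, encoding="unicode")


example_pages = [
    {"width": 892, "height": 1263,
     "words": [("Hello", 80, 100, 125, 114), ("world", 130, 100, 180, 114)]},
]
xml_out = words_to_pdf2xml(example_pages)
```

Font information is reduced to a single placeholder `font="0"` here; a real converter would probably need to emit matching font declarations as well.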

stweil added the enhancement label Jul 29, 2017

stweil mentioned this issue Aug 1, 2017

Extract HOCR from searchable PDF ocropus/hocr-tools#117

Open

zuphilip added the format label and removed the enhancement label Dec 30, 2019
