Switch from "HTML-like" to hOCR, an open-standard for OCR results #62

Merged 21 commits on Sep 24, 2020
Changes from 19 commits
2 changes: 2 additions & 0 deletions .travis.yml
@@ -10,6 +10,8 @@ notifications:

install:
- sudo apt-get -qq update
- sudo apt-get install libmagickwand-dev ghostscript # required by Wand
- sudo rm -rf /etc/ImageMagick-6/policy.xml # HazyResearch/fonduer#170
- make dev
- pip install coveralls

3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed
- [@HiromuHota][HiromuHota]: Fix a bug that an html file is not created at a given path.
([#64](https://github.com/HazyResearch/pdftotree/pull/64))

- [@HiromuHota][HiromuHota]: Switch the output format from "HTML-like" to hOCR.
([#62](https://github.com/HazyResearch/pdftotree/pull/62))

## 0.4.1 - 2020-09-21

26 changes: 12 additions & 14 deletions README.rst
@@ -3,7 +3,7 @@ pdftotree

|License| |Stars| |PyPI| |Version| |Issues| |Travis| |Coveralls| |CodeStyle|

**WARNING**: ``pdftotree`` *is experimental code and is NOT stable or maintained. It is not integrated with or supported by Fonduer.*
**WARNING**: ``pdftotree`` *is experimental code and is NOT stable. It is not integrated with or supported by Fonduer.*

Fonduer_ performs knowledge base construction from richly formatted data such
as tables. A crucial step in this process is the construction of the
@@ -16,8 +16,10 @@ This package is the result of building our own module as replacement to Adobe
Acrobat. Several open source tools are available for pdf to html conversion but
these tools do not preserve the cell structure in a table. Our goal in this
project is to develop a tool that extracts text, figures and tables in a pdf
document and maintains the structure of the document using a tree data
structure.
document and returns them in an easily consumable format.

Up to v0.4.1, pdftotree's output was formatted in its own "HTML-like" format.
From v0.5.0, it conforms to hOCR_, an open-standard format for OCR results.

Dependencies
------------
@@ -49,19 +51,14 @@ pdftotree
~~~~~~~~~

This is the primary command-line utility provided with this Python package.
This takes a PDF file as input, and produces an HTML-like representation of the
data::
This takes a PDF file as input and produces an hOCR file as output::

usage: pdftotree [options] pdf_file

Script to extract tree structure from PDF files. Takes a PDF as input and
outputs an HTML-like representation of the document's structure. By default,
this conversion is done using heuristics. However, a model can be provided as
a parameter to use a machine-learning-based approach.
Convert PDF into hOCR.

positional arguments:
pdf_file PDF file name for which tree structure needs to be
extracted
pdf_file Path to input PDF file.

optional arguments:
-h, --help show this help message and exit
@@ -71,12 +68,12 @@ data::
-m MODEL_PATH, --model_path MODEL_PATH
Pretrained model, generated by extract_tables tool
-o OUTPUT, --output OUTPUT
Path where tree structure should be saved. If none,
HTML is printed to stdout.
Path to output hOCR file. If not given, it will be
printed to stdout.
-f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
Whether figures must be favored over other parts such
as tables and section headers
-V, --visualize Whether to output visualization images for the tree
-V, --visualize Whether to output visualization images
-d, --dry-run Run pdftotree, but do not save any output or print to
console.
-v, --verbose Output INFO level logging.
@@ -207,3 +204,4 @@ Then you can run our tests::
.. _version file: https://github.com/HazyResearch/pdftotree/blob/master/pdftotree/_version.py
.. _editable mode: https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode
.. _flake8: http://flake8.pycqa.org/en/latest/
.. _hOCR: http://kba.cloud/hocr-spec/1.2/
20 changes: 10 additions & 10 deletions bin/pdftotree
@@ -1,5 +1,5 @@
#!/usr/bin/env python
"""Simple commandline interface for parsing PDF to HTML."""
"""Simple commandline interface for parsing PDF to hOCR."""
import argparse
import logging
import os
@@ -9,10 +9,7 @@ import pdftotree
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="""
Script to extract tree structure from PDF files. Takes a PDF as input
and outputs an HTML-like representation of the document's structure. By
default, this conversion is done using heuristics. However, a model can
be provided as a parameter to use a machine-learning-based approach.
Convert PDF into hOCR.
""",
usage="%(prog)s [options] pdf_file",
)
@@ -34,19 +31,22 @@ if __name__ == "__main__":
parser.add_argument(
"pdf_file",
type=str,
help="PDF file name for which tree structure needs to be extracted",
help="Path to input PDF file.",
)
parser.add_argument(
"-o",
"--output",
type=str,
help="Path where tree structure should be saved. If none, HTML is printed to stdout.",
help="Path to output hOCR file. If not given, it will be printed to stdout.",
)
parser.add_argument(
"-f",
"--favor_figures",
type=str,
help="Whether figures must be favored over other parts such as tables and section headers",
help="""
Whether figures must be favored over other parts such as tables and section
headers
""",
default="True",
)
parser.add_argument(
@@ -103,7 +103,7 @@ if __name__ == "__main__":
log.addHandler(ch)

if args.dry_run:
print("This is just a dry run. No HTML will be output.")
print("This is just a dry run. No hOCR will be output.")
args.output = None

# Call the main routine
@@ -120,4 +120,4 @@ if __name__ == "__main__":
if not args.dry_run:
print(result)
else:
print("HTML output to {}".format(args.output))
print("hOCR output to {}".format(args.output))
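
Note for consumers of the new output: the README change in this PR says that from v0.5.0 the output conforms to hOCR. hOCR is HTML in which elements carry a class such as `ocr_page`, `ocr_line`, or `ocrx_word`, and a `title` attribute holding properties such as `bbox x0 y0 x1 y1`. The following is a minimal sketch of reading word boxes from such a file using only the Python standard library. The `SAMPLE_HOCR` fragment, `parse_bbox`, `WordCollector`, and `extract_words` are hypothetical names made up for illustration; the fragment is not actual pdftotree output, and which hOCR classes pdftotree emits is not shown in this diff.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment for illustration only; NOT actual pdftotree output.
SAMPLE_HOCR = """
<html>
  <body>
    <div class="ocr_page" title="bbox 0 0 612 792">
      <span class="ocr_line" title="bbox 72 90 300 104">
        <span class="ocrx_word" title="bbox 72 90 110 104">Hello</span>
        <span class="ocrx_word" title="bbox 116 90 160 104">hOCR</span>
      </span>
    </div>
  </body>
</html>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None.

    Properties in a title are separated by ';', for example
    "bbox 0 0 612 792; ppageno 0". Integer coordinates are assumed.
    """
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs for elements with class ocrx_word.

    Assumes word elements contain plain text and no nested elements.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # The first end tag seen while inside a word closes that word.
        if self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = WordCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


words = extract_words(SAMPLE_HOCR)
for text, bbox in words:
    print(text, bbox)
```

To try this on real output, replace `SAMPLE_HOCR` with the contents of the file written via `-o`, and adjust the class name if the elements of interest are lines or pages rather than words.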