Switch the output format from "HTML-like" to hOCR

Accordingly, change the granularity of leaf nodes from multiple words to a single word
HazyResearch · Sep 24, 2020 · 54065ad
1 parent 48393a1
Showing 10 changed files with 168 additions and 176 deletions.
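
Since this commit makes each leaf node a single word in hOCR, a consumer can read words and their bounding boxes with a plain HTML parser. The following is a minimal sketch, not pdftotree's own code: it assumes words are emitted as `span` elements with class `ocrx_word` and a `bbox x0 y0 x1 y1` property in the `title` attribute, as the hOCR spec describes. `WordCollector`, `parse_bbox`, `extract_words`, and `SAMPLE` are hypothetical names introduced for illustration, and the sample markup is hand-written rather than real pdftotree output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    # hOCR properties are separated by ';', e.g. "bbox 10 20 45 40; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(float(p) for p in parts[1:])
    return None


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs from word-level hOCR spans.

    Assumption: word spans carry class 'ocrx_word' and contain no nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


# Hand-written sample for illustration only.
SAMPLE = """
<div class='ocr_page'>
  <span class='ocr_line' title='bbox 10 20 90 40'>
    <span class='ocrx_word' title='bbox 10 20 45 40'>Hello</span>
    <span class='ocrx_word' title='bbox 50 20 90 40'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The same function could be applied to the string printed by `pdftotree` or to a file written with `-o`, provided the output follows the class and `title` conventions assumed here.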
diff --git a/.travis.yml b/.travis.yml
@@ -10,6 +10,9 @@ notifications:
 
 install:
   - sudo apt-get -qq update
+  - sudo apt-get install libmagickwand-dev ghostscript  # required by Wand
+  - sudo rm -rf /etc/ImageMagick-6/policy.xml  # HazyResearch/fonduer#170
+  - pip install -U pip
   - make dev
   - pip install coveralls
 
@@ -21,7 +24,6 @@ before_script:
 
 script:
   - coverage run --source=pdftotree -m pytest tests -v -rsXx
-  - python setup.py -q install
   - pdftotree tests/input/112823.pdf
 
 after_success:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 - [@HiromuHota][HiromuHota]: Fix a bug that an html file is not created at a given path.
   ([#64](https://github.com/HazyResearch/pdftotree/pull/64))
-
+- [@HiromuHota][HiromuHota]: Switch the output format from "HTML-like" to hOCR.
+  ([#62](https://github.com/HazyResearch/pdftotree/pull/62))
 
 ## 0.4.1 - 2020-09-21
 

diff --git a/Makefile b/Makefile
@@ -2,7 +2,7 @@ TESTDATA=tests/input
 
 dev: 
 	pip install -r requirements-dev.txt
-	pip install -e .
+	pip install -e . --use-feature=2020-resolver
 	pre-commit install
 
 test: $(TESTDATA)/paleo_visual_model.h5 dev check

diff --git a/README.rst b/README.rst
@@ -3,7 +3,7 @@ pdftotree
 
 |License| |Stars| |PyPI| |Version| |Issues| |Travis| |Coveralls| |CodeStyle|
 
-**WARNING**: ``pdftotree`` *is experimental code and is NOT stable or maintained. It is not integrated with or supported by Fonduer.*
+**WARNING**: ``pdftotree`` *is experimental code and is NOT stable. It is not integrated with or supported by Fonduer.*
 
 Fonduer_ performs knowledge base construction from richly formatted data such
 as tables. A crucial step in this process is the construction of the
@@ -16,8 +16,10 @@ This package is the result of building our own module as replacement to Adobe
 Acrobat. Several open source tools are available for pdf to html conversion but
 these tools do not preserve the cell structure in a table. Our goal in this
 project is to develop a tool that extracts text, figures and tables in a pdf
-document and maintains the structure of the document using a tree data
-structure.
+document and returns them in an easily consumable format.
+
+Up to v0.4.1, pdftotree's output was formatted in its own "HTML-like" format.
+From v0.5.0, it conforms to hOCR_, an open-standard format for OCR results.
 
 Dependencies
 ------------
@@ -49,19 +51,14 @@ pdftotree
 ~~~~~~~~~
 
 This is the primary command-line utility provided with this Python package.
-This takes a PDF file as input, and produces an HTML-like representation of the
-data::
+This takes a PDF file as input and produces an hOCR file as output::
 
     usage: pdftotree [options] pdf_file
 
-    Script to extract tree structure from PDF files. Takes a PDF as input and
-    outputs an HTML-like representation of the document's structure. By default,
-    this conversion is done using heuristics. However, a model can be provided as
-    a parameter to use a machine-learning-based approach.
+    Convert PDF into hOCR.
 
     positional arguments:
-      pdf_file              PDF file name for which tree structure needs to be
-                            extracted
+      pdf_file              Path to input PDF file.
 
     optional arguments:
       -h, --help            show this help message and exit
@@ -71,12 +68,12 @@ data::
       -m MODEL_PATH, --model_path MODEL_PATH
                             Pretrained model, generated by extract_tables tool
       -o OUTPUT, --output OUTPUT
-                            Path where tree structure should be saved. If none,
-                            HTML is printed to stdout.
+                            Path to output hOCR file. If not given, it will be
+                            printed to stdout.
       -f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
                             Whether figures must be favored over other parts such
                             as tables and section headers
-      -V, --visualize       Whether to output visualization images for the tree
+      -V, --visualize       Whether to output visualization images
       -d, --dry-run         Run pdftotree, but do not save any output or print to
                             console.
       -v, --verbose         Output INFO level logging.
@@ -207,3 +204,4 @@ Then you can run our tests::
 .. _version file: https://github.com/HazyResearch/pdftotree/blob/master/pdftotree/_version.py
 .. _editable mode: https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode
 .. _flake8: http://flake8.pycqa.org/en/latest/
+.. _hOCR: http://kba.cloud/hocr-spec/1.2/
diff --git a/bin/pdftotree b/bin/pdftotree
@@ -1,5 +1,5 @@
 #!/usr/bin/env python
-"""Simple commandline interface for parsing PDF to HTML."""
+"""Simple commandline interface for parsing PDF to hOCR."""
 import argparse
 import logging
 import os
@@ -9,10 +9,7 @@ import pdftotree
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
         description="""
-        Script to extract tree structure from PDF files. Takes a PDF as input
-        and outputs an HTML-like representation of the document's structure. By
-        default, this conversion is done using heuristics. However, a model can
-        be provided as a parameter to use a machine-learning-based approach.
+        Convert PDF into hOCR.
         """,
         usage="%(prog)s [options] pdf_file",
     )
@@ -34,19 +31,22 @@ if __name__ == "__main__":
     parser.add_argument(
         "pdf_file",
         type=str,
-        help="PDF file name for which tree structure needs to be extracted",
+        help="Path to input PDF file.",
     )
     parser.add_argument(
         "-o",
         "--output",
         type=str,
-        help="Path where tree structure should be saved. If none, HTML is printed to stdout.",
+        help="Path to output hOCR file. If not given, it will be printed to stdout.",
     )
     parser.add_argument(
         "-f",
         "--favor_figures",
         type=str,
-        help="Whether figures must be favored over other parts such as tables and section headers",
+        help="""
+        Whether figures must be favored over other parts such as tables and section
+        headers
+        """,
         default="True",
     )
     parser.add_argument(
@@ -103,7 +103,7 @@ if __name__ == "__main__":
     log.addHandler(ch)
 
     if args.dry_run:
-        print("This is just a dry run. No HTML will be output.")
+        print("This is just a dry run. No hOCR will be output.")
         args.output = None
 
     # Call the main routine
@@ -120,4 +120,4 @@ if __name__ == "__main__":
         if not args.dry_run:
             print(result)
     else:
-        print("HTML output to {}".format(args.output))
+        print("hOCR output to {}".format(args.output))