archive-hocr-tools

This repository contains a Python package for efficient hOCR parsing, along with a set of tools for operating on and analysing hOCR files.

hocr-combine-stream: A tool to combine many hOCR files into one large hOCR file while keeping memory usage low. Used internally to combine Tesseract per-page results into a single hOCR file for an entire book.
hocr-pagenumbers: A tool to find page numbers in multi-page hOCR documents.
hocr-fold-chars: A tool to transform a per-character hOCR file into a per-word hOCR file.
pdf-to-hocr: A tool to extract the text content embedded in a PDF as hOCR.
See the ./bin directory for more tools; not all of them have been documented yet.

The Python library is called hocr.
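
This README does not document the library's API, so the sketch below deliberately avoids it and uses only the Python standard library. It illustrates the general idea behind low-memory hOCR processing: read a multi-page file one page at a time and discard each page once it has been handled. The function name iter_pages and the sample document are hypothetical, and the sketch assumes well-formed XHTML hOCR with ocr_page and ocrx_word elements; it is not how the tools in this repository are implemented.

```python
import io
import xml.etree.ElementTree as ET

# hOCR is XHTML, so element tags carry the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"


def iter_pages(source):
    """Yield (page_id, words) for each ocr_page element, one page at a time.

    `source` is a filename or a binary file object. This is an illustrative
    helper, not part of the hocr package.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == XHTML + "div" and elem.get("class") == "ocr_page":
            words = [
                "".join(span.itertext()).strip()
                for span in elem.iter(XHTML + "span")
                if span.get("class") == "ocrx_word"
            ]
            yield elem.get("id"), words
            # Drop the finished page's subtree so memory use does not grow
            # with the number of pages.
            elem.clear()


# A tiny, made-up two-page hOCR document for demonstration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_000000" title="bbox 0 0 100 100">
   <span class="ocrx_word" title="bbox 1 1 20 10">Hello</span>
   <span class="ocrx_word" title="bbox 25 1 50 10">world</span>
  </div>
  <div class="ocr_page" id="page_000001" title="bbox 0 0 100 100">
   <span class="ocrx_word" title="bbox 1 1 20 10">Again</span>
  </div>
 </body>
</html>"""

if __name__ == "__main__":
    for page_id, words in iter_pages(io.BytesIO(SAMPLE)):
        print(page_id, words)
```

Because the generator clears each page element after yielding it, only one page's worth of elements is held in memory at any time, which is the property that matters when processing the hOCR for an entire book.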

