This repository has been archived by the owner on May 7, 2024. It is now read-only.

KonnexionsGmbH/dcr-core

DCR-CORE - Document Content Recognition API - README

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project aims to use various software techniques to automatically detect the structure in arbitrary pdf documents and thus make these documents more searchable.

The computational linguistics methods used here assume that the documents to be processed are in pdf format. To remain flexible with respect to file format, however, DCR-CORE includes a preprocessor mechanism that can convert many non-pdf formats to pdf format.

From the documents in pdf format, the next steps extract the text with the relevant metadata word by word, line by line, or page by page. In line-by-line extraction, an attempt is made to classify the individual lines and mark them accordingly, so that these line classifications can later be taken into account in token generation.

In what is currently the last step, qualified tokens can be generated. Each token carries information about its location in the document as well as token classification features such as lemma, form and normalization.
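To make the idea of a qualified token more concrete, the following is a minimal sketch of such a record and its JSON serialization. The class name `QualifiedToken`, the helper `tokens_to_json` and all field names are assumptions chosen for illustration; they are not the schema that DCR-CORE actually writes.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QualifiedToken:
    """Hypothetical token record; field names are illustrative assumptions."""

    # Location of the token in the document.
    page_no: int
    line_no: int
    token_no: int
    # Classification features of the kind named in the text.
    text: str
    lemma: str
    norm: str


def tokens_to_json(tokens: List[QualifiedToken]) -> str:
    """Serialize a list of tokens to a JSON array string."""
    return json.dumps([asdict(token) for token in tokens], indent=2)


example = [
    QualifiedToken(
        page_no=1, line_no=3, token_no=0,
        text="Documents", lemma="document", norm="documents",
    ),
]
print(tokens_to_json(example))
```

Keeping location and linguistic features in one flat record is one possible design; it makes the result easy to store in the JSON flat files mentioned under Features.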

Please see the Documentation for more detailed information.

1. Features

1.1 General

  • Support for documents in different languages - English as standard.

1.2 Preprocessor

  • Identification of scanned pdf documents with PyMuPDF.
  • Conversion of the scanned pdf documents into a set of jpeg or png files with pdf2image and Poppler.
  • Conversion of the documents of type bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp to pdf format with Tesseract OCR.
  • Conversion of csv, docx, epub, html, odt, rst or rtf type documents to pdf format with Pandoc and TeX Live.
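The routing described in the list above can be sketched as a simple lookup by file extension. The function `select_converter` and the labels it returns are hypothetical and only mirror the file types named in the list; the real preprocessor may decide differently, for example by inspecting file content.

```python
from pathlib import Path

# Extension groups copied from the feature list above.
TESSERACT_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def select_converter(file_name: str) -> str:
    """Return a label for the tool that would handle the file (hypothetical helper)."""
    ext = Path(file_name).suffix.lower().lstrip(".")
    if ext == "pdf":
        # A pdf still needs the scanned-document check before text extraction.
        return "pymupdf-scan-check"
    if ext in TESSERACT_TYPES:
        return "tesseract"
    if ext in PANDOC_TYPES:
        return "pandoc"
    return "unsupported"


for name in ("report.docx", "scan.tiff", "paper.pdf", "notes.txt"):
    print(name, "->", select_converter(name))
```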

1.3 Natural Language Processing (NLP)

  • Extract text and metadata from pdf documents with PDFlib TET.
  • Classification of lines in the document, e.g. body, footer, header lines, etc.
  • Sentence-by-sentence determination of the token structure using spaCy.
  • Storage of the analysis results in JSON and XML flat files.
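As an illustration of what line classification means, the sketch below labels a line as a header or footer when the same text opens or closes more than half of the pages. This repetition heuristic is an assumption made for the example and is not DCR-CORE's rule set; the function name `classify_lines` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def classify_lines(pages: List[List[str]]) -> List[List[Tuple[str, str]]]:
    """Label each line as 'header', 'footer' or 'body' (illustrative heuristic)."""
    # Count how often each text appears as the first or last line of a page.
    first = Counter(page[0] for page in pages if page)
    last = Counter(page[-1] for page in pages if page)
    threshold = len(pages) / 2

    result = []
    for page in pages:
        labelled = []
        for idx, line in enumerate(page):
            if idx == 0 and first[line] > threshold:
                label = "header"
            elif idx == len(page) - 1 and last[line] > threshold:
                label = "footer"
            else:
                label = "body"
            labelled.append((label, line))
        result.append(labelled)
    return result


pages = [
    ["ACME Report", "Introduction text", "Confidential"],
    ["ACME Report", "More text", "Confidential"],
    ["ACME Report", "Closing text", "Confidential"],
]
for page in classify_lines(pages):
    print(page)
```

Footers that contain a changing page number would not be caught by exact matching, which is one reason a configurable rule set is more practical than this sketch.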

2. Quick start

In addition to Python, DCR-CORE requires several external software packages, including the tools named in the feature list above (Pandoc, PDFlib TET, Poppler, Tesseract OCR and TeX Live).

To avoid this installation effort, we recommend using the Docker image provided on Docker Hub.

2.1 Docker Container Administration

Creating and running a new container (assuming the path prefix for the local data directory mapping is d:/TempMan):

`docker run -it --name dcr-core -v d:/TempMan:/dcr-core/data/inbox_prod konnexionsgmbh/dcr-core:0.9.7`

Restarting the container:

`docker start dcr-core`

Checking that the container is running:

`docker ps`

Accessing a running container:

`docker attach --detach-keys="ctrl-a" dcr-core`

Stopping a running container:

`docker stop dcr-core`
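If the local data directory differs from d:/TempMan, the run command changes only in the volume mapping. The sketch below assembles the same command for an arbitrary host directory; the helper `build_run_command` is hypothetical, and the image tag and container path are taken from the command shown above.

```python
import shlex
from typing import List


def build_run_command(host_dir: str, tag: str = "0.9.7", name: str = "dcr-core") -> List[str]:
    """Assemble the argument list for creating the container (hypothetical helper)."""
    return [
        "docker", "run", "-it",
        "--name", name,
        # Map the local directory onto the inbox directory inside the container.
        "-v", f"{host_dir}:/dcr-core/data/inbox_prod",
        f"konnexionsgmbh/dcr-core:{tag}",
    ]


# Print the command as a single shell line for copy and paste.
print(shlex.join(build_run_command("d:/TempMan")))
```

The list form can also be passed directly to `subprocess.run` on a machine where Docker is installed.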

2.2 Docker Container Usage

Starting Python in the virtual environment (inside the dcr-core container):

`python3 -m pipenv run python3`

Make the dcr_core module available:

`from dcr_core import cls_process`

Create an instance of the Process class:

`process = cls_process.Process()`

Process document files:

`process.document("data/inbox_prod/<file name>")`
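To process every file in the inbox rather than a single one, the call above can be wrapped in a loop. This is a minimal sketch: the helper `process_inbox` is hypothetical, and it assumes that `document` accepts a path string and needs no further arguments, which is all the example above shows.

```python
from pathlib import Path
from typing import List


def process_inbox(process, inbox: str = "data/inbox_prod") -> List[str]:
    """Call process.document() for each regular file in the inbox directory.

    `process` is expected to be an object with a `document(path)` method,
    such as the instance created above. Returns the processed file names.
    """
    processed = []
    for path in sorted(Path(inbox).iterdir()):
        if path.is_file():  # skip subdirectories
            process.document(path.as_posix())
            processed.append(path.name)
    return processed


# Usage inside the container (assumes dcr_core is importable there):
#   from dcr_core import cls_process
#   process_inbox(cls_process.Process())
```

Passing the processor in as an argument keeps the helper independent of how the `Process` instance is created.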

3. Directory and File Structure of this Repository

3.1 Directories

| Directory | Content |
|---|---|
| .github/workflows | GitHub Actions workflows. |
| data | Example rule files for document line classification. |
| docs | DCR-CORE documentation files. |
| scripts | Ubuntu and Windows scripts for running the application. |
| src | Python scripts and PDFlib TET files. |
| tests | Scripts and data for pytest. |

3.2 Files

| File | Functionality |
|---|---|
| .gitignore | Configuration of files and folders to be ignored. |
| .pylintrc | Configuration file for pylint. |
| LICENSE | Text of the licence terms. |
| logging_cfg.yaml | Configuration of the Logger functionality. |
| Makefile | Definition of tasks to be executed with the make command. |
| MANIFEST.in | Source distribution commands for PyPA. |
| mkdocs.yml | Configuration file for MkDocs. |
| Pipfile | Definition of the Python package requirements. |
| Pipfile.lock | Definition of the specific versions of the Python packages. |
| pyproject.toml | Build system requirements according to PEP 518. |
| README.md | This file. |
| setup.cfg | Setup configuration file. |
| setup.cfg.reference | Original setup configuration file. |

4. Support

If you need help with DCR-CORE, do not hesitate to get in contact with us!

  • For questions and high-level discussions, use Discussions on GitHub.
  • To report a bug or make a feature request, open an Issue on GitHub.

Please note that we may only provide support for problems and questions regarding core features of DCR-CORE. Questions or bug reports about third-party themes, plugins, extensions or similar should be directed to their respective projects, although such questions are not banned from the Discussions.

Make sure to stick around to answer some questions as well!

5. Links

6. Contributing to DCR-CORE

The DCR-CORE project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.

7. Code of Conduct

Everyone who interacts in the DCR-CORE project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.

8. License

Konnexions Public License (KX-PL)