Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project aims to use various software techniques to automatically detect the structure in arbitrary `pdf` documents and thus make these documents more searchable.
The computational linguistics methods used here assume that the documents to be processed are in `pdf` format.
However, to remain flexible with respect to the file format of the input documents, `DCR-CORE` includes a preprocessor mechanism that can convert many non-`pdf` formats to `pdf` format.
From the documents in `pdf` format, the next steps extract the text with the relevant metadata word by word, line by line, or page by page. In line-by-line extraction, an attempt is made to classify the individual lines and mark them accordingly, so that these line classifications can later be taken into account in token generation.
In what is currently the last step, qualified tokens can be generated. These contain, on the one hand, information about the location of the token in the document and, on the other hand, token classification features such as lemma, form, and normalization.
Please see the Documentation for more detailed information.
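To make the idea of a qualified token more concrete, here is a minimal sketch of how such a record could be modelled and serialized to JSON. The class name and all field names are hypothetical and chosen for illustration only; they are not the actual output schema of `DCR-CORE`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QualifiedToken:
    """Illustrative token record (hypothetical field names, not the DCR-CORE schema)."""

    # Location of the token in the document.
    page_no: int
    line_no: int
    token_no: int
    # Classification features of the kind mentioned above.
    text: str
    lemma: str
    norm: str
    # Line classification carried over from line-by-line extraction.
    line_type: str


token = QualifiedToken(
    page_no=1,
    line_no=3,
    token_no=0,
    text="Documents",
    lemma="document",
    norm="documents",
    line_type="body",
)

# Serialize the record the way a JSON flat file entry might look.
token_json = json.dumps(asdict(token), indent=2)
print(token_json)
```

Keeping location and classification features in one flat record makes each token self-describing, which suits storage in flat files.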
- Support for documents in different languages, with English as the default.
- Identification of scanned `pdf` documents with PyMuPDF.
- Conversion of scanned `pdf` documents into a set of `jpeg` or `png` files with pdf2image and Poppler.
- Conversion of documents of type `bmp`, `gif`, `jp2`, `jpeg`, `png`, `pnm`, `tif`, `tiff` or `webp` to `pdf` format with Tesseract OCR.
- Conversion of `csv`, `docx`, `epub`, `html`, `odt`, `rst` or `rtf` type documents to `pdf` format with Pandoc and TeX Live.
- Extraction of text and metadata from `pdf` documents with PDFlib TET.
- Classification of lines in the document, e.g. body, footer, and header lines.
- Sentence-by-sentence determination of the token structure using spaCy.
- Storage of the analysis results in JSON and XML flat files.
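The preprocessing rules in this list amount to a dispatch on the file extension. The following sketch illustrates that idea under stated assumptions: the function name `select_converter` and the returned labels are hypothetical, and the code only mirrors the file types listed above, not the actual `DCR-CORE` implementation.

```python
from pathlib import Path

# File types taken from the feature list above.
IMAGE_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def select_converter(file_name: str) -> str:
    """Return a label for the preprocessing route of a file (hypothetical helper)."""
    suffix = Path(file_name).suffix.lower().lstrip(".")
    if suffix == "pdf":
        # Already in the target format; scanned pdf files would still need OCR.
        return "pdf"
    if suffix in IMAGE_TYPES:
        # Image documents are converted to pdf with Tesseract OCR.
        return "tesseract"
    if suffix in PANDOC_TYPES:
        # Text-like documents are converted to pdf with Pandoc and TeX Live.
        return "pandoc"
    return "unsupported"


for name in ("report.docx", "scan.TIFF", "paper.pdf", "notes.txt"):
    print(name, "->", select_converter(name))
```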
In addition to Python, the following software packages are required to use `DCR-CORE`:
To avoid this installation effort, we recommend using the Docker image `konnexionsgmbh/dcr-core` provided on Docker Hub.
Creating and running a new container (assuming the path prefix for the local data directory mapping is `d:/TempMan`):
`docker run -it --name dcr-core -v d:/TempMan:/dcr-core/data/inbox_prod konnexionsgmbh/dcr-core:0.9.7`
Restarting the container:
`docker start dcr-core`
Checking that the container is running:
`docker ps`
Accessing a running container:
`docker attach --detach-keys="ctrl-a" dcr-core`
Stopping a running container:
`docker stop dcr-core`
Starting Python in the virtual environment (inside the `dcr-core` container):
`python3 -m pipenv run python3`
Making the `dcr_core` module available:
`from dcr_core import cls_process`
Creating an instance of the `Process` class:
`process = cls_process.Process()`
Processing document files:
`process.document("data/inbox_prod/<file name>")`
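When several files are waiting in the inbox directory, the call shown above can be wrapped in a small loop. The following is only a sketch: the helper `process_inbox` is hypothetical, and it assumes that a failed document raises a Python exception, which the source does not confirm. The processing function is passed in as a parameter so that the helper does not depend on `dcr_core` itself.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def process_inbox(
    inbox: str, process_document: Callable[[str], object]
) -> Dict[str, Optional[str]]:
    """Call process_document for every file in inbox (hypothetical helper).

    Returns a mapping from file name to None on success or to the error
    message if the call raised an exception.
    """
    results: Dict[str, Optional[str]] = {}
    for path in sorted(Path(inbox).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        try:
            process_document(str(path))
            results[path.name] = None
        except Exception as exc:  # assumption: failures surface as exceptions
            results[path.name] = str(exc)
    return results
```

Inside the container, this could be used as `process_inbox("data/inbox_prod", process.document)`, with `process` being the `Process` instance created above.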
Directory | Content |
---|---|
.github/workflows | GitHub Action workflows. |
data | Example rule files for document line classification. |
docs | DCR-CORE documentation files. |
scripts | Ubuntu and Windows scripts for running the application. |
src | Python scripts and PDFlib TET files. |
tests | Scripts and data for pytest. |
File | Functionality |
---|---|
.gitignore | Configuration of files and folders to be ignored. |
.pylintrc | Configuration file for pylint. |
LICENSE | Text of the licence terms. |
logging_cfg.yaml | Configuration of the Logger functionality. |
Makefile | Definition of tasks to be executed with the make command. |
MANIFEST.in | Source distribution commands for PyPA. |
mkdocs.yml | Configuration file for MkDocs. |
Pipfile | Definition of the Python package requirements. |
Pipfile.lock | Definition of the specific versions of the Python packages. |
pyproject.toml | Build system requirements according to PEP 518. |
README.md | This file. |
setup.cfg | Setup configuration file - see here. |
setup.cfg.reference | Original setup configuration file. |
If you need help with `DCR-CORE`, do not hesitate to get in contact with us!
- For questions and high-level discussions, use Discussions on GitHub.
- To report a bug or make a feature request, open an Issue on GitHub.
Please note that we can only provide support for problems and questions regarding core features of `DCR-CORE`.
Any questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects.
However, such questions are not banned from the Discussions.
Make sure to stick around to answer some questions as well!
- Official Documentation
- Release Notes
- Discussions (Third-party themes, recipes, plugins and more)
The `DCR-CORE` project welcomes, and depends on, contributions from developers and users in the open source community.
Please see the Contributing Guide for information on how you can help.
Everyone who interacts in the `DCR-CORE` project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.