refactoring, documentation
Issue #143
rsoika committed Jan 11, 2021
1 parent 1c73b89 commit 8d861f1
Showing 9 changed files with 147 additions and 338 deletions.
139 changes: 116 additions & 23 deletions imixs-archive-documents/README.md
@@ -1,26 +1,30 @@
# Imixs-Archive-Documents

*Imixs-Archive-Document* is a sub-project of Imixs-Archive. The project provides Plugins and Adapter classes
to extract textual information from attached documents - including Optical character recognition - during the processing life cycle
of a workitem. This information can be used for further processing or to search for documents.
*Imixs-Archive-Document* is a sub-project of Imixs-Archive. The project provides Services, Plugins and Adapter classes
to extract textual information from attached documents during the processing life cycle of a workitem.
This also includes optical character recognition (OCR).
The extracted textual information can be used for further processing or to search for documents.

## OCR

The *Optical character recognition (OCR)* is based on the [Apache Project 'Tika'](https://tika.apache.org/). The textual information for each attachment is stored as a custom attribute named 'text' into the FileData object. This information can be used by applications to analyse, verify or process textual information of any document type. The OCR processing is implemented by the *TikaDocumentService*.
## Text Extraction

### The OCRDocumentService
The text extraction is mainly based on the [Apache Tika Project](https://tika.apache.org/). It can be controlled from a BPMN model
through the corresponding adapter or plug-in class. For a more general, model-independent text extraction the OCRDocumentService can be used.

The textual information for each attachment is stored as a custom attribute named 'text' in the FileData object of a workitem. Applications can use this information to analyse, verify or process the text of any document type.

The *OCRDocumentService* extracts the textual information from file attachments during the processing life cycle. The service calls the Imixs-Archive OCRService to extract the text information of a file. The following environment variables are mandatory:
The following environment variable is mandatory:

* OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of the tika server.

* OCR\_SERVICE\_MODE - if set to 'auto' the TikaDocumentService reacts on the CDI event 'BEFORE\_PROCESS' and extracts the data automatically. If set to 'model' the *TikaPlugin* or the *TikaAdapter* can be used in a BPMN model to activate the OCR processing
* OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of an Apache Tika instance.

See also the [Imixs-Archive OCR project](../imixs-archive-ocr/) for further information about the OCR service.

### Auto Processing

OCR processing can be automatically activated for all new attached documents by setting the environment variable *TIKA_SERVICE_MODE* to 'auto'. If the variable is set to 'model' the *TikaPlugin* or the *TikaAdapter* can be used in a BPMN model to activate the OCR processing in specific situations only.
### The OCRDocumentAdapter

The Adapter class *org.imixs.archive.documents.OCRDocumentAdapter* is a signal adapter which can be bound on a specific BPMN event element.

org.imixs.archive.documents.OCRDocumentAdapter

The TikaAdapter allows a more fine-grained configuration of OCR processing. The environment variable *TIKA_SERVICE_MODE* must be set to 'model'.


### The OCRDocumentPlugin
@@ -31,15 +35,8 @@ The TikaPlugin class *org.imixs.archive.documents.OCRDocumentPlugin* can be used

The environment variable *TIKA_SERVICE_MODE* must be set to 'model'.

### The OCRDocumentAdapter

The Adapter class *org.imixs.archive.documents.OCRDocumentAdapter* is a signal adapter which can be bound on a specific BPMN event element.

org.imixs.archive.documents.OCRDocumentAdapter

The TikaAdapter allows a more fine-grained configuration of OCR processing. The environment variable *TIKA_SERVICE_MODE* must be set to 'model'.

### OCR Tika Options
### Configuration

Both the *OCRDocumentPlugin* and the *OCRDocumentAdapter* can be configured on the BPMN Event level with optional Tika options. The Tika options are defined in the workflow result of the BPMN event element with the tag '*tika*' and the name '*options*'. See the following example:

@@ -67,16 +64,112 @@ Or for multiple languages:
X-Tika-OCRLanguage: eng+fra


For more details about the OCR configuration see the section 'OCR' below.


### The OCRDocumentService

The *OCRDocumentService* is a general service to extract the textual information from file attachments during the processing life cycle, independent of a BPMN model. The TikaDocumentService reacts to the CDI event 'BEFORE\_PROCESS' and extracts the data automatically.

The environment variable *TIKA_SERVICE_MODE* must be set to 'auto'.
If set to 'model' the *TikaPlugin* or the *TikaAdapter* must be used in a BPMN model to activate the text extraction.


## OCR

The *Optical character recognition (OCR)* is based on the [Apache Project 'Tika'](https://tika.apache.org/).
Tika extracts text from over a thousand different file types including PDF and office documents and supports *Optical character recognition (OCR)* based on the [Tesseract project](https://github.com/tesseract-ocr/tesseract).

To run a Tika Server with Docker, the [official Docker image](https://hub.docker.com/r/apache/tika) can be used:

$ docker run -d -p 9998:9998 apache/tika:1.24.1-full


The *TikaService* EJB provides methods to extract textual information from documents attached to a Workitem. A valid Tika Server endpoint must exist.

### The TikaService

The *TikaService* extracts the textual information from file attachments calling the Tika Rest API Service endpoint. The following environment variables are supported:

* OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of the Tika server (mandatory).
* OCR\_STRATEGY - Which strategy to use for OCR (AUTO|NO_OCR|OCR_AND_TEXT_EXTRACTION|OCR_ONLY)

The optional environment variable OCR\_STRATEGY controls how text is extracted from a PDF file:

**AUTO**
<br />
The best OCR strategy is chosen by the Tika Server itself. This is the default setting.

**NO_OCR**
<br />
OCR processing is disabled and text is extracted only from PDF files that contain raw text. If a PDF file does not contain raw text data, no text will be extracted!

**OCR_ONLY**
<br />
PDF files will always be OCR scanned, even if the PDF file contains text data.

**OCR_AND_TEXT_EXTRACTION**
<br />
OCR processing and raw text extraction are both performed. Note: This may result in a duplication of text, so this mode is not recommended.
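
The four strategy values and the default can be illustrated with a small sketch. The class `OcrStrategyResolver` and its `resolve` method are hypothetical names chosen for illustration and are not part of Imixs-Archive. The sketch only assumes that the variable is read as plain text and that a missing value falls back to AUTO, as described above.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Imixs-Archive: resolves the OCR_STRATEGY setting. */
final class OcrStrategyResolver {

    /** The four strategy values listed above. */
    enum OcrStrategy { AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION }

    /** Returns the matching strategy, or AUTO if the value is missing or blank. */
    static OcrStrategy resolve(String rawValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return OcrStrategy.AUTO; // documented default
        }
        // valueOf throws IllegalArgumentException for unknown values
        return OcrStrategy.valueOf(rawValue.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        OcrStrategy strategy = resolve(System.getenv("OCR_STRATEGY"));
        System.out.println("Using OCR strategy: " + strategy);
    }
}
```

Rejecting unknown values early, instead of silently falling back to AUTO, is a design choice of this sketch; the actual service may behave differently.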

### Tika Options

Out of the box, Apache Tika will start with the default configuration. By providing additional config options
you can specify a custom tika configuration to be used by the tika server.

For example to set the DPI mode call:

@EJB
TikaDocumentService tikaDocumentService;

// define options
List<String> options = new ArrayList<>();
options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION");
options.add("X-Tika-PDFOcrImageType=RGB"); // support colors
options.add("X-Tika-PDFOcrDPI=72"); // set DPI
options.add("X-Tika-OCRLanguage=eng"); // set English language
// start ocr
tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options);

**Note:** Options set by this method call overwrite the options defined in a tika config file.
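
To make the role of these options more concrete, the following sketch shows one way a list of `name=value` options could be turned into custom headers of a request to a Tika server. It is not the implementation used by the service. The class `TikaRequestSketch`, the method `buildRequest` and the endpoint `http://localhost:9998/tika` are assumptions for illustration, based on a Tika server running locally on port 9998 and accepting documents via `PUT` on `/tika`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

/** Hypothetical helper, not part of Imixs-Archive. */
final class TikaRequestSketch {

    /** Builds a PUT request in which every "name=value" option becomes a request header. */
    static HttpRequest buildRequest(String endpoint, byte[] content, List<String> options) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "text/plain") // ask for plain text output
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content));
        for (String option : options) {
            int separator = option.indexOf('=');
            if (separator <= 0) {
                throw new IllegalArgumentException("Expected name=value but got: " + option);
            }
            builder.header(option.substring(0, separator).trim(),
                    option.substring(separator + 1).trim());
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // The endpoint is an assumption: a Tika server listening on localhost:9998
        HttpRequest request = buildRequest("http://localhost:9998/tika", new byte[0],
                List.of("X-Tika-PDFOcrDPI=72", "X-Tika-OCRLanguage=deu"));
        // Print the headers only; no network call is made here
        request.headers().map().forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

The request could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which requires a reachable Tika server.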

You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika).

- https://cwiki.apache.org/confluence/display/TIKA/TikaServer
- https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
- https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)

#### Example

In this example configuration the OCR processing will be started with the following additional Tika options:

- X-Tika-PDFOcrImageType=RGB - set color mode
- X-Tika-PDFOcrDPI=72 - set DPI to 72
- X-Tika-OCRLanguage=deu - set OCR language to German


#### Overriding the configured language as part of your request

Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. An example of this is shown below:

X-Tika-OCRLanguage=deu

Or for multiple languages:

X-Tika-OCRLanguage: eng+fra


For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).



## Searching Documents

All textual information extracted from attached documents is also searchable through the Imixs search index. The class *org.imixs.workflow.documents.DocumentIndexer* adds the OCR content of each file attachment to the search index.

## The PDF XML Plugin

The plugin class "_org.imixs.workflow.documents.parser.PDFXMLExtractorPlugin_" can be used to extract embedded XML data from a PDF document and convert the data into an Imixs Workitem. For example, _ZUGFeRD_ defines a standard XML document for invoices.
The plugin class "*org.imixs.workflow.documents.parser.PDFXMLExtractorPlugin*" can be used to extract embedded XML data from a PDF document and convert the data into an Imixs Workitem. For example, _ZUGFeRD_ defines a standard XML document for invoices.

The plugin can be activated by the BPMN Model with the following result definition:

164 changes: 0 additions & 164 deletions imixs-archive-documents/ZUGFeRD-invoice.xml

This file was deleted.

@@ -7,7 +7,6 @@

import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.imixs.archive.core.SnapshotService;
import org.imixs.archive.ocr.OCRService;
import org.imixs.workflow.ItemCollection;
import org.imixs.workflow.SignalAdapter;
import org.imixs.workflow.engine.WorkflowService;
@@ -43,11 +42,11 @@ public class OCRDocumentAdapter implements SignalAdapter {
private static Logger logger = Logger.getLogger(OCRDocumentAdapter.class.getName());

@Inject
@ConfigProperty(name = OCRService.ENV_OCR_SERVICE_MODE, defaultValue = "auto")
@ConfigProperty(name = TikaService.ENV_OCR_SERVICE_MODE, defaultValue = "auto")
String serviceMode;

@Inject
OCRService ocrService;
TikaService ocrService;

@Inject
WorkflowService workflowService;
@@ -7,7 +7,6 @@

import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.imixs.archive.core.SnapshotService;
import org.imixs.archive.ocr.OCRService;
import org.imixs.workflow.ItemCollection;
import org.imixs.workflow.WorkflowContext;
import org.imixs.workflow.engine.plugins.AbstractPlugin;
@@ -31,10 +30,10 @@ public class OCRDocumentPlugin extends AbstractPlugin {
private static Logger logger = Logger.getLogger(OCRDocumentPlugin.class.getName());

@Inject
OCRService ocrService;
TikaService ocrService;

@Inject
@ConfigProperty(name = OCRService.ENV_OCR_SERVICE_MODE, defaultValue = "auto")
@ConfigProperty(name = TikaService.ENV_OCR_SERVICE_MODE, defaultValue = "auto")
String serviceMode;

@Inject