Commit

Implementation new OCR Module
Issue #101
rsoika committed Jun 30, 2020
1 parent d4a8c32 commit 392ac44
Showing 3 changed files with 574 additions and 0 deletions.
68 changes: 68 additions & 0 deletions imixs-archive-ocr/README.md
@@ -0,0 +1,68 @@
# Imixs-Archive-OCR

*Imixs-Archive-OCR* is a sub-project of Imixs-Archive. The project provides methods to extract textual information from attached documents, including optical character recognition (OCR), during the processing phase. This information can be used for further processing or to search for documents.


## OCR

*Optical character recognition (OCR)* is based on the [Apache Project 'Tika'](https://tika.apache.org/). The text extracted from each attachment is stored in the FileData object as a custom attribute named 'text'. Applications can use this information to analyse, verify or process the textual content of any document type. The OCR processing is implemented by the *TikaDocumentService*.

### The OCRService

The *OCRService* extracts the textual information from file attachments. The service calls the Tika REST API to extract the text of a file. The following environment variables are mandatory:

* TIKA\_SERVICE\_ENDPONT - defines the REST API endpoint of the Tika server.
* TIKA\_SERVICE\_MODE - if set to 'auto', the TikaDocumentService reacts to the CDI event 'BEFORE\_PROCESS' and extracts the data automatically. If set to 'model', the *TikaPlugin* or the *TikaAdapter* can be used in a BPMN model to activate the OCR processing.

See also the [Docker Image Imixs/Tika](https://cloud.docker.com/u/imixs/repository/docker/imixs/tika) for further information.
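
As a rough illustration of what a call to the Tika REST API can look like, the following sketch reads the endpoint from an environment map and builds a `PUT` request carrying the attachment bytes. It is not the code of the *TikaDocumentService*. The class `TikaRequestSketch` and its methods are hypothetical, and the sketch assumes that the endpoint variable holds the full URL of the Tika text resource (for example `http://localhost:9998/tika`).

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Imixs-Archive-OCR. */
class TikaRequestSketch {

    // Variable name spelled exactly as in the list above.
    static final String ENV_ENDPOINT = "TIKA_SERVICE_ENDPONT";

    /** Returns the configured endpoint or fails if the mandatory variable is missing. */
    static String endpoint(Map<String, String> env) {
        String value = env.get(ENV_ENDPOINT);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing environment variable " + ENV_ENDPOINT);
        }
        return value.trim();
    }

    /**
     * Builds (but does not send) a PUT request with the file content as body.
     * Assumption: the endpoint is the full URL of the Tika text resource.
     */
    static HttpRequest buildRequest(Map<String, String> env, byte[] content, String contentType) {
        return HttpRequest.newBuilder(URI.create(endpoint(env)))
                .header("Accept", "text/plain")      // ask for plain text in the response
                .header("Content-Type", contentType) // media type of the attachment
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }
}
```

In an application, `System.getenv()` could be passed as the map, and the request could be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Taking the environment as a parameter keeps the helper easy to test without a running Tika server.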


### The OCR MODE

The optional environment variable TIKA\_OCR\_MODE controls the OCR behavior:

* PDF_ONLY - OCR processing is disabled. Text is extracted only from PDF files, if available; all other files are ignored.
* OCR_ONLY - PDF files and all other files are always OCR scanned.
* MIXED - OCR processing is performed only if no text data can be extracted from a given PDF file (default).

For further configuration, see also the Docker project [Imixs/tika](https://github.com/imixs/imixs-docker/tree/master/tika).
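
The three modes can be read as a small decision table. The sketch below is one possible way to express it and is not taken from the Imixs code base: the class `OcrModeSketch` and the enums `Mode` and `Action` are hypothetical names. The README does not say how non-PDF files are treated in MIXED mode, so the sketch assumes they are OCR scanned.

```java
import java.util.Locale;

/** Hypothetical sketch of the mode handling; names are not from the Imixs code base. */
class OcrModeSketch {

    enum Mode { PDF_ONLY, OCR_ONLY, MIXED }

    enum Action { SKIP, TEXT_EXTRACTION, OCR }

    /** Parses TIKA_OCR_MODE; an unset or blank value falls back to MIXED, the documented default. */
    static Mode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Mode.MIXED;
        }
        return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /**
     * Decides what to do with a single attachment.
     *
     * @param isPdf      true if the attachment is a PDF file
     * @param pdfHasText true if plain text extraction of the PDF returned text
     */
    static Action decide(Mode mode, boolean isPdf, boolean pdfHasText) {
        switch (mode) {
        case PDF_ONLY:
            // OCR is disabled; non-PDF files are ignored.
            return isPdf ? Action.TEXT_EXTRACTION : Action.SKIP;
        case OCR_ONLY:
            // Every file is OCR scanned, PDF or not.
            return Action.OCR;
        case MIXED:
        default:
            if (isPdf && pdfHasText) {
                return Action.TEXT_EXTRACTION;
            }
            // PDF without text; also non-PDF files (assumption, see lead-in).
            return Action.OCR;
        }
    }
}
```

For example, `OcrModeSketch.decide(OcrModeSketch.parse(System.getenv("TIKA_OCR_MODE")), true, false)` selects OCR for a scanned PDF when the variable is unset, because MIXED is then used.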

### Tika Options

Out of the box, Apache Tika starts with its default configuration. By providing additional config options you can specify a custom Tika configuration to be used by the Tika server.

For example, to set the DPI mode, call:

    @EJB
    TikaDocumentService tikaDocumentService;

    // define the Tika options as key=value pairs
    List<String> options = new ArrayList<String>();
    options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION");
    options.add("X-Tika-PDFOcrImageType=RGB");
    options.add("X-Tika-PDFOcrDPI=400");

    // start the OCR processing in MIXED mode
    tikaDocumentService.extractText(workitem, "MIXED", options);

**Note:** Options set by this method call overwrite the options defined in a tika config file.
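
Each option is a `key=value` string whose key looks like a Tika server request header. A plausible way to handle such a list is to split each entry at the first `=` and send the result as HTTP headers with the request. Whether the *TikaDocumentService* does exactly this is an assumption, and the class `TikaOptionsSketch` below is a hypothetical helper written only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; shows one way to turn key=value options into header pairs. */
class TikaOptionsSketch {

    /**
     * Splits each option at the first '=' into a header name and value.
     * Insertion order is kept; a later option with the same key replaces an earlier one.
     */
    static Map<String, String> toHeaders(List<String> options) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (options == null) {
            return headers;
        }
        for (String option : options) {
            int pos = option.indexOf('=');
            if (pos <= 0) {
                // no '=' found, or the key is empty
                throw new IllegalArgumentException("Expected key=value but got: " + option);
            }
            String name = option.substring(0, pos).trim();
            String value = option.substring(pos + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }
}
```

Splitting at the first `=` only means that a value may itself contain `=` characters without being truncated.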

There are various ways to configure the Tika server. Details on how to configure imixs-tika can be found [here](https://github.com/imixs/imixs-docker/tree/master/tika).




## How to Install

To include the imixs-archive-ocr service, add the following Maven dependency:


    <!-- Imixs-Archive OCRService -->
    <dependency>
        <groupId>org.imixs.workflow</groupId>
        <artifactId>imixs-archive-ocr</artifactId>
        <scope>compile</scope>
    </dependency>


29 changes: 29 additions & 0 deletions imixs-archive-ocr/pom.xml
@@ -0,0 +1,29 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.imixs.workflow</groupId>
		<artifactId>imixs-archive</artifactId>
		<version>2.2.0-SNAPSHOT</version>
	</parent>
	<artifactId>imixs-archive-ocr</artifactId>


	<dependencies>

		<!-- Imixs-Workflow dependencies -->
		<dependency>
			<groupId>org.imixs.workflow</groupId>
			<artifactId>imixs-workflow-core</artifactId>
		</dependency>


		<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
		<dependency>
			<groupId>org.apache.pdfbox</groupId>
			<artifactId>pdfbox</artifactId>
			<version>${apache.pdfbox.version}</version>
		</dependency>

	</dependencies>
	<name>Imixs-Archive OCR</name>
</project>