diff --git a/imixs-archive-documents/README.md b/imixs-archive-documents/README.md index a00e2df6..6c43996d 100644 --- a/imixs-archive-documents/README.md +++ b/imixs-archive-documents/README.md @@ -1,26 +1,30 @@ # Imixs-Archive-Documents -*Imixs-Archive-Document* is a sub-project of Imixs-Archive. The project provides Plugins and Adapter classes - to extract textual information from attached documents - including Optical character recognition - during the processing life cycle - of a workitem. This information can be used for further processing or to search for documents. +*Imixs-Archive-Document* is a sub-project of Imixs-Archive. The project provides Services, Plugins and Adapter classes + to extract textual information from attached documents during the processing life cycle of a workitem. + This includes also Optical character recognition (OCR). + The extracted textual information can be used for further processing or to search for documents. -## OCR -The *Optical character recognition (OCR)* is based on the [Apache Project 'Tika'](https://tika.apache.org/). The textual information for each attachment is stored as a custom attribute named 'text' into the FileData object. This information can be used by applications to analyse, verify or process textual information of any document type. The OCR processing is implemented by the *TikaDocumentService*. +## Text Extraction -### The OCRDocumentService +The text extraction is mainly based on the [Apache Tika Project](https://tika.apache.org/). The text extraction can be controlled based on a BPMN model +through the corresponding adapter or plug-in class. For a more general and model independent text extraction the OCRDocumentService can be used. + +The textual information for each attachment is stored as a custom attribute named 'text' into the FileData object of a workitem. This information can be used by applications to analyse, verify or process textual information of any document type. 
-The *OCRDocumentService* extracts the textual information from file attachments during the processing life cycle. The service calls the Imixs-Archvie OCRService to extract the text information of a file. The following environment variables are mandatory: +The following environment variable is mandatory: - * OCR\_SERVICE\_ENDPONT - defines the Rest API end-point of the tika server. - - * OCR\_SERVICE\_MODE - if set to 'auto' the TikaDocumentService reacts on the CDI event 'BEFORE\_PROCESS' and extracts the data automatically. If set to 'model' the *TikaPlugin* or the *TikaAdapter* can be used in a BPMN model to activate the OCR processing + * OCR\_SERVICE\_ENDPONT - defines the Rest API end-point of an Apache Tika instance. -See also the [Imixs-Archive OCR project](../imixs-archive-ocr/) for further information about the OCR service. - -### Auto Processing -OCR processing can be automatically activated for all new attached documents by setting the environment variable *TIKA_SERVICE_MODE* to 'auto'. If the variable is set to 'model' the *TikaPlugin* or the *TikaAdapter* can be used in a BPMN model to activate the OCR processing in specific situations only. +### The OCRDocumentAdapter + +The Adapter class *org.imixs.archive.documents.OCRDocumentAdapter* is a signal adapter which can be bound on a specific BPMN event element. + + org.imixs.archive.documents.OCRDocumentAdapter + +The TikaAdapter allows a more fine grained configuration of OCR processing. The environment variable *TIKA_SERVICE_MODE* must be set to 'model'. ### The OCRDocumentPlugin @@ -31,15 +35,8 @@ The TikaPlugin class *org.imixs.archive.documents.OCRDocumentPlugin* can be used The environment variable *TIKA_SERVICE_MODE* must be set to 'model'. -### The OCRDocumentAdapter - -The Adapter class *org.imixs.archive.documents.OCRDocumentAdapter* is a signal adapter which can be bound on a specific BPMN event element. 
- - org.imixs.archive.documents.OCRDocumentAdapter - -The TikaAdapter allows a more fine grained configuration of OCR processing. The environment variable *TIKA_SERVICE_MODE* must be set to 'model'. -### OCR Tika Options +### Configuration Both, the *OCRDocumentPlugin* as also the *OCRDocumentAdapter* can be configured on the BPMN Event level with optional Tika options. The tika options can be configured in the workflow result of the BPMN event element with the tag '*tika*' and the name '*options*'. See the following example: @@ -67,16 +64,112 @@ Or for multiple languages: X-Tika-OCRLanguage: eng+fra" +For more details about the OCR configuration see the section 'OCR' below. + + +### The OCRDocumentService + +The *OCRDocumentService* is a general service to extract the textual information from file attachments during the processing life cycle independent form a BPMN model. The TikaDocumentService reacts on the CDI event 'BEFORE\_PROCESS' and extracts the data automatically. + +The environment variable *TIKA_SERVICE_MODE* must be set to 'auto'. +If set to 'model' the *TikaPlugin* or the *TikaAdapter* must be used in a BPMN model to activate the text extraction. + + +## OCR + +The *Optical character recognition (OCR)* is based on the [Apache Project 'Tika'](https://tika.apache.org/). +Tika extracts text from over a thousand different file types including PDF and office documents and supports *Optical character recognition (OCR)* based on the [Tesseract project](https://github.com/tesseract-ocr/tesseract). + +To run a Tika Server with Docker, the [official Docker image](https://hub.docker.com/r/apache/tika) can be used: + + $ docker run -d -p 9998:9998 apache/tika:1.24.1-full + + +The *TikaService* EJB provides methods to extract textual information from documents attached to a Workitem. A valid Tika Server endpoint must exist. + +### The TikaService + +The *TikaService* extracts the textual information from file attachments calling the Tika Rest API Service endpoint. 
The following environment variables are supported: + + * OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of the Tika server (mandetory). + * OCR\_STRATEGY - Which strategy to use for OCR (AUTO|NO_OCR|OCR_AND_TEXT_EXTRACTION|OCR_ONLY) + +With the optional environment variable OCR\_STRATEGY the behavior how text is extracted from a PDF file can be controlled: + +**AUTO** +
+The best OCR strategy is chosen by the Tika Server itself. This is the default setting. + +**NO_OCR** +
+OCR processing is disabled and text is extracted only from PDF files including a raw text. If a pdf file does not contain raw text data no text will be extracted! + +**OCR_ONLY** +
+PDF files will always be OCR scanned even if the pdf file contains text data. + +**OCR_AND_TEXT_EXTRACTION** +
+OCR processing and raw text extraction is performed. Note: This may result is a duplication of text and the mode is not recommended. + +### Tika Options + +Out of the box, Apache Tika will start with the default configuration. By providing additional config options + you can specify a custom tika configuration to be used by the tika server. + +For example to set the DPI mode call: + + @EJB + TikaDocumentService tikaDocumentService; + + // define options + List options=new ArrayList(); + options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION"); + options.add("X-Tika-PDFOcrImageType=RGB"); // support colors + options.add("X-Tika-PDFOcrDPI=72"); // set DPI + options.add("X-Tika-OCRLanguage=eng"); // set english language + // start ocr + tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options) + +**Note:** Options set by this method call overwrite the options defined in a tika config file. + +You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika). + + - https://cwiki.apache.org/confluence/display/TIKA/TikaServer + - https://cwiki.apache.org/confluence/display/TIKA/TikaOCR + - https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) + +#### Example + +In this example configuration the OCR processing will be started with 4 additional tika options. + + - X-Tika-PDFOcrImageType=RGB - set color mode + - X-Tika-PDFOcrDPI=72 - set DPI to 72 + - X-Tika-OCRLanguage=deu - set OCR language to german + + +#### Overriding the configured language as part of your request + +Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. 
An example of this is shown below: + + X-Tika-OCRLanguage=deu + +Or for multiple languages: + + X-Tika-OCRLanguage: eng+fra" + + For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr). + ## Searching Documents All extracted textual information from attached documents is also searchable by the Imixs search index. The class *org.imixs.workflow.documents.DocumentIndexer* adds the ocr content for each file attachment into the search index. ## The PDF XML Plugin -The plugin class "_org.imixs.workflow.documents.parser.PDFXMLExtractorPlugin_" can be used to extract embedded XML Data from a PDF document and convert the data into a Imixs Workitem. For example the _ZUGFeRD_ defines a standard XML document for invoices. +The plugin class "*org.imixs.workflow.documents.parser.PDFXMLExtractorPlugin*" can be used to extract embedded XML Data from a PDF document and convert the data into a Imixs Workitem. For example the _ZUGFeRD_ defines a standard XML document for invoices. The plugin can be activated by the BPMN Model with the following result definition: diff --git a/imixs-archive-documents/ZUGFeRD-invoice.xml b/imixs-archive-documents/ZUGFeRD-invoice.xml deleted file mode 100644 index bf666df7..00000000 --- a/imixs-archive-documents/ZUGFeRD-invoice.xml +++ /dev/null @@ -1,164 +0,0 @@ - - - - - - false - - - urn:ferd:CrossIndustryDocument:invoice:1p0:comfort - - - - 12345 - RECHNUNG - 380 - - 20121031 - - - - - - - - - - - - - - Musteriieferant - - - 12345 - Musterweg 1 - - Musterstadt - DE - - - - - - - - - 123456 - Firma Musterkunde - - 12345 - - Musterstrasse 1 - Musterstadt - DE - - - - - - - - - - 20000101 - - - - - - - - - - - Rechungsnummer: 12345, Rechnungsdatum: 20121031 - - EUR - - - 31 - Überweisung - - - - - - PBNKDEFF - - - - - - 263.22 - VAT - 1385.35 - S - 263.22 - - - - - 3.86 - VAT - 55.14 - S - 55.14 - - - - - - - Zahlungsbedingungen: ... 
- - 20150205 - - - - - 0.00 - 0.00 - 0.00 - 1440.49 - 267.08 - 1707.57 - 0.00 - 0.00 - - - - - - - - 10 - - - - - - 0.0000 - - - - 0.0000 - - - - - - 0.0000 - - - - 4012345001235 - 11111 - MusterArtikel - - - - - - - diff --git a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentAdapter.java b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentAdapter.java index fcc97c7a..7cc6b5b4 100644 --- a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentAdapter.java +++ b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentAdapter.java @@ -7,7 +7,6 @@ import org.eclipse.microprofile.config.inject.ConfigProperty; import org.imixs.archive.core.SnapshotService; -import org.imixs.archive.ocr.OCRService; import org.imixs.workflow.ItemCollection; import org.imixs.workflow.SignalAdapter; import org.imixs.workflow.engine.WorkflowService; @@ -43,11 +42,11 @@ public class OCRDocumentAdapter implements SignalAdapter { private static Logger logger = Logger.getLogger(OCRDocumentAdapter.class.getName()); @Inject - @ConfigProperty(name = OCRService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") + @ConfigProperty(name = TikaService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") String serviceMode; @Inject - OCRService ocrService; + TikaService ocrService; @Inject WorkflowService workflowService; diff --git a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentPlugin.java b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentPlugin.java index 7840cfb9..422b9751 100644 --- a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentPlugin.java +++ b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentPlugin.java @@ -7,7 +7,6 @@ import org.eclipse.microprofile.config.inject.ConfigProperty; import org.imixs.archive.core.SnapshotService; -import org.imixs.archive.ocr.OCRService; import 
org.imixs.workflow.ItemCollection; import org.imixs.workflow.WorkflowContext; import org.imixs.workflow.engine.plugins.AbstractPlugin; @@ -31,10 +30,10 @@ public class OCRDocumentPlugin extends AbstractPlugin { private static Logger logger = Logger.getLogger(OCRDocumentPlugin.class.getName()); @Inject - OCRService ocrService; + TikaService ocrService; @Inject - @ConfigProperty(name = OCRService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") + @ConfigProperty(name = TikaService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") String serviceMode; @Inject diff --git a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentService.java b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentService.java index 381ad773..354d0248 100644 --- a/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentService.java +++ b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/OCRDocumentService.java @@ -9,7 +9,6 @@ import org.eclipse.microprofile.config.inject.ConfigProperty; import org.imixs.archive.core.SnapshotService; -import org.imixs.archive.ocr.OCRService; import org.imixs.workflow.ItemCollection; import org.imixs.workflow.engine.ProcessingEvent; import org.imixs.workflow.exceptions.PluginException; @@ -42,11 +41,11 @@ public class OCRDocumentService { private static Logger logger = Logger.getLogger(OCRDocumentService.class.getName()); @Inject - @ConfigProperty(name = OCRService.ENV_OCR_SERVICE_ENDPOINT) + @ConfigProperty(name = TikaService.ENV_OCR_SERVICE_ENDPOINT) Optional serviceEndpoint; @Inject - @ConfigProperty(name = OCRService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") + @ConfigProperty(name = TikaService.ENV_OCR_SERVICE_MODE, defaultValue = "auto") String serviceMode; @@ -56,7 +55,7 @@ public class OCRDocumentService { @Inject - OCRService ocrService; + TikaService ocrService; /** * React on the ProcessingEvent. 
This method sends the document content to the diff --git a/imixs-archive-ocr/src/main/java/org/imixs/archive/ocr/OCRService.java b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/TikaService.java similarity index 91% rename from imixs-archive-ocr/src/main/java/org/imixs/archive/ocr/OCRService.java rename to imixs-archive-documents/src/main/java/org/imixs/archive/documents/TikaService.java index c207163d..f30bc5f5 100644 --- a/imixs-archive-ocr/src/main/java/org/imixs/archive/ocr/OCRService.java +++ b/imixs-archive-documents/src/main/java/org/imixs/archive/documents/TikaService.java @@ -1,4 +1,4 @@ -package org.imixs.archive.ocr; +package org.imixs.archive.documents; import java.io.BufferedReader; import java.io.IOException; @@ -26,18 +26,25 @@ /** * The OCRService extracts the textual information from document attachments of - * a workitem. + * a workitem and stores the data into the $file attribute 'text'. *

- * The text information is stored in the $file attribute 'text'.
+ * For the text extraction the services sends the content of a document to an
+ * instance of a Apache Tika server via the Rest API. The environment variable
+ * OCR_STRATEGY defines how PDF files will be scanned. Possible values are:
+ * <ul>
+ * <li>AUTO - The best OCR strategy is chosen by the Tika Server itself. This is
+ * the default setting.</li>
+ * <li>NO_OCR - OCR processing is disabled and text is extracted only from PDF
+ * files including a raw text. If a pdf file does not contain raw text data no
+ * text will be extracted!</li>
+ * <li>OCR_ONLY - PDF files will always be OCR scanned even if the pdf file
+ * contains text data.</li>
+ * <li>OCR_AND_TEXT_EXTRACTION - OCR processing and raw text extraction is
+ * performed. Note: This may result is a duplication of text and the mode is not
+ * recommended.</li>
  * <p>
- * For PDF files with textual content the PDFBox api is used. In other cases,
- * the method sends the content via a Rest API to the tika server for OCR
- * processing. The environment variable OCR_PDF_MODE defines how PDF files will
- * be scanned. Possible values are TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR (default)
- * <p>
- * For OCR processing the service expects a valid Rest API end-point defined by
- * the Environment Parameter 'TIKA_SERVICE_ENDPONT'. If the TIKA_SERVICE_ENDPONT
- * is not set, then the service will be skipped.
+ * The service expects a valid Rest API end-point to an instance of a Tika Server defined by
+ * the Environment Parameter 'TIKA_SERVICE_ENDPONT'.
  * <p>
    * The environment parameter 'TIKA_SERVICE_MODE' must be set to 'auto' to enable * the service. @@ -48,7 +55,7 @@ * @author rsoika */ @Stateless -public class OCRService { +public class TikaService { public static final String FILE_ATTRIBUTE_TEXT = "text"; public static final String DEFAULT_ENCODING = "UTF-8"; @@ -62,7 +69,7 @@ public class OCRService { public static final String OCR_STRATEGY_OCR_ONLY = "OCR_ONLY"; public static final String OCR_STRATEGY_AUTO = "AUTO"; // default - private static Logger logger = Logger.getLogger(OCRService.class.getName()); + private static Logger logger = Logger.getLogger(TikaService.class.getName()); @Inject @ConfigProperty(name = ENV_OCR_SERVICE_ENDPOINT) @@ -122,11 +129,10 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String // validate OCR MODE.... if ("AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION".indexOf(ocrStategy) == -1) { - throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR, + throw new PluginException(TikaService.class.getSimpleName(), PLUGIN_ERROR, "Invalid TIKA_OCR_MODE - expected one of the following options: NO_OCR | OCR_ONLY | OCR_AND_TEXT_EXTRACTION"); } - // if the options did not already include the X-Tika-PDFOcrStrategy than we add // it now... boolean hasPDFOcrStrategy = options.stream() @@ -142,7 +148,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String logger.info("...... 
Tika Option = " + opt); } } - + long l = System.currentTimeMillis(); // List currentDmsList = DMSHandler.getDmsList(workitem); List files = workitem.getFileData(); @@ -173,7 +179,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String fileData.setAttribute(FILE_ATTRIBUTE_TEXT, list); } catch (IOException e) { - throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR, + throw new PluginException(TikaService.class.getSimpleName(), PLUGIN_ERROR, "Unable to scan attached document '" + fileData.getName() + "'", e); } } @@ -205,7 +211,7 @@ public String doORCProcessing(FileData fileData, List options) throws IO // read the Tika Service Enpoint if (!serviceEndpoint.isPresent() || serviceEndpoint.get().isEmpty()) { logger.severe( - "No OCR_SERVICE_ENDPOINT is missing - OCRprocessing not supported without a valid tika server endpoint!"); + "No OCR_SERVICE_ENDPOINT is missing - OCR processing not supported without a valid tika server endpoint!"); return null; } @@ -438,6 +444,4 @@ private String adaptContentType(FileData fileData) { return contentType; } - - } \ No newline at end of file diff --git a/imixs-archive-ocr/README.md b/imixs-archive-ocr/README.md deleted file mode 100644 index c22d77e9..00000000 --- a/imixs-archive-ocr/README.md +++ /dev/null @@ -1,99 +0,0 @@ -# Imixs-Archive-OCR - -*Imixs-Archive-OCR* provides a service component to extract textual information from documents attached to a Workitem. The text extraction is based on [Apache Tika](https://tika.apache.org/). To use this module a Tika Server endpoint must exist. -You can run a Tika Server with the [official Docker image](https://hub.docker.com/r/apache/tika): - - $ docker run -d -p 9998:9998 apache/tika:1.24.1-full - -Tika extracts text from over a thousand different file types including PDF and office documents and supports *Optical character recognition (OCR)* based on the [Tesseract project](https://github.com/tesseract-ocr/tesseract). 
The text content extracted by this service is stored in the $file attribute 'text' and can be used for further analysis, verifying or processing within a business process. The project is decoupled form the Imixs-Workflow Engine so that it can be used independently in other projects too. - - -### The OCRService - -The *OCRService* extracts the textual information from file attachments calling the Tika Rest API Service endpoint. The following environment variables are supported: - - * OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of the Tika server (mandetory). - * OCR\_STRATEGY - Which strategy to use for OCR (AUTO|NO_OCR|OCR_AND_TEXT_EXTRACTION|OCR_ONLY) - -With the optional environment variable OCR\_STRATEGY the behavior how text is extracted from a PDF file can be controlled: - -**AUTO** -
    -The best OCR strategy is chosen by the Tika Server itself. This is the default setting. - -**NO_OCR** -
    -OCR processing is disabled and text is extracted only from PDF files including a raw text. If a pdf file does not contain raw text data no text will be extracted! - -**OCR_ONLY** -
    -PDF files will always be OCR scanned even if the pdf file contains text data. - -**OCR_AND_TEXT_EXTRACTION** -
    -OCR processing and raw text extraction is performed. Note: This may result is a duplication of text and the mode is not recommended. - -### Tika Options - -Out of the box, Apache Tika will start with the default configuration. By providing additional config options - you can specify a custom tika configuration to be used by the tika server. - -For example to set the DPI mode call: - - @EJB - TikaDocumentService tikaDocumentService; - - // define options - List options=new ArrayList(); - options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION"); - options.add("X-Tika-PDFOcrImageType=RGB"); // support colors - options.add("X-Tika-PDFOcrDPI=72"); // set DPI - options.add("X-Tika-OCRLanguage=eng"); // set english language - // start ocr - tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options) - -**Note:** Options set by this method call overwrite the options defined in a tika config file. - -You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika). - - - https://cwiki.apache.org/confluence/display/TIKA/TikaServer - - https://cwiki.apache.org/confluence/display/TIKA/TikaOCR - - https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) - -#### Example - -In this example configuration the OCR processing will be started with 4 additional tika options. - - - X-Tika-PDFOcrImageType=RGB - set color mode - - X-Tika-PDFOcrDPI=72 - set DPI to 72 - - X-Tika-OCRLanguage=deu - set OCR language to german - - -#### Overriding the configured language as part of your request - -Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. 
An example of this is shown below: - - X-Tika-OCRLanguage=deu - -Or for multiple languages: - - X-Tika-OCRLanguage: eng+fra" - - -For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr). - - - -## How to Install - -To include the imixs-archive-ocr service the following maven dependency can be added: - - - - - org.imixs.workflow - imixs-archive-ocr - compile - - - \ No newline at end of file diff --git a/imixs-archive-ocr/pom.xml b/imixs-archive-ocr/pom.xml deleted file mode 100644 index 5cbf208b..00000000 --- a/imixs-archive-ocr/pom.xml +++ /dev/null @@ -1,21 +0,0 @@ - - 4.0.0 - - org.imixs.workflow - imixs-archive - 2.2.9-SNAPSHOT - - imixs-archive-ocr - - - - - - - org.imixs.workflow - imixs-workflow-core - - - - Imixs-Archive OCR - \ No newline at end of file diff --git a/pom.xml b/pom.xml index 1fb83c03..153bcd7a 100644 --- a/pom.xml +++ b/pom.xml @@ -10,7 +10,6 @@ imixs-archive-api imixs-archive-service imixs-archive-documents - imixs-archive-ocr imixs-archive-importer