- * The text information is stored in the $file attribute 'text'.
+ * For the text extraction the service sends the content of a document to an
+ * instance of an Apache Tika server via the Rest API. The environment variable
+ * OCR_STRATEGY defines how PDF files will be scanned. Possible values are:
+ *
+ * - AUTO - The best OCR strategy is chosen by the Tika Server itself. This is
+ * the default setting.
+ * - NO_OCR - OCR processing is disabled and text is extracted only from PDF
+ * files that contain raw text. If a PDF file contains no raw text data, no
+ * text will be extracted!
+ * - OCR_ONLY - PDF files will always be OCR scanned even if the pdf file
+ * contains text data.
+ * - OCR_AND_TEXT_EXTRACTION - OCR processing and raw text extraction are
+ * performed. Note: This may result in duplicated text, so this mode is not
+ * recommended.
*
- * For PDF files with textual content the PDFBox api is used. In other cases,
- * the method sends the content via a Rest API to the tika server for OCR
- * processing. The environment variable OCR_PDF_MODE defines how PDF files will
- * be scanned. Possible values are TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR (default)
- *
- * For OCR processing the service expects a valid Rest API end-point defined by
- * the Environment Parameter 'TIKA_SERVICE_ENDPONT'. If the TIKA_SERVICE_ENDPONT
- * is not set, then the service will be skipped.
+ * The service expects a valid Rest API end-point of a Tika Server instance,
+ * defined by the environment parameter 'OCR_SERVICE_ENDPOINT'.
*
* The environment parameter 'TIKA_SERVICE_MODE' must be set to 'auto' to enable
* the service.
@@ -48,7 +55,7 @@
* @author rsoika
*/
@Stateless
-public class OCRService {
+public class TikaService {
public static final String FILE_ATTRIBUTE_TEXT = "text";
public static final String DEFAULT_ENCODING = "UTF-8";
@@ -62,7 +69,7 @@ public class OCRService {
public static final String OCR_STRATEGY_OCR_ONLY = "OCR_ONLY";
public static final String OCR_STRATEGY_AUTO = "AUTO"; // default
- private static Logger logger = Logger.getLogger(OCRService.class.getName());
+ private static Logger logger = Logger.getLogger(TikaService.class.getName());
@Inject
@ConfigProperty(name = ENV_OCR_SERVICE_ENDPOINT)
@@ -122,11 +129,10 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String
// validate OCR MODE....
if ("AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION".indexOf(ocrStategy) == -1) {
- throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR,
+ throw new PluginException(TikaService.class.getSimpleName(), PLUGIN_ERROR,
"Invalid TIKA_OCR_MODE - expected one of the following options: NO_OCR | OCR_ONLY | OCR_AND_TEXT_EXTRACTION");
}
-
// if the options did not already include the X-Tika-PDFOcrStrategy than we add
// it now...
boolean hasPDFOcrStrategy = options.stream()
@@ -142,7 +148,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String
logger.info("...... Tika Option = " + opt);
}
}
-
+
long l = System.currentTimeMillis();
// List currentDmsList = DMSHandler.getDmsList(workitem);
List files = workitem.getFileData();
@@ -173,7 +179,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String
fileData.setAttribute(FILE_ATTRIBUTE_TEXT, list);
} catch (IOException e) {
- throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR,
+ throw new PluginException(TikaService.class.getSimpleName(), PLUGIN_ERROR,
"Unable to scan attached document '" + fileData.getName() + "'", e);
}
}
@@ -205,7 +211,7 @@ public String doORCProcessing(FileData fileData, List options) throws IO
// read the Tika Service Enpoint
if (!serviceEndpoint.isPresent() || serviceEndpoint.get().isEmpty()) {
logger.severe(
- "No OCR_SERVICE_ENDPOINT is missing - OCRprocessing not supported without a valid tika server endpoint!");
+ "OCR_SERVICE_ENDPOINT is missing - OCR processing is not supported without a valid Tika server endpoint!");
return null;
}
@@ -438,6 +444,4 @@ private String adaptContentType(FileData fileData) {
return contentType;
}
-
-
}
\ No newline at end of file
diff --git a/imixs-archive-ocr/README.md b/imixs-archive-ocr/README.md
deleted file mode 100644
index c22d77e9..00000000
--- a/imixs-archive-ocr/README.md
+++ /dev/null
@@ -1,99 +0,0 @@
-# Imixs-Archive-OCR
-
-*Imixs-Archive-OCR* provides a service component to extract textual information from documents attached to a Workitem. The text extraction is based on [Apache Tika](https://tika.apache.org/). To use this module a Tika Server endpoint must exist.
-You can run a Tika Server with the [official Docker image](https://hub.docker.com/r/apache/tika):
-
- $ docker run -d -p 9998:9998 apache/tika:1.24.1-full
-
-Tika extracts text from over a thousand different file types including PDF and office documents and supports *Optical character recognition (OCR)* based on the [Tesseract project](https://github.com/tesseract-ocr/tesseract). The text content extracted by this service is stored in the $file attribute 'text' and can be used for further analysis, verifying or processing within a business process. The project is decoupled form the Imixs-Workflow Engine so that it can be used independently in other projects too.
-
-
-### The OCRService
-
-The *OCRService* extracts the textual information from file attachments calling the Tika Rest API Service endpoint. The following environment variables are supported:
-
- * OCR\_SERVICE\_ENDPOINT - defines the Rest API end-point of the Tika server (mandetory).
- * OCR\_STRATEGY - Which strategy to use for OCR (AUTO|NO_OCR|OCR_AND_TEXT_EXTRACTION|OCR_ONLY)
-
-With the optional environment variable OCR\_STRATEGY the behavior how text is extracted from a PDF file can be controlled:
-
-**AUTO**
-
-The best OCR strategy is chosen by the Tika Server itself. This is the default setting.
-
-**NO_OCR**
-
-OCR processing is disabled and text is extracted only from PDF files including a raw text. If a pdf file does not contain raw text data no text will be extracted!
-
-**OCR_ONLY**
-
-PDF files will always be OCR scanned even if the pdf file contains text data.
-
-**OCR_AND_TEXT_EXTRACTION**
-
-OCR processing and raw text extraction is performed. Note: This may result is a duplication of text and the mode is not recommended.
-
-### Tika Options
-
-Out of the box, Apache Tika will start with the default configuration. By providing additional config options
- you can specify a custom tika configuration to be used by the tika server.
-
-For example to set the DPI mode call:
-
- @EJB
- TikaDocumentService tikaDocumentService;
-
- // define options
- List options=new ArrayList();
- options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION");
- options.add("X-Tika-PDFOcrImageType=RGB"); // support colors
- options.add("X-Tika-PDFOcrDPI=72"); // set DPI
- options.add("X-Tika-OCRLanguage=eng"); // set english language
- // start ocr
- tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options)
-
-**Note:** Options set by this method call overwrite the options defined in a tika config file.
-
-You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika).
-
- - https://cwiki.apache.org/confluence/display/TIKA/TikaServer
- - https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
- - https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)
-
-#### Example
-
-In this example configuration the OCR processing will be started with 4 additional tika options.
-
- - X-Tika-PDFOcrImageType=RGB - set color mode
- - X-Tika-PDFOcrDPI=72 - set DPI to 72
- - X-Tika-OCRLanguage=deu - set OCR language to german
-
-
-#### Overriding the configured language as part of your request
-
-Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. An example of this is shown below:
-
- X-Tika-OCRLanguage=deu
-
-Or for multiple languages:
-
- X-Tika-OCRLanguage: eng+fra"
-
-
-For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).
-
-
-
-## How to Install
-
-To include the imixs-archive-ocr service the following maven dependency can be added:
-
-
-
-
- org.imixs.workflow
- imixs-archive-ocr
- compile
-
-
-
\ No newline at end of file
diff --git a/imixs-archive-ocr/pom.xml b/imixs-archive-ocr/pom.xml
deleted file mode 100644
index 5cbf208b..00000000
--- a/imixs-archive-ocr/pom.xml
+++ /dev/null
@@ -1,21 +0,0 @@
-
- 4.0.0
-
- org.imixs.workflow
- imixs-archive
- 2.2.9-SNAPSHOT
-
- imixs-archive-ocr
-
-
-
-
-
-
- org.imixs.workflow
- imixs-workflow-core
-
-
-
- Imixs-Archive OCR
-
\ No newline at end of file
diff --git a/pom.xml b/pom.xml
index 1fb83c03..153bcd7a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -10,7 +10,6 @@
imixs-archive-api
imixs-archive-service
imixs-archive-documents
- imixs-archive-ocr
imixs-archive-importer