docu
rsoika committed Nov 6, 2020
1 parent c545fc8 commit d339c8c
Showing 3 changed files with 29 additions and 9 deletions.
23 changes: 21 additions & 2 deletions imixs-archive-documents/README.md
@@ -45,9 +45,28 @@ Both, the *OCRDocumentPlugin* as also the *OCRDocumentAdapter* can be configured
<!-- Tika Options -->
<tika name="options">X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION</tika>
<tika name="options">X-Tika-PDFOcrImageType=RGB</tika>
<tika name="options">X-Tika-PDFOcrDPI=400</tika>
<tika name="options">X-Tika-PDFOcrDPI=72</tika>
<tika name="options">X-Tika-OCRLanguage=eng+deu</tika>

In this example configuration the OCR processing will be started with 3 additional tika options. For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).
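In this example configuration the OCR processing will be started with 4 additional Tika options:

- X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION - run OCR in addition to the regular text extraction
- X-Tika-PDFOcrImageType=RGB - set the color mode
- X-Tika-PDFOcrDPI=72 - set the DPI to 72
- X-Tika-OCRLanguage=eng+deu - set the OCR languages to English and German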


#### Overriding the configured language as part of your request

Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. An example of this is shown below:

X-Tika-OCRLanguage=deu

Or for multiple languages:

X-Tika-OCRLanguage=eng+fra
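
As a rough sketch of what a per-request language override could look like when a Tika server is called directly over HTTP, the following Java snippet builds (but does not send) a PUT request that carries the `X-Tika-OCRLanguage` header. The class name `TikaLanguageRequest`, the helper `buildRequest` and the endpoint `http://localhost:9998/tika` are assumptions made for illustration; they are not part of Imixs-Archive.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TikaLanguageRequest {

    // Hypothetical helper: builds a PUT request for a Tika server and sets the
    // OCR language for this single request via the X-Tika-OCRLanguage header.
    public static HttpRequest buildRequest(String tikaEndpoint, byte[] content, String languages) {
        return HttpRequest.newBuilder(URI.create(tikaEndpoint))
                .header("X-Tika-OCRLanguage", languages) // e.g. "deu" or "eng+fra"
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }

    public static void main(String[] args) {
        // Assumed endpoint of a locally running Tika server; adjust to your setup.
        HttpRequest request = buildRequest("http://localhost:9998/tika", new byte[0], "eng+fra");
        // Print the header value that would be sent with the request.
        System.out.println(request.headers().firstValue("X-Tika-OCRLanguage").orElse("(not set)"));
    }
}
```

Within Imixs-Archive the same value is passed as a `name=value` option string rather than as a raw HTTP header.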


For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).


## Searching Documents
13 changes: 7 additions & 6 deletions imixs-archive-ocr/README.md
@@ -1,7 +1,6 @@
# Imixs-Archive-OCR

*Imixs-Archive-OCR* is a sub-project of Imixs-Archive. The project provides methods to extract textual information from documents
attached to a Workitem. The text content of attachments is either extracted by the PDFBox API or by optical character recognition (OCR). This text content is stored in the $file attribute 'text' and can be use for further processing or to search for document content.
*Imixs-Archive-OCR* is a sub-project of Imixs-Archive. The project is decoupled from the Imixs-Workflow Engine and provides a service component to extract textual information from documents attached to a Workitem. The text content of attachments is either extracted by the PDFBox API or by optical character recognition (OCR). This text content is stored in the $file attribute 'text' and can be used for further processing or to search for document content.


## OCR
@@ -43,17 +42,19 @@ For example to set the DPI mode call:
// define options
List<String> options=new ArrayList<String>();
options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION");
options.add("X-Tika-PDFOcrImageType=RGB");
options.add("X-Tika-PDFOcrDPI=400");

options.add("X-Tika-PDFOcrImageType=RGB"); // support colors
options.add("X-Tika-PDFOcrDPI=72"); // set DPI
options.add("X-Tika-OCRLanguage=eng"); // set english language
// start ocr
tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options);

**Note:** Options set by this method call overwrite the options defined in a tika config file.
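
The precedence described in the note can be illustrated with a small, self-contained sketch. It does not reproduce how Imixs-Archive or Tika resolve options internally; the class `TikaOptionMerge` and its `merge` method are hypothetical and only show the idea that an option passed with the call replaces a configured option of the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TikaOptionMerge {

    // Hypothetical helper: merges two lists of "name=value" options.
    // Entries from callOptions replace entries from configOptions with the same name.
    public static List<String> merge(List<String> configOptions, List<String> callOptions) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<String> source : List.of(configOptions, callOptions)) {
            for (String option : source) {
                int pos = option.indexOf('=');
                if (pos <= 0) {
                    continue; // skip entries that are not of the form name=value
                }
                merged.put(option.substring(0, pos).trim(), option.substring(pos + 1).trim());
            }
        }
        List<String> result = new ArrayList<>();
        merged.forEach((name, value) -> result.add(name + "=" + value));
        return result;
    }

    public static void main(String[] args) {
        List<String> config = List.of("X-Tika-PDFOcrDPI=400", "X-Tika-OCRLanguage=eng+deu");
        List<String> call = List.of("X-Tika-PDFOcrDPI=72");
        // prints [X-Tika-PDFOcrDPI=72, X-Tika-OCRLanguage=eng+deu]
        System.out.println(merge(config, call));
    }
}
```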

You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika).


- https://cwiki.apache.org/confluence/display/TIKA/TikaServer
- https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
- https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)


## How to Install
@@ -122,7 +122,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String
// validate OCR MODE....
if ("TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR".indexOf(pdfMode) == -1) {
throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR,
"Invalid TIKA_OCR_MODE - exprected one of the following options: TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR");
"Invalid TIKA_OCR_MODE - expected one of the following options: TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR");
}

long l = System.currentTimeMillis();
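
A side note on the mode check in this hunk: because it uses `indexOf` on a single string, any substring of that string (for example `OCR`) passes the validation. A minimal sketch of a stricter, exact-match check could look like the following. The class `OcrModeCheck` and the method `isValidMode` are hypothetical names and not part of this commit.

```java
import java.util.Set;

public class OcrModeCheck {

    // The three modes named in the error message of the hunk above.
    private static final Set<String> VALID_MODES = Set.of("TEXT_ONLY", "OCR_ONLY", "TEXT_AND_OCR");

    // Hypothetical helper: accepts only an exact match with one of the modes.
    public static boolean isValidMode(String pdfMode) {
        // Set.of(...) rejects null lookups, so guard against null first.
        return pdfMode != null && VALID_MODES.contains(pdfMode);
    }

    public static void main(String[] args) {
        System.out.println(isValidMode("TEXT_AND_OCR")); // exact match
        System.out.println(isValidMode("OCR"));          // substring only
    }
}
```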