Apache Tika Server is a server edition of Apache Tika.
Apache Tika is a content detection and analysis framework. It lets users extract text from over a thousand different file types (such as PPT, XLS, and PDF) through a single interface. Tika is useful for search engine indexing, content analysis, translation, and much more.
Tika is a project of the Apache Software Foundation and was formerly a subproject of Apache Lucene.
wikipedia.org/wiki/Apache_Tika
$ docker run -d -p 9998:9998 --name some-tika kujira/tika
...via docker-compose
Example docker-compose.yml for kujira/tika:
version: '3.1'
services:
  tika:
    image: kujira/tika
    restart: always
    ports:
      - 9998:9998
Tika does not use a data store.
Tika exposes port 9998 in the container. Just add -p 9998:9998 to the docker run arguments and then access either http://localhost:9998 or http://host-ip:9998 in a browser.
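Once the port is published, the server can also be called from the command line. The following is a minimal sketch, assuming the image runs the standard Tika Server REST API (PUT /tika for text extraction, GET /version for the version string) and that curl is installed. TIKA_URL, the helper function names, and sample.pdf are hypothetical names chosen for illustration.

```shell
# Base URL of the running container; override TIKA_URL for a remote host.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Build the URL of a server endpoint, tolerating a trailing slash in TIKA_URL.
tika_endpoint() {
  printf '%s/%s\n' "${TIKA_URL%/}" "$1"
}

# Print the version string reported by the server.
tika_version() {
  curl -sS "$(tika_endpoint version)"
}

# Upload a local file with PUT and print the extracted plain text.
tika_extract_text() {
  curl -sS -T "$1" -H "Accept: text/plain" "$(tika_endpoint tika)"
}
```

With the container running, `tika_extract_text sample.pdf` should print the text of the local file, and `tika_version` is a quick way to confirm the server is reachable.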
When you start the kujira/tika image, you can adjust the configuration of the instance by passing one or more environment variables on the docker run command line.
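As a rough illustration of passing environment variables, the sketch below assembles a docker run command with one -e flag per setting and prints it instead of executing it. EXAMPLE_OPTION is a hypothetical placeholder, not a real variable of this image, and tika_run_cmd is a helper invented for this example. Substitute the actual variable names documented for the image.

```shell
# Print a docker run command with one -e flag per NAME=value argument.
# This is a dry run: nothing is started. Values containing spaces would
# need extra quoting before the printed command is executed.
tika_run_cmd() {
  cmd="docker run -d -p 9998:9998 --name some-tika"
  for pair in "$@"; do
    cmd="$cmd -e $pair"
  done
  printf '%s\n' "$cmd kujira/tika"
}

# EXAMPLE_OPTION is a placeholder name used only for illustration.
tika_run_cmd "EXAMPLE_OPTION=true"
```

Printing the command first makes it easy to review the flags before running it by hand.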
Set a fractional threshold used to decide whether a text file is CSV. Default is 0.5.
Set true to enable HTML script extraction. Default is false.
Set true to enable VBA macro extraction. Default is false.
Set true to enable deleted content extraction. Default is false.
Set true to enable moved content extraction. Default is false.
Set false to disable header and footer extraction. Default is true.
Set true to enable missing rows extraction. Default is false.
Set false to disable slide note extraction. Default is true.
Set false to disable slide master extraction. Default is true.
Set false to disable concatenation of phonetic runs (a.k.a. furigana) during extraction. Default is true. When true, 山田太郎 is extracted as 山田太郎ヤマダタロウ.
Set true to enable SAX-based docx and pptx extraction. Default is false.
Set the format of date strings. Default is yyyy-mm-dd.
You can find custom date format here.
Set the Tesseract OCR language model name. Default is eng. You can join several model names with the + character, like this: eng+deu+fra
Supported model names
deu (German)
eng (English)
fra (French)
ita (Italian)
jpn (Japanese)
jpn_vert (Japanese Vertical)
spa (Spanish)
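The + joining rule can be sketched as a small helper that checks each name against the supported model list above before building the value. This is only an illustration: join_ocr_langs and SUPPORTED_OCR_LANGS are hypothetical names, and the resulting string still has to be passed to the container through the image's OCR language variable.

```shell
# Model names taken from the supported list above.
SUPPORTED_OCR_LANGS="deu eng fra ita jpn jpn_vert spa"

# Join the given model names with "+", rejecting any name not in the list.
join_ocr_langs() {
  joined=""
  for lang in "$@"; do
    case " $SUPPORTED_OCR_LANGS " in
      *" $lang "*) ;;
      *) echo "unsupported model: $lang" >&2; return 1 ;;
    esac
    # Prefix with "+" only when something has already been added.
    joined="${joined:+$joined+}$lang"
  done
  printf '%s\n' "$joined"
}

join_ocr_langs eng deu fra
```

Validating up front avoids starting a container with a model name that the image does not ship.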
Set the timeout for Tesseract OCR execution. Default is 120.
Set true to automatically rotate images when needed. Default is false.
Set true to enable bookmark extraction. Default is false.
Set true to enable annotation extraction. Default is false.
Set an OCR strategy string. Default is no_ocr.
OCR strategy values
no_ocr (default)
ocr_only
ocr_and_text
auto