Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides Server and Command Line editions suitable for use from other programming languages.
FROM butter/tika-server:1.14
ENTRYPOINT ["java", "-jar", "/usr/local/bin/tika.jar", "-h", "0.0.0.0", "-c", "/tika-config.xml"]
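
Since the Server edition is meant to be called from other languages, a minimal Python sketch of a client may help. It assumes the container's port 9998 (the Tika server default) is published on localhost, and it uses the server's `/tika` endpoint, which accepts a document via HTTP PUT and returns the extracted text. The names `TIKA_URL`, `build_extract_request`, and `extract_text` are hypothetical and chosen only for illustration.

```python
import urllib.request

# Assumption: the container was started with port 9998 published on localhost.
TIKA_URL = "http://localhost:9998"


def build_extract_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the server's /tika endpoint for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send a file's bytes to the server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a running container, calling `extract_text("report.pdf")` (the file name is just a placeholder) should return the document's text as a string. Building the request in a separate function keeps the HTTP details easy to inspect without a live server.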
View license information for Tika.