docker-tesseract

Tesseract in a Docker Container. The purpose of this project is to build the latest version of Tesseract OCR and make it available in a Docker Container. Tesseract OCR can be installed using a package management tool (like APT) instead, but this may not be the latest version and build of Tesseract OCR, which may contain additional desired features.

To mimimize the size of the Docker image, the project is built in 2 parts. Vagrant is used to install the dev libraries and tools needed to build Leptonica and Tesseract OCR. The built files are then zipped and used to create a Docker image. This eliminates the need to install the dev libraries and tools in the docker image.

##Stack

The project contains 2 parts that use the following:

Ubuntu Trust (14.04.x)
Virtualbox with Guest Additions (4.3.18)
Vagrant
Docker
libpng12-dev
libjepg62-dev
libtiff4-dev
automake
checkinstall
build-essential
libtool
libpng12-0
libjpeg62
libjpeg8
libtiff5

##Build

Install Virtualbox and Vagrant if needed
Copy vagrant/Vagrantfile.default to ./Vagrantfile
Note that the file name changes from Vagrantfile.default to Vagrantfile
Review Vagrantfile to make sure that the configuration is valid and make changes if needed. This is why a Vagrantfile isn't included in the project, and Vagrantfile.default is provided in a subfolder instead
To successfully build the project, the OS used for Vagrant and Docker needs to match
Run 'vagrant up' in the project root
This will also run 'vagrant/provision.sh' which is used to provision the Vagrant VM
Tesseract OCR requires Leptonica. Different versions of Tesseract require different versions of Leptonica. For example, Tesseract OCR v3.03 requires Leptonica 1.70+. This project currently builds Tesseract OCR v3.04 with Leptonica v1.71
To avoid errors during provisioning, the source files for Tesseract OCR and Leptioncia are stored in '.tar.gz' files in the 'vendor/source/' folder
Tesseract OCR requires traineddata files to successfully OCR images. A default set of traineddata files can be found in the 'vendor/tessdata/' folder. These files are copied to '/usr/local/share/tessdata/' once Tesseract OCR has been built successfully duringing provisioning
During provisioning, once Leptionica and Tesseract OCR have been successfully built and the traineddata files have been copied, 'vendor/provision.sh' will create a compressed archive file of '/usr/local' at 'vendor/build/usr-local.tar.gz'
Run 'sudo docker build .'
The image libraries (libpng12-0, libjpeg62, libjpeg8, and libtiff5) need to be installed for Tesseract OCR to work (found in Dockerfile)

##Updating Leptonica and Tesseract OCR Source Files and Adding Tesseract Traineddata

To update the Leptonica and Tesseract OCR source files, download the desired '.tar.gz' file and save to 'vendr/source/'. The 'provision.sh' file will also need to be updated, since the downloaded filenames may be different, and the folder the files are extracted to may be different.

Leptonica - http://www.leptonica.org/
Tesseract OCR - https://code.google.com/p/tesseract-ocr/

To add Tesseract OCR traineddata files, copy the traineddata files to 'vendor/tessdata/' and the traineddata files will be added during the build of the Docker image.

##Using the Container

Run 'sudo docker run -i -t -v <host absolute path>:<docker path> <image id>'
Run 'sudo docker pull onesysadmin/tesseract' or use a Dockerfile with 'FROM onesysadmin/tesseract'
Build the image as described above if changes to the default configuration provided are needed

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
vagrant		vagrant
vendor		vendor
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docker-tesseract

About

Releases

Packages

Contributors 2

Languages

License

onesysadmin/docker-tesseract

Folders and files

Latest commit

History

Repository files navigation

docker-tesseract

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages