Overview

ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

Example

$ wget -q 'https://sources.debian.org/data/main/o/ocropus/0.3.1-3/data/pages/alice_1.png'
$ gm convert -threshold 50% 'alice_1.png' 'alice.pbm'
$ cjb2 'alice.pbm' 'alice.djvu'
$ ocrodjvu --in-place 'alice.djvu'
Processing 'alice.djvu':
- Page #1
$ djvused -e print-txt 'alice.djvu'
(page 0 0 2488 3507
 (column 470 2922 1383 2978
  (para 470 2922 1383 2978
   (line 470 2922 1383 2978
    (word 470 2927 499 2976 "1")
    (word 588 2926 787 2978 "Down")
    (word 817 2925 927 2977 "the")
    (word 959 2922 1383 2976 "Rabbit-Hole"))))
 (column 451 707 2076 2856
  (para 463 2626 2076 2856
   (line 465 2803 2073 2856
    (word 465 2819 569 2856 "Alice")
    (word 592 2819 667 2841 "was")
    (word 690 2808 896 2854 "beginning")
⋮
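
The hidden-text listing printed by djvused is a nested S-expression: each zone carries its type, four bounding-box coordinates, and then either child zones or the recognized text. As a rough illustration of how such output could be post-processed, here is a minimal Python 3 sketch that extracts the words from it. The helper names (parse_sexpr, iter_words) are hypothetical and not part of ocrodjvu, and only the \" and \\ escapes are decoded; any other escape sequences in the text are left untouched.

```python
import re

# Tokens: "(" or ")", a double-quoted string with backslash escapes, or a bare atom.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse exactly one S-expression into nested lists of ints and strings."""
    stack = [[]]
    for token in TOKEN.findall(text):
        if token == "(":
            stack.append([])
        elif token == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        elif token.startswith('"'):
            # Simplification: decode only \" and \\ inside quoted text.
            stack[-1].append(re.sub(r'\\(["\\])', r"\1", token[1:-1]))
        elif re.fullmatch(r"-?\d+", token):
            stack[-1].append(int(token))
        else:
            stack[-1].append(token)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced S-expression")
    return stack[0][0]


def iter_words(zone):
    """Yield (text, bounding_box) for every word zone, in document order."""
    kind, bbox, children = zone[0], tuple(zone[1:5]), zone[5:]
    for child in children:
        if isinstance(child, list):
            yield from iter_words(child)
        elif kind == "word":
            yield child, bbox


# First column of the listing above, with the page zone closed by hand.
SAMPLE = """
(page 0 0 2488 3507
 (column 470 2922 1383 2978
  (para 470 2922 1383 2978
   (line 470 2922 1383 2978
    (word 470 2927 499 2976 "1")
    (word 588 2926 787 2978 "Down")
    (word 817 2925 927 2977 "the")
    (word 959 2922 1383 2976 "Rabbit-Hole")))))
"""

if __name__ == "__main__":
    print(" ".join(text for text, _ in iter_words(parse_sexpr(SAMPLE))))
    # prints: 1 Down the Rabbit-Hole
```

The parser rejects unbalanced input, so the truncated listing above would have to be fed in complete form, for example by piping the full djvused output into the script.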

Requisites

The following software is needed to run ocrodjvu:

- Python 2.7
- an OCR engine:
  - OCRopus 0.2, 0.3 or 0.3.1
  - Cuneiform ≥ 0.7
  - Ocrad ≥ 0.10
  - GOCR ≥ 0.40
  - Tesseract ≥ 2.00
- DjVuLibre ≥ 3.5.21
- python-djvulibre
- subprocess32
- lxml ≥ 2.0
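
A quick way to see whether the external programs are available is to look them up on PATH. The sketch below is only a rough pre-flight check written for Python 3. The executable names are assumptions about typical packaging rather than something this README specifies, OCRopus is left out because its executable name differs between versions, and neither version numbers nor the Python modules are checked.

```python
import shutil

# Assumed executable names (not stated in this README); adjust for your system.
OCR_ENGINES = ["tesseract", "cuneiform", "ocrad", "gocr"]
DJVU_TOOLS = ["djvused"]  # the DjVuLibre tool used in the Example section


def missing_requisites(which=shutil.which):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    # Any one OCR engine is enough.
    if not any(which(name) for name in OCR_ENGINES):
        problems.append("no OCR engine found (tried: %s)" % ", ".join(OCR_ENGINES))
    for name in DJVU_TOOLS:
        if which(name) is None:
            problems.append("%s not found (part of DjVuLibre)" % name)
    return problems


if __name__ == "__main__":
    for problem in missing_requisites():
        print(problem)
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real PATH.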

Additionally, some optional features require the following software:

- PyICU ≥ 1.0.1 (required for the --word-segmentation=uax29 option)
- html5lib (required for the --html5 option)

The following software is needed to rebuild the manual pages from source:

- xsltproc
- DocBook XSL stylesheets

Acknowledgment

ocrodjvu development was supported by the Polish Ministry of Science and Higher Education's grant no. N N519 384036 (2009–2012, https://bitbucket.org/jsbien/ndt).
