Error with french language with OCR #107

chrisz · 2012-12-02T18:11:48Z

My OCR language is configured as "French", but when I scan a document, I see a TesseractError on the console:

Extracting boxes ...
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/chris/tmp/paperwork/src/paperwork/frontend/workers.py", line 44, in __wrapper
    self.do(**kwargs)
  File "/home/chris/tmp/paperwork/src/paperwork/frontend/mainwindow.py", line 415, in do
    self.__scan_progress_cb)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/doc.py", line 257, in scan_single_page
    self.__add_img(img, ocrlang, resolution, scanner_calibration, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/doc.py", line 233, in __add_img
    scanner_calibration, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/page.py", line 373, in make
    (bmpfile, txt, boxes) = self.__ocr(outfiles, ocrlang, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/page.py", line 353, in __ocr
    lang=ocrlang, builder=pyocr.builders.WordBoxBuilder())
  File "/usr/lib/python2.7/site-packages/pyocr/tesseract.py", line 225, in image_to_string
    raise TesseractError(status, errors)
TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.\nFailed loading language 'eng'\nTesseract couldn't load any languages!\nCould not initialize tesseract.\n')

Paperwork tries to run OCR with the English language data ('eng'), even though I have configured French in the settings window. I don't have the English language data installed.

$ ls /usr/share/tessdata/
configs fra.traineddata tessconfigs

$ cat ~/.paperwork.conf
[Global]

[OCR]
lang = fra
ocrtime = 21.6917388439

I tried deactivating OCR and then re-enabling it with the French language, but the problem is still there.
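The listing and configuration above can be cross-checked with a short script. This is only a minimal sketch in Python 3, not part of Paperwork or pyocr: the helper names `read_ocr_lang` and `installed_languages` are hypothetical, and it assumes the file layout shown in this report (an `[OCR]` section with a `lang` key, and one `<code>.traineddata` file per installed language).

```python
import configparser
import os


def read_ocr_lang(conf_path):
    """Return the 'lang' value of the [OCR] section, or None if absent.

    Hypothetical helper: it only assumes the INI layout shown in this report.
    """
    parser = configparser.ConfigParser()
    parser.read(conf_path)  # a missing file is silently ignored
    return parser.get("OCR", "lang", fallback=None)


def installed_languages(tessdata_dir):
    """List the language codes that have a <code>.traineddata file."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

With the files from this report, `read_ocr_lang` on `~/.paperwork.conf` would give `fra`, and `installed_languages` on `/usr/share/tessdata` would list only `fra`. That would suggest the configuration and the installed data agree, and that 'eng' is being requested somewhere else.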

jflesch · 2012-12-09T14:31:24Z

I can't reproduce this bug. Sorry.

Here is what I've tried:

Adding a print in pyocr to make sure Paperwork specifies 'fra' as the wanted language (pyocr/src/tesseract.py:run_tesseract()): check.
Running 'mv /usr/local/share/tessdata/eng.traineddata /usr/local/share/tessdata/old.eng.traineddata' to make sure Tesseract never tries to use 'eng' when we specify 'fra': check.

I advise you to check a few things on your side:

Make sure you're using Tesseract v3 and not Tesseract v2 (tesseract --version), preferably Tesseract v3.01.
Make sure you have the French training data installed. Otherwise, I assume Tesseract might fall back on the English data. Note, however, that Paperwork shouldn't display 'French' in the settings window if the French data isn't installed. If it does anyway, please file another bug report.
Make sure you don't get any warnings in Paperwork's verbose output. For instance, look for "Warning: Failed to figure out system language" (you shouldn't get it, since you've explicitly configured the language to use; however, if it pops up, it would give us a good hint).
Make sure you don't have a 'paperwork.conf' (without the leading dot) lying around in the directory from which you run Paperwork. It would be used instead of your ~/.paperwork.conf.
Try what I tried above and see what you get.
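The configuration-file check in the list above can be sketched as follows. This is a hedged illustration in Python 3 of the precedence rule as stated in this comment, not Paperwork's actual lookup code; the function name `effective_conf` and its parameters are hypothetical.

```python
import os


def effective_conf(cwd, home):
    """Return the config path that wins under the rule described above.

    Assumption (taken from this comment, not from Paperwork's source):
    a 'paperwork.conf' in the working directory is used instead of
    '.paperwork.conf' in the home directory.
    """
    local = os.path.join(cwd, "paperwork.conf")
    if os.path.isfile(local):
        return local
    user = os.path.join(home, ".paperwork.conf")
    return user if os.path.isfile(user) else None
```

Calling `effective_conf(os.getcwd(), os.path.expanduser("~"))` from the directory where Paperwork is launched would show whether a stray local file shadows the per-user one.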

jflesch · 2013-01-25T10:14:34Z

Is there anything new regarding this issue? Otherwise, since I can't reproduce it, I'm going to close it.

chrisz · 2013-01-25T10:43:15Z

You can close it. I won't be able to do further tests for a long time.

jflesch closed this as completed Jan 25, 2013

Nadrazhul mentioned this issue Jan 14, 2018

Cannot select OCR language other than English #735

Open
