This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Error with french language with OCR #107

Closed
chrisz opened this issue Dec 2, 2012 · 3 comments

chrisz commented Dec 2, 2012

My OCR language is configured as "French" but, when I scan a document, I see a TesseractError on the console:

Extracting boxes ...
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/chris/tmp/paperwork/src/paperwork/frontend/workers.py", line 44, in __wrapper
    self.do(**kwargs)
  File "/home/chris/tmp/paperwork/src/paperwork/frontend/mainwindow.py", line 415, in do
    self.__scan_progress_cb)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/doc.py", line 257, in scan_single_page
    self.__add_img(img, ocrlang, resolution, scanner_calibration, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/doc.py", line 233, in __add_img
    scanner_calibration, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/page.py", line 373, in make
    (bmpfile, txt, boxes) = self.__ocr(outfiles, ocrlang, callback)
  File "/home/chris/tmp/paperwork/src/paperwork/backend/img/page.py", line 353, in __ocr
    lang=ocrlang, builder=pyocr.builders.WordBoxBuilder())
  File "/usr/lib/python2.7/site-packages/pyocr/tesseract.py", line 225, in image_to_string
    raise TesseractError(status, errors)
TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.\nFailed loading language 'eng'\nTesseract couldn't load any languages!\nCould not initialize tesseract.\n')

Paperwork tries to run OCR in English, even though I have configured French in the settings window. I don't have the English language data installed.

$ ls /usr/share/tessdata/
configs fra.traineddata tessconfigs

$ cat ~/.paperwork.conf
[Global]

[OCR]
lang = fra
ocrtime = 21.6917388439

I tried deactivating OCR and then re-enabling it with French, but the problem is still there.
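
For anyone hitting the same error, the two facts reported above (the configured language and the installed training data) can be cross-checked with a short script. This is only a sketch: `configured_language` and `installed_languages` are hypothetical helper names, not part of Paperwork or pyocr, and the paths are the ones quoted in this report.

```python
import configparser
from pathlib import Path


def configured_language(conf_text):
    """Return the [OCR] lang value from Paperwork-style config text, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("OCR", "lang", fallback=None)


def installed_languages(tessdata_dir):
    """Return the language codes that have a <code>.traineddata file."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    # Paths taken from this report; adjust them for your own system.
    conf_path = Path.home() / ".paperwork.conf"
    tessdata = Path("/usr/share/tessdata")
    lang = configured_language(conf_path.read_text()) if conf_path.exists() else None
    print("configured:", lang)
    print("installed:", installed_languages(tessdata))
```

With the files shown above, the configured language would be `fra` and the only installed language would also be `fra`, which is consistent with Tesseract failing when something asks it for `eng`.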

jflesch commented Dec 9, 2012

I can't reproduce this bug. Sorry.

Here is what I've tried:

  • Adding a print in pyocr to make sure Paperwork specifies 'fra' as the wanted language (pyocr/src/tesseract.py:run_tesseract()): Check.
  • 'mv /usr/local/share/tessdata/eng.traineddata /usr/local/share/tessdata/old.eng.traineddata' to make sure Tesseract never tries to use 'eng' when we specify 'fra': Check.
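
The first check can also be done without editing pyocr's source, by wrapping the OCR call so that the requested language is reported each time. The sketch below is an assumption-laden illustration: `log_lang` is a hypothetical helper, and it only sees `lang` when it is passed as a keyword argument (as in the traceback above, `lang=ocrlang`).

```python
import functools


def log_lang(func, log=print):
    """Wrap func so the 'lang' keyword argument is reported before each call.

    Only keyword usage is captured; a positional lang would show up as None.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log("OCR requested with lang=%r" % (kwargs.get("lang"),))
        return func(*args, **kwargs)
    return wrapper


if __name__ == "__main__":
    # Stand-in for the real OCR function, used here only for demonstration.
    def fake_image_to_string(image, lang=None, builder=None):
        return "text"

    wrapped = log_lang(fake_image_to_string)
    wrapped("page.bmp", lang="fra")
```

If the logged value is 'fra' but Tesseract still complains about 'eng', the wrong language is being chosen somewhere below the call rather than in Paperwork's settings.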

I advise you to check a few things on your side:

  • Make sure you're using Tesseract v3 and not Tesseract v2 (tesseract --version), preferably Tesseract v3.01.
  • Make sure you have the French training data installed. Otherwise, I assume Tesseract might fall back on the English data. Note, however, that Paperwork shouldn't display 'French' in the settings window if the French data isn't installed. If it does anyway, please file another bug report.
  • Make sure there are no warnings in Paperwork's verbose output. For instance, look for "Warning: Failed to figure out system language" (you shouldn't get it, since you've explicitly configured the language to use; if it does pop up, though, it would give us a good hint).
  • Make sure you don't have a 'paperwork.conf' (without the leading dot) lying around in the directory from which you run Paperwork. It would be used instead of your ~/.paperwork.conf.
  • Try what I tried above and see what you get.
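
The config-file check in the list above can be scripted. This is a minimal sketch of the precedence rule as described in this comment (a local 'paperwork.conf' shadows '~/.paperwork.conf'); `effective_config` is a hypothetical helper and does not reproduce Paperwork's actual lookup code.

```python
from pathlib import Path


def effective_config(cwd, home):
    """Return the config file that would win under the rule described above.

    Assumption: 'paperwork.conf' in the working directory takes precedence
    over '.paperwork.conf' in the home directory. Returns None if neither
    file exists.
    """
    local = Path(cwd) / "paperwork.conf"
    user = Path(home) / ".paperwork.conf"
    for candidate in (local, user):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(effective_config(Path.cwd(), Path.home()))
```

Run it from the directory where Paperwork is started; if it prints a path ending in 'paperwork.conf' without the leading dot, that file is the likely source of an unexpected language setting.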

jflesch commented Jan 25, 2013

Is there anything new regarding this issue? If not, I'm going to close it, since I can't reproduce it.

chrisz commented Jan 25, 2013

You can close it. I won't be able to run further tests for a long time.
