Tesseract thresholding eliminates text on bright background colors #1990
Comments
I wonder why Tesseract uses its own Otsu binarization rather than Leptonica's. You might want to test this image with Leptonica's binarization options. It's also possible that your issue is caused by a filter applied before or after the binarization.
I used GIMP to threshold the image and passed the resulting binary image to Tesseract. Output:
Related: #242 (comment)
Here's Tesseract's Otsu implementation: https://github.com/tesseract-ocr/tesseract/blob/57970443b42b/src/ccstruct/otsuthr.cpp
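To make the failure mode concrete, here is a minimal sketch of a generic global Otsu threshold in plain Python. It is not a port of Tesseract's otsuthr.cpp, and the gray levels and pixel counts are made-up values chosen only to mimic a mostly white page with a small highlighted region. Under those assumptions, the single global threshold lands between the highlight and the white background, so the highlight is classified as dark along with the text on it.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class; the rest form the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Hypothetical synthetic page (assumed values, not taken from the attached image):
# dark text = 40, highlight color = 110, white background = 250.
TEXT, HIGHLIGHT, WHITE = 40, 110, 250
pixels = [TEXT] * 50 + [HIGHLIGHT] * 150 + [WHITE] * 800

t = otsu_threshold(pixels)
# If the highlight level is at or below the threshold, it is binarized to black,
# which merges it with the text printed on top of it.
highlight_is_dark = HIGHLIGHT <= t
```

With a locally adaptive method, the threshold inside the highlighted region would instead be chosen from the text and highlight levels alone, which is why the comments in this thread point at other binarization options.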
@amitdo Could you show me how to use GIMP to change the image threshold? Is there a command for it? Thanks.
GIMP is a GUI tool. You might want to try ImageMagick instead, which is a command-line tool.
CC: @jbreiden
https://github.com/DanBloomberg/leptonica/blob/master/src/binarize.c
With my patch in #3418 and the eng.traineddata from best, I get:
I used
Fixed in #3418.
Environment
Current Behavior:
When Tesseract is run on the attached image, the text on highlighted backgrounds is missing from the output.
The thresholder blacks out the text (this is tessinput.tif):
Expected Behavior:
Thresholder should treat highlights as background so that Tesseract recognizes all of the text.