LSTM: User patterns do not work #403
Comments
Some other reports of user-patterns and user-words not working:
https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM
http://stackoverflow.com/questions/17209919/tesseract-user-patterns
Has anyone tried this? Does it work? |
Question:
BTW, this could behave differently for base tesseract vs LSTM. |
I can tell you that in the Tesseract forum many users ask about these files. They are disappointed that using them has no effect on accuracy for their input. The input is usually not a document but something like a receipt, passport, or car license plate, with a small set of known words/patterns. |
In addition to the cases mentioned by Amit, there are users who would like to use the user_words dictionary in addition to Tesseract's wordlist. Some examples of user words could be client names or industry-specific terminology, e.g. medical or pharmaceutical.
Is it possible to allow for both kinds of scenarios, based on some config/variable?
|
@theraysmith Ray, please also see for discussion regarding a user looking for encrypted user words list to use with tesseract. |
Handle the pattern in code. It is the best way and is easy to customize.
Hint: take your OCR output and check it against your regular expression on an online regular expression testing page. It will be a great help. Hope you solve this. |
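A minimal sketch of that post-processing approach in Python, assuming the target is the pattern from the original report (`\A\A\d\d\d\A\A`, read here as two upper-case letters, three digits, two upper-case letters). The regex translation, the name `extract_codes`, and the sample string are illustrative assumptions, not part of Tesseract:

```python
import re
from typing import List

# Assumed regex counterpart of the user pattern \A\A\d\d\d\A\A:
# two upper-case letters, three digits, two upper-case letters.
CODE_RE = re.compile(r"\b[A-Z]{2}[0-9]{3}[A-Z]{2}\b")


def extract_codes(ocr_text: str) -> List[str]:
    """Return every token in the OCR output that matches the expected pattern."""
    return CODE_RE.findall(ocr_text)
```

This only filters what Tesseract has already recognized; it does not steer recognition the way a working user-patterns file would.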
Regarding "recognize only from user word list": |
I need to recognize medication names (rare words that are most likely not included in the training data). |
Any updates? |
This is really needed. How can one fix this? Where to start? |
When working on a fix for char whitelisting, @Shreeshrii and I discussed how user words/patterns could be reactivated in Tesseract 4 with LSTM models, too. This prompted me to work on a solution – see #2324. (Please review!) Here is the relevant discussion leading up to it:
|
Is there some way to use user patterns in 4.0, or will we have to wait for a new version? |
@KilianSillero Yes, there is: just check out the recent master and you will have the user words and user patterns facilities (as documented in the manpage) at your disposal. The above-mentioned fix is not quite satisfactory yet, in that the effect might be small, but these are larger issues to be dealt with in general terms. |
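For readers who want to try this from a script, here is a hedged sketch that assembles a command line using the `--user-patterns` and `--user-words` options described in the manpage. The helper name `build_tesseract_command` and the file names are hypothetical, and the snippet only builds the argument list; it does not check that the installed Tesseract build actually honours the files:

```python
from typing import List, Optional


def build_tesseract_command(
    image: str,
    output_base: str,
    user_patterns: Optional[str] = None,
    user_words: Optional[str] = None,
    lang: str = "eng",
) -> List[str]:
    """Assemble an argument list for the tesseract CLI (hypothetical helper)."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if user_patterns:
        # Path to a user patterns file, e.g. one containing \A\A\d\d\d\A\A
        cmd += ["--user-patterns", user_patterns]
    if user_words:
        # Path to a user words file, one word per line
        cmd += ["--user-words", user_words]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where a recent Tesseract build is on the PATH.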
@zdenop please close! (If users still have problems with beam narrowness or want to make patterns exclusive, those should be discussed as separate issues.) |
ref: https://groups.google.com/forum/#!msg/tesseract-ocr/S9CIK3jOMWw/vVBZULrJ9xcJ
I tried using the bazaar config for user patterns suggested in the above post (\A\A\d\d\d\A\A) with the latest Windows binary. It does not seem to work. Does the functionality work on Linux?
Input, output and config files are attached. I added a .txt extension to bazaar and eng.user-patterns in order to upload them here.
OUTPUT
patternbazaar.txt
bazaar.txt
eng.user-patterns.txt