LSTM: User patterns do not work #403

Closed
Shreeshrii opened this issue Aug 30, 2016 · 16 comments

@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2016

ref: https://groups.google.com/forum/#!msg/tesseract-ocr/S9CIK3jOMWw/vVBZULrJ9xcJ

I tried using the bazaar config for user patterns suggested in the above post (\A\A\d\d\d\A\A) with the latest Windows binary. It does not seem to work. Does the functionality work on Linux?

Input, output and config files are attached. I added a .txt extension to bazaar and eng.user-patterns in order to upload them here.
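
The attached files are not reproduced in this thread. As a rough sketch of what such a setup might look like, the Python snippet below writes a pattern file and a config file and builds (but does not run) a command line. The config keys follow the "bazaar" example in the tesseract manpage, and --user-patterns is the option mentioned later in this thread; the image name patterntest.png, the output base name, and the file locations are assumptions made for illustration only.

```python
from pathlib import Path

# Pattern quoted in the issue. As far as I understand the pattern syntax,
# \A stands for an upper-case letter and \d for a digit.
USER_PATTERNS = "\\A\\A\\d\\d\\d\\A\\A\n"

# Keys as in the manpage's "bazaar" example (assumed to match the attachment).
BAZAAR_CONFIG = (
    "load_system_dawg     F\n"
    "load_freq_dawg       F\n"
    "user_words_suffix    user-words\n"
    "user_patterns_suffix user-patterns\n"
)

def prepare(workdir, image="patterntest.png"):
    """Write both files into workdir and return a command line (not executed)."""
    workdir = Path(workdir)
    patterns = workdir / "eng.user-patterns"
    config = workdir / "bazaar"
    patterns.write_text(USER_PATTERNS)
    config.write_text(BAZAAR_CONFIG)
    # Hypothetical invocation: image, output base, language, patterns, config.
    return ["tesseract", image, "out", "-l", "eng",
            "--user-patterns", str(patterns), str(config)]
```

Calling prepare() on a scratch directory returns a list that could be handed to subprocess.run once the paths have been checked against your own installation.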

patterntest

OUTPUT

0011917
OX345PT
PT7895M
BA409QT
OMOOKM
WE4321M

OOLI9T7
OX345PT
PT789SM
BA409QT
OMOOKMI
WE432LM

OOLI9T7
OX345PT
PT7898M
BA409QT
OMOOKMI
WE432LM


patternbazaar.txt

bazaar.txt
eng.user-patterns.txt

@Shreeshrii
Collaborator Author

Some other reports of user-patterns and user-words not working

https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM

http://stackoverflow.com/questions/17209919/tesseract-user-patterns

Has anyone tried this? Does it work?

@theraysmith
Contributor

Question:
There are 2 ways these things could work:

  1. FORCE the output to match the provided pattern(s) and/or word(s). With this option, you can't get anything else out, whatever is in the image.
  2. Use the user-patterns and user-words as a hint. Other things could be output, if it thinks it is more likely. The hint can be made stronger, but there will always be inputs that produce something outside of the patterns supplied.

Which is it to be?
Can someone familiar with the above discussions please summarize for me? If the consensus is 1 above, then it could be made to happen; otherwise it might be possible to increase the strength of the hint.
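
The difference between the two options above can be illustrated with a toy example. This is only a sketch of the idea, not how Tesseract's recognizer actually scores hypotheses: the candidate strings come from the outputs in this issue, while the scores, the bonus values, and the regular expression standing in for the user pattern are all made up.

```python
import re

# Regex stand-in for the user pattern \A\A\d\d\d\A\A (ASCII only, an assumption).
PATTERN = re.compile(r"[A-Z]{2}\d{3}[A-Z]{2}")

def pick_forced(candidates):
    """Option 1: only candidates matching the pattern may be output."""
    allowed = {t: s for t, s in candidates.items() if PATTERN.fullmatch(t)}
    return max(allowed, key=allowed.get) if allowed else None

def pick_hinted(candidates, bonus):
    """Option 2: matching candidates get a bonus, the others still compete."""
    def score(text):
        return candidates[text] + (bonus if PATTERN.fullmatch(text) else 0.0)
    return max(candidates, key=score)

# Made-up recognizer scores for two readings of the same word.
candidates = {"PT789SM": 0.40, "PT7895M": 0.55}

forced = pick_forced(candidates)        # pattern match wins regardless of score
weak = pick_hinted(candidates, 0.10)    # hint too weak to change the ranking
strong = pick_hinted(candidates, 0.20)  # stronger hint flips the ranking
```

With a weak hint the non-matching reading can still win, which is the behaviour described under option 2; forcing removes it from consideration altogether.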

BTW, this could behave differently for base tesseract vs LSTM.

@amitdo
Collaborator

amitdo commented Dec 7, 2016

I can tell you that in the Tesseract forum many users ask about these files. They are disappointed that there is no effect on accuracy when using them with their input.

The input is usually not a document but something like a receipt, a passport, or a car license plate, with a small set of known words/patterns.

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 8, 2016 via email

@Shreeshrii
Collaborator Author

@theraysmith Ray, please also see
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/IUtQfIGZVdA/dm0-2n4DCQAJ

for a discussion about a user looking for an encrypted user-words list to use with Tesseract.

@quocpt

quocpt commented Jan 26, 2017

Handle the pattern in your own code. It is the best way and is easy to customize.

  1. Tesseract first has to process the whole image anyway. We cannot do anything about this.
  2. Then it applies the patterns in its own code (I assume this works badly). We bypass this step.
  3. Get the full result and handle it with regular expressions in your code. All the input is text or digits, so it will be fast; don't worry.

Hint: Test your regular expression against your OCR output on an online regular expression testing page. It will be a great help.

Hope you solve this
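
The post-processing approach described in this comment could look roughly like the following. It is a minimal sketch, assuming the target format is the one from the pattern in this issue (two upper-case letters, three digits, two upper-case letters); the look-alike tables and the function name coerce are hypothetical and would need tuning to the errors seen in real data.

```python
import re

# Assumed target format, taken from the pattern in the issue (\A\A\d\d\d\A\A).
PLATE = re.compile(r"[A-Z]{2}\d{3}[A-Z]{2}")

# Hypothetical look-alike tables; adjust them to your own OCR errors.
TO_DIGIT = {"O": "0", "I": "1", "L": "1", "S": "5", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}

def coerce(line):
    """Return the line forced into the plate format, or None if that fails."""
    text = line.strip().upper()
    if len(text) != 7:
        return None
    # Positions 2..4 must be digits, the rest letters: swap look-alikes there.
    fixed = "".join(
        TO_DIGIT.get(ch, ch) if 2 <= i <= 4 else TO_LETTER.get(ch, ch)
        for i, ch in enumerate(text)
    )
    return fixed if PLATE.fullmatch(fixed) else None

if __name__ == "__main__":
    # Sample lines taken from the OCR output posted in this issue.
    for raw in ["OX345PT", "PT7895M", "WE432LM", "OMOOKM"]:
        print(raw, "->", coerce(raw))
```

Lines that cannot be brought into the format return None, so they can be flagged for manual review instead of being silently accepted.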

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 13, 2017

@Shreeshrii
Collaborator Author

please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wnlJcF4zIvU/4cIt9f2iCgAJ

The user needs to recognize names of medications (rare words that are most likely not included in the training data).

@Shreeshrii changed the title from "User patterns using bazaar config do not work" to "LSTM: User patterns do not work" on Feb 22, 2017
@galharth

Any updates?

@msklvsk

msklvsk commented Nov 26, 2018

This is really needed. How can one fix this? Where to start?

@bertsky
Contributor

bertsky commented Mar 14, 2019

When working on a fix for char whitelisting, @Shreeshrii and I discussed how user words/patterns could be reactivated in Tesseract 4 with LSTM models, too. This prompted me to work on a solution – see #2324. (Please review!)

Here is the relevant discussion leading up to it:

Do these changes also fix #960 ?

It does not seem so. Results for your pattern example from #403 are still unaffected, regardless of whether I use a config file or the --user-patterns option. Running under strace confirms that Tesseract never attempts to open the pattern file; once the traineddata itself is loaded, it goes straight to the output file.

Looking into this, it appears that LSTMRecognizer::Load is responsible, and it does not call LoadDictionary unless its first option (lang) is non-null, which in turn Tesseract::init_tesseract_lang_data will not give unless lstm_use_matrix=1. But that does not help either!

Looking deeper, only Dict::LoadLSTM is in the current callgraph, but we would need Dict::Load to read the user words and user patterns, build a trie from them and add to the other dawgs. This can only come from Tesseract::init_tesseract_lm, and that from TessBaseAPI::InitLangMod, which has a nice disclaimer comment above:

//TODO(amit): Adapt to lstm

I now believe TessBaseAPI::InitLangMod / Tesseract::init_tesseract_lm are actually dead ends and should be removed. As to LSTMRecognizer::LoadDictionary, it would simply be a matter of replacing Dict::LoadLSTM with the old Dict::Load, but there is one missing link: The LSTMRecognizer never gets to see the runtime variables of the Tesseract instance, and CCUtil has no interface to set or initialize its params_ member.

Perhaps it would be best to pass tesseract_->params() to the constructor of LSTMRecognizer, and add a (delegating) constructor to both LSTMRecognizer and CCUtil which takes a ParamsVectors* .

Or is there some reason to keep the member params of Tesseract and LSTMRecognizer different? If so, which params besides user_patterns_file and user_words_file should be copied?

@theraysmith: Ray, can you reply to @bertsky? This is an important fix for Tesseract 4.x... (cc: @jbreiden)

@bertsky
Contributor

bertsky commented Mar 15, 2019

Ok, #2324 failed, but here comes #2328.

@KilianSillero

Is there some way to use user patterns in 4.0, or will we have to wait for a new version?

@bertsky
Contributor

bertsky commented Apr 15, 2019

@KilianSillero Yes, there is: just check out the recent master and you will have the user words and user patterns facilities (as documented in the manpage) at your disposal.

The above mentioned fix is not quite satisfactory yet, in that the effect might be small, but these are larger issues to be dealt with in general terms.

@bertsky
Contributor

bertsky commented Apr 15, 2019

@zdenop please close! (If users still have problems with beam narrowness or want to make patterns exclusive, those should be discussed as separate issues.)


9 participants