LSTM: User patterns do not work #403

Shreeshrii · 2016-08-30T12:42:22Z

ref: https://groups.google.com/forum/#!msg/tesseract-ocr/S9CIK3jOMWw/vVBZULrJ9xcJ

I tried using the bazaar config for user patterns suggested in the above post (\A\A\d\d\d\A\A) with the latest Windows binary. It does not seem to work. Does the functionality work on Linux?

Input, output and config files are attached. I added a .txt extension to bazaar and eng.user-patterns in order to upload them here.

OUTPUT

0011917
OX345PT
PT7895M
BA409QT
OMOOKM
WE4321M

OOLI9T7
OX345PT
PT789SM
BA409QT
OMOOKMI
WE432LM

OOLI9T7
OX345PT
PT7898M
BA409QT
OMOOKMI
WE432LM

patternbazaar.txt

bazaar.txt
eng.user-patterns.txt


Shreeshrii · 2016-09-08T06:15:35Z

Some other reports of user-patterns and user-words not working

https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM

http://stackoverflow.com/questions/17209919/tesseract-user-patterns

Has anyone tried this? Does it work?

theraysmith · 2016-12-07T21:31:11Z

Question:
There are 2 ways these things could work:

1. FORCE the output to match the provided pattern(s) and/or word(s). With this option, you can't get anything else out, whatever is in the image.
2. Use the user-patterns and user-words as a hint. Other things could be output, if it thinks it is more likely. The hint can be made stronger, but there will always be inputs that produce something outside of the patterns supplied.

Which is it to be?
Can someone familiar with the above discussions please summarize for me? If the consensus is 1 above, then it could be made to happen, or else it might be possible to increase the strength of the hint.

BTW, this could behave differently for base tesseract vs LSTM.
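
To make the difference between the two options concrete, here is a toy sketch in Python. It is not how Tesseract works internally: the candidate strings and scores are made up, the regex is only an assumed reading of the pattern \A\A\d\d\d\A\A from the first post, and `pick_forced` / `pick_hinted` are hypothetical helper names.

```python
import re

# Hypothetical recognizer candidates for one word, with made-up scores.
CANDIDATES = [("PT7898M", 0.48), ("PT789SM", 0.45), ("PT7B9SM", 0.07)]

# Assumed regex reading of the user pattern \A\A\d\d\d\A\A.
PATTERN = re.compile(r"[A-Z]{2}\d{3}[A-Z]{2}")


def pick_forced(candidates, pattern):
    """Option 1: only candidates that match the pattern may be output."""
    allowed = [c for c in candidates if pattern.fullmatch(c[0])]
    return max(allowed, key=lambda c: c[1])[0] if allowed else None


def pick_hinted(candidates, pattern, bonus):
    """Option 2: matching candidates get a score bonus but can still lose."""
    def boosted(c):
        return c[1] + (bonus if pattern.fullmatch(c[0]) else 0.0)
    return max(candidates, key=boosted)[0]
```

With these made-up numbers, forcing always returns the matching candidate (or nothing at all if no candidate matches), while the hint only wins once the bonus exceeds the score gap between the top candidate and the best matching one.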

amitdo · 2016-12-07T22:13:54Z

I can tell you that in the Tesseract forum many users ask about these files. They are disappointed that there is no effect on accuracy when using them with their input.

The input is usually not a document but something like receipt, passport, car license plate, with a small set of known words/patterns.

Shreeshrii · 2016-12-08T03:29:50Z

In addition to the cases mentioned by Amit, there are users who would like to use the user_words dictionary in addition to Tesseract's wordlist. Some examples of user words could be client names or industry-specific terminology, e.g. medical or pharmaceutical. Is it possible to allow for both kinds of scenarios, based on some config/variable?

Shreeshrii · 2016-12-08T14:58:42Z

@theraysmith Ray, please also see
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/IUtQfIGZVdA/dm0-2n4DCQAJ

for a discussion regarding a user looking for an encrypted user words list to use with Tesseract.

quocpt · 2017-01-26T03:33:56Z

Handle the pattern in your own code. It is the best way and is easy to customize.

Tesseract has to process the whole image first anyway; we cannot do anything about that.
It then applies the patterns in its own code (which I assume works poorly). We bypass this step:
get the full result and handle it with regular expressions in code. All input is text or digits, so it will be fast, don't worry.

Hint: test your regular expression against your OCR output on an online regular expression testing page. It will be a great help.

Hope you solve this
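
A minimal sketch of this post-processing approach, in Python. The regex is an assumed equivalent of the user pattern \A\A\d\d\d\A\A from the first post, `split_by_pattern` is a hypothetical helper name, and the sample text is the first OUTPUT block quoted above.

```python
import re

# Assumed regex equivalent of the user pattern \A\A\d\d\d\A\A.
PLATE = re.compile(r"[A-Z]{2}\d{3}[A-Z]{2}")


def split_by_pattern(ocr_text):
    """Return (matching, rejected) lists of the non-empty OCR lines."""
    matching, rejected = [], []
    for line in ocr_text.splitlines():
        word = line.strip()
        if not word:
            continue
        (matching if PLATE.fullmatch(word) else rejected).append(word)
    return matching, rejected


sample = "0011917\nOX345PT\nPT7895M\nBA409QT\nOMOOKM\nWE4321M\n"
matching, rejected = split_by_pattern(sample)
print(matching)  # lines that already fit the pattern
print(rejected)  # lines that need review or correction
```

This only flags lines that do not fit the pattern; it cannot recover the correct characters, which is why it does not replace pattern support inside the recognizer.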

Shreeshrii · 2017-02-13T07:54:43Z

@theraysmith

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/p80qyGvVvP4/Rd1hlof3CAAJ

regarding "recognize only from user word list".

Shreeshrii · 2017-02-18T18:14:42Z

please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wnlJcF4zIvU/4cIt9f2iCgAJ

The user needs to recognize names of medications (rare words that are most likely not included in the training data).

Shreeshrii · 2017-02-22T03:16:47Z

Also see: https://groups.google.com/d/msgid/tesseract-ocr/ab28b50f-d592-4f48-b813-c03451c4dbb0%40googlegroups.com?utm_medium=email&utm_source=footer

galharth · 2017-12-30T12:21:18Z

Any updates?

msklvsk · 2018-11-26T22:26:27Z

This is really needed. How can one fix this? Where to start?

bertsky · 2019-03-14T22:14:58Z

When working on a fix for char whitelisting, @Shreeshrii and I discussed how user words/patterns could be reactivated in Tesseract 4 with LSTM models, too. This prompted me to work on a solution – see #2324. (Please review!)

Here is the relevant discussion leading up to it:

Do these changes also fix #960 ?

It does not seem so. Results for your pattern example from #403 are still unaffected, regardless of whether I use a config file or the --user-patterns option. Stracing confirms that Tesseract never attempts to open the pattern file; it goes straight to the output file once the traineddata itself is loaded.

Looking into this, it appears that LSTMRecognizer::Load is responsible, and it does not call LoadDictionary unless its first option (lang) is non-null, which in turn Tesseract::init_tesseract_lang_data will not give unless lstm_use_matrix=1. But that does not help either!

Looking deeper, only Dict::LoadLSTM is in the current callgraph, but we would need Dict::Load to read the user words and user patterns, build a trie from them and add to the other dawgs. This can only come from Tesseract::init_tesseract_lm, and that from TessBaseAPI::InitLangMod, which has a nice disclaimer comment above:

//TODO(amit): Adapt to lstm

I now believe TessBaseAPI::InitLangMod / Tesseract::init_tesseract_lm are actually dead ends and should be removed. As to LSTMRecognizer::LoadDictionary, it would simply be a matter of replacing Dict::LoadLSTM with the old Dict::Load, but there is one missing link: The LSTMRecognizer never gets to see the runtime variables of the Tesseract instance, and CCUtil has no interface to set or initialize its params_ member.

Perhaps it would be best to pass tesseract_->params() to the constructor of LSTMRecognizer, and add a (delegating) constructor to both LSTMRecognizer and CCUtil which takes a ParamsVectors* .

Or is there some reason to keep the member params of Tesseract and LSTMRecognizer different? If so, which params besides user_patterns_file and user_words_file should be copied?

@theraysmith : Ray, can you reply to @bertsky ? This is an important fix for Tesseract 4.x... (cc: @jbreiden )

bertsky · 2019-03-15T15:19:36Z

Ok, #2324 failed, but here comes #2328.

KilianSillero · 2019-04-12T06:12:46Z

Is there some way to use user patterns in 4.0, or will we have to wait for a new version?

bertsky · 2019-04-15T09:58:32Z

@KilianSillero Yes there is, just check out the recent master and you will have the user words and user patterns facilities (as documented in the manpage) at your disposal.

The above mentioned fix is not quite satisfactory yet, in that the effect might be small, but these are larger issues to be dealt with in general terms.
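
For reference, a hedged sketch of how one might call such a build from Python. It assumes a `tesseract` binary built from a version that includes the fix and relies on the `--user-patterns` and `--user-words` command-line options; `build_command` and `run_ocr` are hypothetical helper names and `input.png` is a placeholder file name.

```python
import subprocess


def build_command(image, output_base, patterns_file=None, words_file=None):
    """Assemble a tesseract CLI call with optional user patterns/words files."""
    cmd = ["tesseract", image, output_base]
    if patterns_file:
        cmd += ["--user-patterns", patterns_file]
    if words_file:
        cmd += ["--user-words", words_file]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run tesseract; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(image, output_base, **kwargs), check=True)
```

For example, `run_ocr("input.png", "out", patterns_file="eng.user-patterns")` would write the recognized text to `out.txt`. As noted above, the effect of the patterns on the result might be small.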

bertsky · 2019-04-15T10:08:58Z

@zdenop please close! (If users still have problems with beam narrowness or want to make patterns exclusive, those should be discussed as separate issues.)

zdenop assigned theraysmith Dec 7, 2016

Shreeshrii mentioned this issue Jan 13, 2017

How to get the unicharset back out from the lstm? #653

Closed

Shreeshrii changed the title ~~User patterns using bazaar config do not work~~ LSTM: User patterns do not work Feb 22, 2017

wosiu mentioned this issue May 30, 2017

user pattern/dict does not work at all #960

Closed

This was referenced Apr 30, 2018

user_words_suffix not working #1538

Closed

RFC: Tesseract 4.0.0 – open tasks #1423

Closed

drothlis mentioned this issue Nov 29, 2018

Ubuntu 18.04 support: Tesseract 4, libcec stb-tester/stb-tester#536

Merged

11 tasks

dmypstl mentioned this issue Feb 9, 2019

Turning on legacy OCR engine mode ropensci/tesseract#39

Closed

bertsky mentioned this issue Mar 11, 2019

trying to add tessedit_char_whitelist etc. again: #2294

Merged

This was referenced Mar 14, 2019

trying to add user words/patterns again: #2324

Closed

trying to add user words/patterns again: #2328

Merged

KilianSillero mentioned this issue Mar 25, 2019

Is userPatterns working? thiagoalessio/tesseract-ocr-for-php#158

Closed

zdenop closed this as completed Apr 15, 2019

bozhodimitrov mentioned this issue Jun 28, 2019

Cant make user-words work madmaze/pytesseract#206

Closed
