different accuracy between paper and competition website #13

Open
bliu3650 opened this issue May 24, 2019 · 19 comments

Comments

@bliu3650

Hi Author,
Great work, and thanks for sharing.
I noticed that the accuracy of your best model on IC13 is 93.6% in the paper, while it is 95.98% on the Robust Reading Competition website.
Could you please explain this difference?
Thanks.

@bliu3650 bliu3650 changed the title difference accuracy between paper and competition website different accuracy between paper and competition website May 24, 2019
@ku21fan
Contributor

ku21fan commented May 24, 2019

Good question!
I have been waiting for this question.
(So I will keep this issue open for people who come from the ICDAR website.)

Three major points explain the difference in accuracy between our paper and the ICDAR challenge.

  1. In our paper, we used only the MJSynth and SynthText images whose labels are alphanumeric.
    For the ICDAR challenge, the training and evaluation datasets are different.
    The ICDAR challenge evaluation set contains special characters such as '!' and '?',
    but the training data in our paper does not.
    To cover special characters, we generated more synthetic data and used it for training.
    We also knew that real data improves accuracy, so we used additional real data for the ICDAR challenge.

  2. We tuned hyperparameters (e.g., the channel size of the feature extractor and the hidden size of the LSTM).
    We used a bigger model for the ICDAR challenge.

  3. We used the Adam optimizer instead of Adadelta.
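
A minimal sketch of the alphanumeric filtering described in point 1, assuming each training sample is an (image path, label) pair. The helper names and the sample data are hypothetical and are not taken from the repository.

```python
import re

# Matches any character outside 0-9, a-z, A-Z.
NON_ALNUM = re.compile(r"[^0-9a-zA-Z]")


def is_alphanumeric_label(label: str) -> bool:
    """Return True when the label is non-empty and purely alphanumeric."""
    return bool(label) and NON_ALNUM.search(label) is None


def filter_samples(samples):
    """Keep only (image_path, label) pairs with alphanumeric labels."""
    return [(path, label) for path, label in samples if is_alphanumeric_label(label)]


# Hypothetical samples: labels with '!' or '?' are dropped in the paper setting.
samples = [("a.png", "Hello"), ("b.png", "Sale!"), ("c.png", "Why?"), ("d.png", "2019")]
kept = filter_samples(samples)
print(kept)
```

With a filter like this, labels such as "Sale!" never reach training, which is why a model trained this way cannot predict the special characters present in the challenge evaluation set.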

Best,
Baek.

@bliu3650
Author

@ku21fan Understood. Thanks for the clarification.
May we know the amount of extra generated/real data, and also the size of the bigger model used for the ICDAR challenge? Thanks.

@ku21fan
Contributor

ku21fan commented May 28, 2019

@brianliu3650
Yes, we used about 10M extra generated samples and about 200K real samples.

[Bigger model configuration]
channel size of feature extraction: 1024
hidden size of BiLSTM: 1024 (or 512 would be enough)
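
As a rough illustration, the bigger configuration could be expressed as command-line options like the ones below. The flag names `--output_channel`, `--hidden_size`, and `--adam`, and the default values, are assumptions modeled on a typical training script. Check them against the actual code before relying on them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative option parser; names and defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Illustrative training options")
    parser.add_argument("--output_channel", type=int, default=512,
                        help="channel size of the feature extractor")
    parser.add_argument("--hidden_size", type=int, default=256,
                        help="hidden size of the BiLSTM")
    parser.add_argument("--adam", action="store_true",
                        help="use Adam instead of Adadelta")
    return parser


# The challenge setting from this thread: 1024 channels, 1024 hidden units, Adam.
challenge_opt = build_parser().parse_args(
    ["--output_channel", "1024", "--hidden_size", "1024", "--adam"]
)
print(challenge_opt)
```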

P.S. We used a different character set from our paper (--sensitive mode for the ICDAR challenge), so we needed more training data.

opt.character = string.printable[:-6] # same as the ASTER setting (94 characters).

--sensitive mode results in

opt.character = 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
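
The 94-character count can be checked with the standard library alone. In the sketch below, the 36-character case-insensitive set is an assumption about the paper setting and is used only for comparison.

```python
import string

# string.printable is digits + letters + punctuation + 6 whitespace characters,
# so dropping the last 6 leaves the visible ASCII characters.
sensitive_charset = string.printable[:-6]

# Assumed case-insensitive alphanumeric set, for comparison only.
insensitive_charset = string.digits + string.ascii_lowercase

# Characters that exist only in the --sensitive setting.
extra_chars = [c for c in sensitive_charset if c not in insensitive_charset]

print(len(sensitive_charset), len(insensitive_charset), len(extra_chars))
```

The extra characters are the 26 uppercase letters plus the 32 punctuation marks, which is consistent with the remark that the larger character set needed more training data.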

Best

@tjdevWorks
Contributor

@ku21fan
You have done awesome work on both the text detection and recognition sides, and I loved the demo you host online.
Could you please share some tips on synthetic data generation, such as the kinds of noise you try to simulate and your thought process behind this task?
Thanks.

@ku21fan
Contributor

ku21fan commented Jun 22, 2019

@tjdevWorks Thank you for your interest in our work.
Actually, I did only the recognition side :D
The detection side was done by Youngmin et al.

We referred to MJSynth and SynthText, and used their code to generate synthetic data.
One basic tip is that the default hyperparameter settings of their code may not be good enough for other purposes, so you need to test various hyperparameters for your own use case.

Best

@hoainamken

hoainamken commented Jun 25, 2019

@ku21fan
Thank you for sharing your awesome work. I checked your demo with Japanese words (https://demo.ocr.clova.ai/), and it works really well, even with printed and handwritten characters. Out of curiosity, did you also use synthetic data for training/validating the model for detecting Japanese words? Do you have any tips or tricks for that?
Thanks

@ku21fan
Contributor

ku21fan commented Jun 26, 2019

@hoainamken Thank you for your interest in our work :)
I am not sure whether the text detection part used synthetic data for Japanese.
(That is Youngmin's part, please ask him :D)
For the text recognition part, yes, we used synthetic data for Japanese words.
As in the comment above, we referred to MJSynth and SynthText, and used their code to generate synthetic data.
Basically, we used their code with our own materials, such as a vocabulary/corpus for Japanese.

Best

@hoainamken

hoainamken commented Jun 27, 2019

@ku21fan
Thank you so much.
I am working on the recognition part for Japanese. My model's accuracy is low, about 60% on real-world test images. I applied CRAFT for text detection beforehand. I keep wondering whether the low performance comes from:

  1. Not having enough synthetic data, or
  2. The model being too complex while the number of training words is small (my toy project has only about 150 words, and I generated about 7,500 synthetic images using only 1 font and 3 backgrounds, with added noise such as Gaussian, median filter, sharpen, and smooth. The images are cropped to the word length).

Should I generate more synthetic data or reduce the complexity of the model instead? What would you do in this situation?
My model uses the following configuration: --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
Best

@ku21fan
Contributor

ku21fan commented Jun 27, 2019

@hoainamken
If I were in your situation, I would try two things first.

  1. Generate more words with diverse fonts and backgrounds, since 150 words is too few compared to MJSynth and SynthText.
  2. While generating more data, try our best model, --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn, since it usually has higher accuracy than your current configuration (CRNN).
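
To make the difference between the two configurations explicit, here is a small sketch that stores each four-stage configuration as a dictionary and turns it into command-line arguments. The stage names and values come from this thread. The helper `to_cli_args` and the variable names are hypothetical.

```python
# Current configuration from the question (CRNN).
CRNN = {
    "Transformation": "None",
    "FeatureExtraction": "VGG",
    "SequenceModeling": "BiLSTM",
    "Prediction": "CTC",
}

# Suggested best-model configuration.
BEST_MODEL = {
    "Transformation": "TPS",
    "FeatureExtraction": "ResNet",
    "SequenceModeling": "BiLSTM",
    "Prediction": "Attn",
}


def to_cli_args(config):
    """Flatten a stage -> module mapping into ['--Stage', 'Module', ...]."""
    args = []
    for stage, module in config.items():
        args += [f"--{stage}", module]
    return args


# Stages that differ between the two configurations.
changed_stages = [stage for stage in CRNN if CRNN[stage] != BEST_MODEL[stage]]

print(" ".join(to_cli_args(BEST_MODEL)))
print(changed_stages)
```

Only the sequence modeling stage stays the same, so switching models changes the transformation, the feature extractor, and the prediction head together.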

Hope it helps :)

@hoainamken

hoainamken commented Jul 1, 2019

@ku21fan
Thanks a lot for your help. I tried your suggestions and they improved the accuracy significantly.
I have just one more question. Since I only need to predict 150 sentences, I created synthetic data from exactly those 150 sentences, but with different fonts and backgrounds.
The trained model performs well on separate sentences (as expected), but it fails when an image contains two sentences.
For example, when predicting image A (シャウエッセン) and image B (御堂筋事件) separately, the accuracy is 100%, but predicting image C (シャウエッセン 御堂筋事件) fails.
I think the reason may be the RNN model (BiLSTM): it has never learned that "御堂筋事件" can follow "シャウエッセン".
Image A
predicted: シャウエッセン
actual: シャウエッセン
Image B
predicted: 御堂筋事件
actual: 御堂筋事件
Image C
predicted: けしゴム(消しゴム)冷蔵庫
actual: シャウエッセン 御堂筋事件
My question is about generating synthetic data. For English recognition, each synthetic image contains one word, such as "school", "teacher", or "student". But Japanese words are not separated by white space as in English, so how do you generate synthetic data from a Japanese corpus?
Sorry for continuing to comment on this issue.
Best.

@ku21fan
Contributor

ku21fan commented Jul 3, 2019

@hoainamken
This repository is maintained for academic purposes, not for business.
So I'm afraid we cannot answer all of your questions.

@hoainamken

@ku21fan
No worries, thank you so much anyway. Once again, great work!

@WenmuZhou

What is the learning rate for Adam? In addition, I noticed that learning rate decay is not used during training.

@klarajanouskova

You mentioned that opt.character = 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

However, the output of your method on the competition website seems to contain characters that are not in this list, such as Chinese characters. For example:

https://rrc.cvc.uab.es/?ch=2&com=evaluation&view=method_sample&task=3&m=50493&gtv=10&file=1&eval=1&sample=673

Did you train the model on multi-language datasets, too?

@ku21fan
Contributor

ku21fan commented Feb 12, 2020

@klarajanouskova Hello,

No, I did not train this model on a multi-language dataset.

The character that you mentioned is '몲'.
First, I used an [UNK] token for unknown characters, which is commented out here
(since we didn't use the [UNK] token in the paper experiments).

Second, I simply replaced the [UNK] token with '몲', because [UNK] would otherwise be counted as 5 characters.
In other words, '몲' is just the result of simple post-processing so that [UNK] counts as 1 character.
Other characters such as '1', 'a', or 'b' could have been used instead of '몲',
but for strict evaluation I wanted a character that is not in opt.character, so I used '몲'.
('몲' is Korean. I shortened '모르겠음', which means "don't know", to '몲'.)
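
The post-processing described above can be sketched as a plain string replacement. This is only an illustration under the assumption that predictions are Python strings containing the literal text "[UNK]". The function name `collapse_unk` and the example prediction are hypothetical.

```python
import string

UNK_TOKEN = "[UNK]"   # 5 characters when counted literally
UNK_CHAR = "몲"        # single placeholder character outside the charset


def collapse_unk(prediction: str, charset: str) -> str:
    """Replace each [UNK] token with one character that is not in charset."""
    if UNK_CHAR in charset:
        raise ValueError("placeholder must not be a recognizable character")
    return prediction.replace(UNK_TOKEN, UNK_CHAR)


charset = string.printable[:-6]
raw = "caf[UNK]"                      # hypothetical prediction with one unknown
cleaned = collapse_unk(raw, charset)
print(len(raw), len(cleaned))
```

Because the placeholder is outside the character set, it can never match a ground-truth character by accident, so each unknown still counts as exactly one wrong character in the evaluation.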

Best

@klarajanouskova

@ku21fan Thanks a lot for the explanation!

@zhongqiang92

zhongqiang92 commented Jun 29, 2020

@ku21fan
Thanks a lot for your help. I tried your suggestions and it helps me improve the accuracy significantly.
I only have one more question. Since I am creating synthetic data to predict only 150 sentences, I have created exactly the same 150 sentences for synthetic data but with different fonts and backgrounds.
The model after being trained performs well on two separate sentences(as expected), but when the image contains two sentences, it could not.
For example: if predicting image A(シャウエッセン) and image B(御堂筋事件) separately, the accuracy is 100%, when it comes to predicting image C (シャウエッセン 御堂筋事件), it failed.
I think the reason may come from using RNN model (BiLSTM), "御堂筋事件" has never been learned to stay after "シャウエッセン".
imageA
predicted: シャウエッセン
actual: シャウエッセン
imageB
predicted: 御堂筋事件
actual:御堂筋事件
imageC
predicted: けしゴム(消しゴム)冷蔵庫
actual: シャウエッセン 御堂筋事件
My question is, when generating synthetic data, in the case of English recognization, each synthetic image has one word, it can be "school", "teacher" or "student"..etc. But in the case of Japanese, words are not separated by the white space as it is in English, how do you generate synthetic data from the corpus of Japanese?
Sorry for keep commenting on this issue.
Best.
Hi, can you tell me how you overcame this? I have tried using long sentences in the training data.

@yxgnahz

yxgnahz commented Jul 21, 2020

Hi, thanks for your awesome work. I'd really like to know where I can find the generation code for MJSynth and SynthText. Could you also share how you modified the code to generate the training data that you used in the ICDAR contest?

Best.

@choudhurym

choudhurym commented Aug 31, 2020

Hello @ku21fan,

I have scanned images of electronic theses and dissertations (ETDs) that contain typewritten text. I used this repository (https://github.com/clovaai/deep-text-recognition-benchmark) to perform OCR. Based on the instructions, it seems to work only on ICDAR and Imdb datasets. Correct me if I am wrong. I tried demo.py on the scanned ETDs, and it returns one word per ETD with a low confidence score. If I am not using the right URL, could you please give me a link to something that does general OCR?

I also found this website (https://clova.ai/ocr), which does general OCR. So, is the general OCR not released yet?
