Improve yor.traineddata for Yoruba #89

Open
Shreeshrii opened this issue Aug 23, 2017 · 9 comments
Comments

@Shreeshrii
Contributor

@theraysmith

See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/RF1rk3-z4uo/noQzBWbuCAAJ

Message from @Timilehin copied below

I am working on a side project in Yoruba that might be helpful. It predicts the right diacritics for unmarked Yoruba words. I imagine you could also run the OCR with only unmarked characters allowed as output (maybe reduce the height of the scan window so it doesn't see the diacritics), then pipe the unmarked output through the tool I'm building and use the result as a fallback for when the image recognition is not sure.

My project right now needs more training data to make the model more robust. It is very tough to find properly marked Yoruba text on the internet. I have physical books and some scanned PDFs on archive.org that I would like to convert to text, but yor.traineddata doesn't seem robust enough. It makes many mistakes, such as ọdọ instead of ẹdẹ.
Other times, it just spits out gibberish.
What can I provide to help make yor.traineddata much better, and in what quantity (e.g. 200 page images of Yoruba text together with the text they contain)? I think both projects can reinforce each other. I look forward to hearing back.

link to proj -> https://github.com/Timilehin/Yoruba-Intonator
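
To make the "unmarked" versus "marked" distinction in the message above concrete, here is a minimal Python sketch of how unmarked text could be derived from marked Yoruba text. It is an illustration only, not the method used by Yoruba-Intonator; the function name `strip_diacritics` is hypothetical, and the sketch assumes that all tone marks and under-dots decompose into Unicode combining characters under NFD.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so keep the rest.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Èdè Yorùbá"))  # Ede Yoruba
```

Pairing each marked sentence with its stripped form would give (unmarked, marked) examples of the kind a diacritic-restoration model needs.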

@Shreeshrii changed the title from "Add support for Yoruba" to "Improve yor.traineddata for Yoruba" on Aug 23, 2017
@Shreeshrii
Contributor Author

An Crúbadán (Corpus Building for Minority Languages) has a page for Yoruba: http://crubadan.org/languages/yo

@Timilehin

Thanks @Shreeshrii for creating an issue for this. I looked at the Crúbadán corpus. Most of the URLs it scrapes contain Yoruba that is not properly marked. Given the low signal-to-noise ratio, I don't think it would be good to train with that (or with most web-scraped data).

I currently have 2 websites that reliably have properly marked Yoruba. I am thinking of taking screenshots of the text and also supplying the same content as plain text. I think this will be a good starting point to improve the model. Does this idea sound good?

@amitdo

amitdo commented Aug 23, 2017

Making screenshots is not very useful. You need the text itself. A web crawler is what you need to use.

Please list the URLs of those two sites.

Did you try to extract the wordlist from the yor traineddata and examine it?
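
For reference, a minimal sketch of how that extraction could be scripted. It only builds the command lines for Tesseract's `combine_tessdata` and `dawg2wordlist` tools and prints them. The helper name `wordlist_commands`, the `yor_unpacked` directory, and the output file name are hypothetical, and which components (`word-dawg` or `lstm-word-dawg`) are present depends on the traineddata file, so check the unpacked files before running the second command.

```python
import shlex
from pathlib import Path


def wordlist_commands(traineddata, workdir="yor_unpacked", lstm=False):
    """Build argv lists to unpack a traineddata file and dump its word list."""
    lang = Path(traineddata).stem  # "yor" for "yor.traineddata"
    prefix = f"{workdir}/{lang}."
    # Assumption: legacy models ship "unicharset"/"word-dawg",
    # LSTM models ship "lstm-unicharset"/"lstm-word-dawg".
    unicharset = prefix + ("lstm-unicharset" if lstm else "unicharset")
    dawg = prefix + ("lstm-word-dawg" if lstm else "word-dawg")
    return [
        # Unpack every component of the traineddata file under the prefix.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Convert the word DAWG back into a plain-text word list.
        ["dawg2wordlist", unicharset, dawg, f"{workdir}/{lang}.wordlist.txt"],
    ]


if __name__ == "__main__":
    for cmd in wordlist_commands("yor.traineddata"):
        print(shlex.join(cmd))
```

The working directory has to exist before the printed commands are run. The resulting word list can then be checked by eye for unmarked or mis-marked Yoruba words.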

@Timilehin

Timilehin commented Aug 23, 2017

@amitdo I meant my last message in the context of useful training data for Tesseract's yor.traineddata, not my project. Please confirm whether this OCR system takes only text, and not also images, when training its models to predict what text an image contains.

The URLs are:

  1. http://www.theyorubablog.com
  2. https://www.jw.org/yo/
  3. https://yo.m.wikipedia.org/wiki/Èdè_Yorùbá

Wikipedia (3) only has marked Yoruba on that first page. Every page it links to (and every other page on yo.wikipedia.org that I've seen) is not properly marked. This is not the case for 1 and 2.
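
A rough way to screen pages for marking before adding them to a corpus is to measure how many words carry at least one diacritic. The Python sketch below is a heuristic under stated assumptions, not a validated filter: the function name `marked_word_ratio` is hypothetical, and any cut-off applied to its result would have to be chosen by inspecting known-good and known-bad pages.

```python
import unicodedata


def marked_word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated words with a diacritic."""
    words = text.split()
    if not words:
        return 0.0

    def has_mark(word: str) -> bool:
        # After NFD, tone marks and under-dots appear as combining characters.
        return any(unicodedata.combining(ch)
                   for ch in unicodedata.normalize("NFD", word))

    return sum(has_mark(w) for w in words) / len(words)


if __name__ == "__main__":
    print(marked_word_ratio("Èdè Yorùbá"))  # 1.0
    print(marked_word_ratio("Ede Yoruba"))  # 0.0
```

A ratio near zero suggests unmarked text. A high ratio does not prove the marks are correct, and correctly marked text will not reach 1.0, because some Yoruba words legitimately carry no marks.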

@amitdo

amitdo commented Aug 23, 2017

The images for training are created by the text2image tool. It renders images from text files using a variety of digital fonts.
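
As a minimal sketch of that rendering step, the Python snippet below builds one `text2image` command per font and prints it. The helper name `text2image_command`, the input file name, the font names, the fonts directory, and the `lang.font.exp0` output naming are assumptions for illustration; the fonts actually used would need to cover the Yoruba tone marks and under-dots.

```python
import shlex


def text2image_command(text_file, font, fonts_dir="/usr/share/fonts", lang="yor"):
    """Build the argv list for rendering one text file in one font."""
    # Assumed naming scheme: <lang>.<fontname without spaces>.exp0
    outputbase = f"{lang}.{font.replace(' ', '')}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical font choices; substitute fonts installed on your system.
    for font in ["DejaVu Sans", "Noto Serif"]:
        print(shlex.join(text2image_command("yoruba_sentences.txt", font)))
```

Each command, when run, is expected to write an image and a matching box file under the given output base, which is why only the text itself has to be supplied.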

@Timilehin

Ah, I see. I probably should have read the docs more carefully. But that's very interesting. I wouldn't have thought to do that.
In that case, you can have all of my hand-picked, fresh and fully marked Yoruba corpus (harvested from those three sites) here ->
https://github.com/Timilehin/Yoruba-Intonator/blob/master/yoruba_sentences.txt

The only thing to note is that I broke them down into one sentence per line. I hope that doesn't affect the model. I will keep adding more as I find them.

@Timilehin

Any updates on this? Anything I can be doing on my end?

@Shreeshrii
Contributor Author

I am hoping that @theraysmith will include your resources for his next training.

@Timilehin

Any updates on this?
