hocr-clean #120

Open
zuphilip opened this issue Mar 25, 2018 · 2 comments

Comments

@zuphilip
Collaborator

Go through all ocr elements and delete the empty ones, and possibly also those that contain only whitespace. Either do this recursively, or start with the top-level elements and look at their textContent.
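
A minimal sketch of the top-down variant, assuming the hOCR file is well-formed XHTML that the standard library's `xml.etree.ElementTree` can parse. The names `is_ocr_element`, `clean_hocr`, and the `strip_whitespace_only` flag are hypothetical and not part of hocr-tools; treating any class name that starts with `ocr` as an ocr element is also an assumption.

```python
import xml.etree.ElementTree as ET


def is_ocr_element(el):
    # Assumption: an "ocr element" is any element with a class name
    # starting with "ocr" (ocr_page, ocr_line, ocrx_word, ...).
    return any(c.startswith("ocr") for c in el.get("class", "").split())


def _drop(parent, child):
    # ElementTree discards the tail text together with the element,
    # so move the tail to the previous sibling or to the parent first.
    idx = list(parent).index(child)
    tail = child.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def clean_hocr(root, strip_whitespace_only=True):
    """Remove ocr elements without text from the tree, in place.

    Works top-down: if the text content of an element is empty, the
    whole subtree is removed and its descendants are never visited.
    Returns the number of removed elements (subtree roots).
    """
    removed = 0

    def visit(parent):
        nonlocal removed
        for child in list(parent):  # copy, since we mutate while iterating
            text = "".join(child.itertext())  # equivalent of textContent
            empty = text.strip() == "" if strip_whitespace_only else text == ""
            if is_ocr_element(child) and empty:
                _drop(parent, child)
                removed += 1
            else:
                visit(child)

    visit(root)
    return removed
```

Possible usage would be `tree = ET.parse(path)`, then `clean_hocr(tree.getroot())`, then `tree.write(out_path)`. Note that this sketch would also remove an `ocr_page` that carries only image metadata and no text, which may or may not be desired.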

@amitdo
Contributor

amitdo commented Mar 27, 2018

This should be fixed in tesseract.

@stweil
Collaborator

stweil commented Mar 28, 2018

Yes, we think so, too. That would fix new hOCR files. Nevertheless, hocr-clean would be useful for existing hOCR files.
