Rotated text lines in hOCR output #148

Open
stweil opened this issue Mar 8, 2019 · 5 comments

Comments

@stweil
Collaborator

stweil commented Mar 8, 2019

This image contains a full page of vertical text lines. The hOCR output, which was created by Tesseract 4.0, has no direct indicator of which text lines are horizontal and which are vertical.

It might be interesting to have a filter program which detects the line orientation from the hOCR data by interpreting the coordinates of the bounding boxes.
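A minimal sketch of such a filter, using only the Python standard library. It assumes that lines carry the class `ocr_line`, that words carry the class `ocrx_word`, and that each word has a `bbox x0 y0 x1 y1` entry in its `title` attribute. Other line-level classes are ignored. The names `LineCollector`, `guess_orientation` and `line_orientations` are hypothetical and not part of hocr-tools. The heuristic compares how far the word centres spread along each axis:

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineCollector(HTMLParser):
    """Collect the word bounding boxes of every ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []  # one list of (x0, y0, x1, y1) word boxes per line

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes:
            self.lines.append([])  # start a new line
        elif "ocrx_word" in classes and match and self.lines:
            self.lines[-1].append(tuple(map(int, match.groups())))


def guess_orientation(word_boxes):
    """Return 'horizontal', 'vertical' or 'unknown' for one line."""
    if len(word_boxes) < 2:
        return "unknown"  # a single box gives no direction
    xs = [(x0 + x1) / 2 for x0, _, x1, _ in word_boxes]
    ys = [(y0 + y1) / 2 for _, y0, _, y1 in word_boxes]
    dx = max(xs) - min(xs)  # spread of word centres along x
    dy = max(ys) - min(ys)  # spread of word centres along y
    if dx == dy:
        return "unknown"
    return "horizontal" if dx > dy else "vertical"


def line_orientations(hocr_text):
    """Guess the orientation of every ocr_line in an hOCR document."""
    parser = LineCollector()
    parser.feed(hocr_text)
    return [guess_orientation(boxes) for boxes in parser.lines]
```

This heuristic only separates horizontal from vertical lines. It cannot tell clockwise from counter-clockwise rotation, and it cannot tell rotated lines from scripts that are written top to bottom.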

A similar algorithm would be needed for rendering of the OCR results, for example in PDF output created by hocr-pdf or by Tesseract or in hocrjs.

See also issue #54.

@zuphilip
Collaborator

zuphilip commented Mar 8, 2019

> This image contains a full page of vertical text lines.

Let us be more precise here: The lines are rotated by 90 degrees clockwise.

> The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.

Well, but that should be improved first. I think that this rotation should be indicated by the textangle property, see http://kba.cloud/hocr-spec/1.2/#textangle, but @kba might know better than I do.
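For illustration, a reader of hOCR files could pick up that property roughly as follows. This is only a sketch: it assumes the property appears in the element's `title` attribute as `textangle <number>`, separated from other properties by semicolons, and that a missing property means 0. The helper name `textangle` is hypothetical.

```python
import re

# Matches a "textangle <number>" entry at the start of the title
# or after a semicolon separator.
TEXTANGLE_RE = re.compile(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)")


def textangle(title, default=0.0):
    """Return the textangle value from an hOCR title string.

    Falls back to `default` when the property is absent.
    """
    match = TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else default
```

For example, `textangle("bbox 10 20 30 400; textangle 90; x_size 30")` yields `90.0`, while a title without the property yields the default.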

In the Japanese text, the lines are not rotated, but the text direction is top to bottom.

@stweil changed the title from "Non horizontal text lines in hOCR output" to "Rotated text lines in hOCR output" on Mar 8, 2019
@stweil
Collaborator Author

stweil commented Mar 8, 2019

That spec says "angle in degrees by which textual content has been rotate[d] relative to the rest of the page". I think this is not very precise or helpful, because both pages in question would have the default value (0°), as each line has the same rotation as "the rest of the page".

@zuphilip
Collaborator

zuphilip commented Mar 8, 2019

Tesseract 3.05 used to add the textangle property, see e.g. https://raw.githubusercontent.com/zuphilip/ocr-fileformat-samples/3590006039022801e3847f67feb085b3872585be/samples/hocr/1.1/452114306.hocr . What happened to that?

I agree that the specs are not that clear about the details, see also kba/hocr-spec#101.

@stweil
Collaborator Author

stweil commented Mar 8, 2019

That's an important hint. You are right, the old hOCR for the same image includes the textangle property. I'll open an issue for Tesseract.

@stweil
Collaborator Author

stweil commented Jun 28, 2023

hocr-extract-images currently ignores the textangle property, so line images with rotated text don't get rotated into a horizontal line (which is required for training).
