recognize: skip tiny or bin-empty lines, too #76

Merged
merged 1 commit into from
Sep 16, 2022

bertsky
Contributor

@bertsky bertsky commented Aug 19, 2022

No description provided.

@mikegerber mikegerber self-assigned this Sep 15, 2022
@mikegerber mikegerber self-requested a review September 15, 2022 16:36
@mikegerber mikegerber added the enhancement New feature or request label Sep 15, 2022
Collaborator

@mikegerber mikegerber left a comment

While I sympathize with these changes: Is this filter not something that would be useful in other processors, too? I.e. go to the core library?

@kba Your thoughts on this?

@mikegerber
Collaborator

It could also go to a generic filter-cleaner processor.

@bertsky
Contributor Author

bertsky commented Sep 16, 2022

> It could also go to a generic filter-cleaner processor.

Indeed. It makes sense to do that as a postprocessing step, depending on the workflow (not all segmenters will need it). I could easily add something along these lines to ocrd-segment-repair.

Until then, let's consider this a robustness feature. (ocrd_tesserocr already does the same filtering.)
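To make "tiny or bin-empty" concrete, here is a minimal sketch of such a robustness check. It is not the code from this PR: the function names, the `min_size` threshold of 8 pixels, and the convention that nonzero means foreground in the binarized line image are all assumptions chosen for illustration.

```python
from typing import Sequence


def is_tiny(width: int, height: int, min_size: int = 8) -> bool:
    """True if either dimension is below the (hypothetical) minimum size."""
    return width < min_size or height < min_size


def is_bin_empty(binary_rows: Sequence[Sequence[int]]) -> bool:
    """True if the binarized image has no foreground pixel.

    Assumption: foreground pixels are nonzero, background pixels are 0.
    """
    return not any(any(row) for row in binary_rows)


def should_skip_line(binary_rows: Sequence[Sequence[int]], min_size: int = 8) -> bool:
    """Decide whether a line image should be skipped before recognition."""
    height = len(binary_rows)
    width = len(binary_rows[0]) if height else 0
    return is_tiny(width, height, min_size) or is_bin_empty(binary_rows)
```

A processor could call `should_skip_line()` on each cropped line image and log a warning instead of passing the line to the recognizer.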

@mikegerber
Collaborator

Yeah, I had intended to merge it; I just need to run a quick test later.

@kba
Member

kba commented Sep 16, 2022

> While I sympathize with these changes: Is this filter not something that would be useful in other processors, too? I.e. go to the core library?

Such a filter would make sense to have in core but where would we add it? To the get_TextLine() method in the PAGE API?

@bertsky
Contributor Author

bertsky commented Sep 16, 2022

> Such a filter would make sense to have in core but where would we add it? To the get_TextLine() method in the PAGE API?

If in core as part of OcrdPage, I'd rather place it on the producer side: TextLineType constructor and .set_Coords(). (Where it's easy to check the bbox dimensions, but checking for binary-empty is challenging – it would involve looking up the line's binarized AlternativeImage, or its parent's etc.) Perhaps even just CoordsType constructor and .set_points() (which we already patch for another reason).
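As a rough illustration of the producer-side idea, the bounding box can be derived directly from a PAGE `points` string ("x1,y1 x2,y2 ..."), without any image access. This is only a sketch: `bbox_from_points`, `has_usable_extent` and the minimum extents are hypothetical names and values, not part of OcrdPage or of this PR.

```python
from typing import Optional, Tuple


def bbox_from_points(points: str) -> Optional[Tuple[int, int, int, int]]:
    """Parse a PAGE-style points string into (xmin, ymin, xmax, ymax).

    Returns None if the string contains no coordinate pairs.
    """
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    if not pairs:
        return None
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


def has_usable_extent(points: str, min_width: int = 2, min_height: int = 2) -> bool:
    """Check the bounding box against hypothetical minimum dimensions."""
    bbox = bbox_from_points(points)
    if bbox is None:
        return False
    xmin, ymin, xmax, ymax = bbox
    return (xmax - xmin) >= min_width and (ymax - ymin) >= min_height
```

A check like this would cover only the size criterion. The binary-empty criterion still needs the image lookup described above.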

Also, what about the GdsCollector mechanism?

@mikegerber mikegerber merged commit eb48dcb into OCR-D:master Sep 16, 2022

3 participants