Major restructuring #84

Merged
merged 77 commits into from
Oct 28, 2016

kba
Owner

@kba kba commented Oct 25, 2016

I went ahead and aggressively restructured and expanded the spec over the weekend. This is a big change and touches a lot of issues but since I was in the flow, I decided to just keep going.

Snapshot of current commit: https://rawgit.com/kba/hocr-spec/gen-defs/1.2/index.html

New top level structure

  • Terminology
    • Define element/property/capability
    • Define relations
    • Define grammar for properties
  • Elements of hOCR
    • Define categories
    • Subsection by category (not definitive but for grouping/readability)
  • Properties
  • Encoding guidelines (with the bits and pieces on how to mark up)
  • Metadata
    • Includes HTML Markup ("Formats")
    • Includes Profiles

For the formal definition of elements/properties, I created a YAML file that contains info on relations, examples, grammar, and categories. A Python script and templates generate definition lists for each element/property from that file, and these are included in the spec.
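
As a rough illustration of that kind of pipeline, the sketch below renders a definition list from a table of definitions. Everything in it is an assumption made for illustration: the field names (`categories`, `grammar`, `example`), the sample entry, and the HTML layout are hypothetical, and a plain dict stands in for the parsed YAML so the snippet has no dependencies. It is not the actual script from this PR.

```python
from html import escape

# Hypothetical stand-in for the parsed YAML file; field names are assumptions.
DEFINITIONS = {
    "bbox": {
        "categories": ["layout"],
        "grammar": "bbox x0 y0 x1 y1",
        "example": "bbox 0 0 100 200",
    },
}


def render_definition_list(name, definition):
    """Render one element/property definition as an HTML <dl> fragment."""
    rows = [f'<dl id="{escape(name)}">']
    for field, value in definition.items():
        # Lists (e.g. categories) are flattened to a comma-separated string.
        if isinstance(value, list):
            value = ", ".join(value)
        rows.append(f"  <dt>{escape(field)}</dt><dd>{escape(str(value))}</dd>")
    rows.append("</dl>")
    return "\n".join(rows)


if __name__ == "__main__":
    for name, definition in DEFINITIONS.items():
        print(render_definition_list(name, definition))
```

Keeping the definitions as data and the markup in one rendering function means a change to the definition-list layout only has to be made in one place.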

Still lots to do but it's in a state where I'd love to get feedback.

kba added 30 commits October 23, 2016 12:50
- Clarify what they are and provide anchors that definition lists can link to
- Classify element-property association levels
- Grammar for property serialization in title= attributes
- Define anchors to class/title attributes in HTML spec
@amitdo
Collaborator

amitdo commented Oct 25, 2016

As a general note, it looks great and very professional!

@zuphilip
Collaborator

Some remarks:

  • The notes in section 2.2 are IMO more like examples.
  • I am not sure if there are really properties that are "required" in a strong way. It looks like bbox is currently the only property that is required everywhere. But if one uses poly, then bbox will not be used. Originally, bbox was just a "generally recommended" property.
  • In element boxes I would suggest having just one point "properties" and making the further distinction inside that, e.g.
    properties
  • I wouldn't do a separate section for the OCR engine specific elements. I think it is much better to discuss them in their context. You could move the general remarks up somewhere and create a section with ocr_column, ocr_carea, ocrx_block and another one with ocr_line and ocrx_line and finally move the ocrx_word up a little.
  • I am a little skeptical that the classifications for the properties are useful. Maybe we should rather try to indicate the elements on which this property can be used?

@amitdo
Collaborator

amitdo commented Oct 27, 2016

I suggest moving the 'Logical Elements' section down. It is less significant than the other sections, and no OCR engine that we know of currently implements them.

@amitdo
Collaborator

amitdo commented Oct 27, 2016

About the grouping of properties (like scan_res and x_scanner): my suggestion is to break up this grouping and add 'related properties' for some of the properties instead.

@kba
Owner Author

kba commented Oct 27, 2016

The notes in section 2.2 are IMO more like examples.

How are they examples?

I am not sure if there are really properties that are "required" in a strong way. It looks like bbox is currently the only property that is required everywhere. But if one uses poly, then bbox will not be used. Originally, bbox was just a "generally recommended" property.

Granted, bbox is not required for all elements, but it doesn't make sense to have an ocr_carea without bbox or poly. We could also link to 'bbox or poly' or similar.

I am a little skeptical that the classifications for the properties are useful.

Can you elaborate? Originally, the spec listed the properties under the category of elements. That led to duplication (e.g. ocr_separator being in floats and typesetting). Now, they are grouped in those categories but can be listed in other categories as well. The list is just everything I could think of, but could be reduced. It makes sense IMHO to be able to say: "ocr_line/ocrx_line can contain any inline properties"

Maybe, we should rather try to indicate the elements on which this property can be used?

You get these if you click on the dfn in the heading for a property. From the perspective of an hOCR processor, it makes more sense IMHO to iterate over the elements and parse the properties according to the element definition, rather than the other way around.
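
A minimal sketch of that element-first approach, assuming the usual serialization in title= attributes (properties separated by semicolons, each a name followed by whitespace-separated values). The `EXPECTED` table and both function names are hypothetical and only illustrate the lookup; quoted string values that contain spaces or semicolons are deliberately not handled here.

```python
def parse_title(title):
    """Split a title= attribute value into {property_name: [values]}."""
    properties = {}
    for chunk in title.split(";"):
        tokens = chunk.split()
        if tokens:  # skip empty chunks, e.g. after a trailing semicolon
            properties[tokens[0]] = tokens[1:]
    return properties


# Hypothetical per-element table of properties to look for; not from the spec.
EXPECTED = {
    "ocr_line": {"bbox", "baseline"},
    "ocrx_word": {"bbox", "x_wconf"},
}


def properties_for(element_class, title):
    """Keep only the properties listed for the given element class."""
    allowed = EXPECTED.get(element_class, set())
    return {k: v for k, v in parse_title(title).items() if k in allowed}
```

For example, `properties_for("ocrx_word", "bbox 1 2 3 4; x_wconf 95")` keeps both properties, while the same title on an element class missing from the table yields an empty dict.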

@zuphilip
Collaborator

Section 2.2: The abstract description is followed by a specific example with ocr_page, bbox, ocrp_poly. However, that example does not yet show what is described above. Maybe we can extend it to an example with a note, i.e.

An hOCR element (in the following: element) is any HTML tag with a class attribute that contains exactly one class name that starts with ocr_ or ocrx_. Non-OCR related HTML content must not use class names that begin with ocr_ or ocrx_.

Example: <span class="ocr_page">
Note: When referring to an HTML tag with class ocr_page, this spec uses the notation <ocr_page>
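
The proposed wording can be read as a simple check on the class attribute. The sketch below follows that reading only; the function name `hocr_class` is hypothetical, and treating "more than one matching class name" as "not an hOCR element" is an assumption about how a processor would handle that case.

```python
HOCR_PREFIXES = ("ocr_", "ocrx_")


def hocr_class(class_attr):
    """Return the single hOCR class name in a class attribute, or None.

    None is returned when no class name, or more than one class name,
    starts with ocr_ or ocrx_.
    """
    matches = [c for c in class_attr.split() if c.startswith(HOCR_PREFIXES)]
    return matches[0] if len(matches) == 1 else None
```

Under this reading, `class="highlight ocrx_word"` still identifies an hOCR element, whereas `class="ocr_line ocrx_line"` does not, because two class names match.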

@amitdo
Collaborator

amitdo commented Oct 27, 2016

An hOCR element (in the following: element) is...

Is that proper English?

https://www.quora.com/What-is-a-more-modern-way-to-say-hereinafter-referred-to-as

@kba
Owner Author

kba commented Oct 28, 2016

These issues already seem pretty detailed, and this is a big PR as it is. I'll merge this and create issues for the wording/notation/property classification, if that's okay with you.

@kba
Owner Author

kba commented Oct 28, 2016

@amitdo @zuphilip I created issues for those remarks that have not yet been addressed. Feel free to create more if I forgot something.

3 participants