Terminology: Glyphs, characters, codepoints #87

kba · 2016-10-26T14:05:07Z

None of these are "per-glyph" because "glyph" isn't a uniquely defined
concept independent of font. As far as hOCR is concerned, you need to
output information per codepoint. There is no single correct way of doing
that: it depends on the script, the encoding, and the OCR engine.

For bounding boxes (or cuts) on accented Western scripts, my recommendation
would be: (1) view the whole accented letter as a single glyph, (2) use
normalized Unicode with composed characters (NFC), (3) if a single glyph
corresponds to multiple codepoints, output a bounding box for the first
codepoint and output empty bounding boxes for the remaining codepoints.
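
A rough sketch of that recommendation in Python, for illustration only. The function name `per_codepoint_boxes`, the `(x0, y0, x1, y1)` tuple layout, and the use of `None` to stand for an "empty" bounding box are all assumptions made here, not anything defined by hOCR. Treating every codepoint with a nonzero combining class as part of the preceding glyph is also a simplification of full grapheme-cluster segmentation.

```python
import unicodedata


def per_codepoint_boxes(text, glyph_boxes):
    """Pair each codepoint of NFC-normalized text with a bounding box.

    glyph_boxes holds one (x0, y0, x1, y1) tuple per visual glyph, where an
    accented letter counts as a single glyph. The first codepoint of a glyph
    receives the box; any further codepoints of that glyph receive None
    (a hypothetical stand-in for an "empty" bounding box).
    """
    # Step (2): composed characters wherever Unicode defines them.
    normalized = unicodedata.normalize("NFC", text)
    boxes = iter(glyph_boxes)
    result = []
    for ch in normalized:
        # Step (3): a combining mark that survived NFC belongs to the
        # previous glyph, so it gets an empty box.
        if unicodedata.combining(ch) and result:
            result.append((ch, None))
        else:
            result.append((ch, next(boxes)))
    return result
```

For input such as `"e\u0301q\u0303"`, NFC composes the first pair into a single codepoint (é), while q with a combining tilde has no precomposed form and stays as two codepoints, so the tilde is the one that ends up with the empty box.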

We should define these terms and s/character/codepoint/ in the spec.
