Glyphs (IMPACT) #26
Comments
Some more thoughts on this issue:
<Glyph ID="P1_ST00001_G04" CONTENT="m" HPOS="262" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.7">
<Variant VC="0.2">rn</Variant>
</Glyph>
<charParams l="317" t="397" r="331" b="426" ... charConfidence="100">t</charParams>
<charParams l="448" t="212" r="504" b="276" ... charConfidence="98" >e</charParams> |
I've been implementing ALTO support for the OCR pipeline at the Digital Humanities chair at the University of Leipzig. Being able to encode results at the lowest recognition granularity would enable us to reduce conversion losses from our "native" TEI facsimile format. Float confidence values between 0 and 1 would also fit better in the general data model, as the different semantics for the CC field in the String tag seem rather arbitrary (it is also impossible to associate characters and confidences if the "unit" of recognition is unknown, e.g. for multi-codepoint glyphs). If I understand the schema correctly, the number of significant figures is unspecified as long as values fit in a 32-bit float. We are fine with that, and any sane parser should be able to deal with arbitrary-precision inputs. Lastly, according to Unicode TR29, the correct term for anything a human recognizes as a single character on the page is grapheme cluster. We use this term throughout our documentation, although it somewhat breaks down when an engine produces non-printable output, e.g. combining diacritic + character as two separate outputs instead of a single one. |
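To illustrate the point about multi-codepoint glyphs, here is a minimal Python sketch. The helper `split_base_and_marks` is a hypothetical name, and its rule (attach combining marks to the preceding character) is only a rough simplification of the grapheme cluster rules in TR29, not a full implementation.

```python
import unicodedata

# "é" written as base letter + combining acute accent:
# one character for a human reader, two code points for a parser.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)


def split_base_and_marks(text):
    """Group each character with the combining marks that follow it.

    Simplified stand-in for grapheme cluster segmentation; it only
    handles combining marks, not the full rule set.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


print(len(decomposed), len(composed))
print(split_base_and_marks("me\u0301"))
```

If an engine emits the diacritic and the base character as two separate outputs, a per-codepoint confidence list no longer lines up with what a reader would count as characters, which is the alignment problem described above.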
Good morning and many thanks for the input. We are currently discussing this feature and have already been discussing the terminology. |
Thanks. Terminology-wise, glyph seems to be in more widespread use, while grapheme cluster has the benefit of being well defined by Unicode. As I said, neither term covers all corner cases, and I'm not going to start a religious war over which one is better. On another note, hOCR does not allow usable glyph-level encoding, as the encoding schemes ('cuts' and 'x_boxes') use the same list syntax that makes alignment between glyphs and confidences from the CC attribute impossible in some cases. |
The tech calls took place on the 17th and 21st of March. Here is the summary of the topics discussed and the resulting conclusions / proposals for the final changes:
<!-- Option 1: keep simple and have multiple characters in variants without further information -->
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.7">rn</Variant>
<Variant VC="0.1">iii</Variant>
</Glyph>
<!-- Option 2: keep simple and have multiple characters in variants, additionally adding (optional) coordinates -->
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<!-- multiple chars in a variant?-->
<Variant CONTENT="rn" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9"/>
<Variant CONTENT="iii" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9"/>
</Glyph>
<!-- Option 3: usage of same logic from file references in METS with seq and par to outline possible combinations -->
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<!-- multiple chars in a variant?-->
<par>
<seq>
<Variant VC="0.7">r</Variant>
<Variant VC="0.7">n</Variant>
</seq>
<seq>
<Variant VC="0.7">i</Variant>
<Variant VC="0.7">i</Variant>
<Variant VC="0.7">i</Variant>
</seq>
<Variant VC="0.1">n</Variant>
</par>
</Glyph>
<!-- Option 4: grouping glyphs in additional level above, here -->
<Variant ID="var_opt1">
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.7">r</Variant>
</Glyph>
</Variant>
<Variant ID="var_opt2">
<Glyph ID="P1_ST00003_G03" CONTENT="n" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.1">i</Variant>
<Variant VC="0.1">i</Variant>
</Glyph>
<Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.1">l</Variant>
</Glyph>
</Variant>
<Variant ID="var_opt3">
<Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.1">l</Variant>
</Glyph>
<Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.1">l</Variant>
</Glyph>
<Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<Variant VC="0.1">l</Variant>
</Glyph>
</Variant>
<!-- Option 5: grouping glyphs in additional level above, here -->
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
<!-- multiple chars in a variant?-->
<Variant VC="0.7" VAR_TYPE="Part1" VARContent="rn">r</Variant>
<Variant VC="0.7" VAR_TYPE="Part2" VARContent="rn">n</Variant>
</Glyph>
Thanks to everyone who took part in the calls for their input and time. Please comment and correct if something does not match your understanding or if there are mistakes. |
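As a rough illustration of how Option 1 could be consumed, the following Python sketch parses the Option 1 fragment and ranks the glyph and its variants by confidence. The function name `ranked_readings` is hypothetical, and the fragment is parsed without an XML namespace, which a real ALTO file would have.

```python
import xml.etree.ElementTree as ET

# Option 1 fragment from the summary above (no namespace, for brevity).
GLYPH_XML = """
<Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
  <Variant VC="0.7">rn</Variant>
  <Variant VC="0.1">iii</Variant>
</Glyph>
"""


def ranked_readings(glyph):
    """Return (text, confidence) pairs for the glyph and its variants, best first."""
    readings = [(glyph.get("CONTENT"), float(glyph.get("GC")))]
    for variant in glyph.findall("Variant"):
        readings.append((variant.text, float(variant.get("VC"))))
    return sorted(readings, key=lambda pair: pair[1], reverse=True)


glyph = ET.fromstring(GLYPH_XML)
for text, confidence in ranked_readings(glyph):
    print(f"{text!r}: {confidence}")
```

With Option 1 the multi-character variants share the glyph's bounding box; a consumer that needs per-character coordinates for "rn" would need something like Options 2 to 4.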
ABBYY full output for variants:
<wordRecVariants>
<wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="1" wordPenalty="0" meanStrokeWidth="40">
<variantText>18e<charParams l="1977" t="197" r="1994" b="237" charConfidence="21" serifProbability="100">
<charRecVariants>
<charRecVariant charConfidence="25" serifProbability="255">i</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">1</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">I</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">l</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">Î</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">Ï</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">î</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">ï</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">!</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">{</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">A</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">a</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">À</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">Â</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">à</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">â</charRecVariant>
</charRecVariants>1</charParams>
<charParams l="1977" t="197" r="1994" b="237" charConfidence="88" serifProbability="75">
<charRecVariants>
<charRecVariant charConfidence="88" serifProbability="75">8</charRecVariant>
<charRecVariant charConfidence="19" serifProbability="73">S</charRecVariant>
<charRecVariant charConfidence="19" serifProbability="73">s</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="43">B</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="43">b</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="255">3</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">6</charRecVariant>
</charRecVariants>8</charParams>
<charParams l="1977" t="197" r="1994" b="237" charConfidence="40" serifProbability="40">
<charRecVariants>
<charRecVariant charConfidence="50" serifProbability="100">6</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">e</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">è</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">é</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">ê</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">ë</charRecVariant>
<charRecVariant charConfidence="29" serifProbability="255">fi</charRecVariant>
<charRecVariant charConfidence="24" serifProbability="255">®</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="27">8</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="43">B</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="43">b</charRecVariant>
<charRecVariant charConfidence="14" serifProbability="32">S</charRecVariant>
<charRecVariant charConfidence="14" serifProbability="32">s</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">g</charRecVariant>
</charRecVariants>e</charParams>
</variantText>
</wordRecVariant>
<wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="0" wordIdentifier="0" wordPenalty="7" meanStrokeWidth="40">
<variantText>I8e<charParams l="1977" t="197" r="1994" b="237" charConfidence="21" serifProbability="100">
<charRecVariants>
<charRecVariant charConfidence="25" serifProbability="255">i</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">1</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">I</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">l</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">Î</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">Ï</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">î</charRecVariant>
<charRecVariant charConfidence="21" serifProbability="100">ï</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">!</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">{</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">A</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">a</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">À</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">Â</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">à</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">â</charRecVariant>
</charRecVariants>I</charParams>
<charParams l="1977" t="197" r="1994" b="237" charConfidence="88" serifProbability="75">
<charRecVariants>
<charRecVariant charConfidence="88" serifProbability="75">8</charRecVariant>
<charRecVariant charConfidence="19" serifProbability="73">S</charRecVariant>
<charRecVariant charConfidence="19" serifProbability="73">s</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="43">B</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="43">b</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="255">3</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">6</charRecVariant>
</charRecVariants>8</charParams>
<charParams l="1977" t="197" r="1994" b="237" charConfidence="40" serifProbability="40">
<charRecVariants>
<charRecVariant charConfidence="50" serifProbability="100">6</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">e</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">è</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">é</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">ê</charRecVariant>
<charRecVariant charConfidence="40" serifProbability="40">ë</charRecVariant>
<charRecVariant charConfidence="29" serifProbability="255">fi</charRecVariant>
<charRecVariant charConfidence="24" serifProbability="255">®</charRecVariant>
<charRecVariant charConfidence="16" serifProbability="27">8</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="43">B</charRecVariant>
<charRecVariant charConfidence="15" serifProbability="43">b</charRecVariant>
<charRecVariant charConfidence="14" serifProbability="32">S</charRecVariant>
<charRecVariant charConfidence="14" serifProbability="32">s</charRecVariant>
<charRecVariant charConfidence="13" serifProbability="255">g</charRecVariant>
</charRecVariants>e</charParams>
</variantText>
</wordRecVariant>
</wordRecVariants> |
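For comparison with the proposed Glyph/Variant structure, here is a hedged Python sketch that maps one `charParams` element from output like the above onto a `Glyph` with `Variant` children. The function name `abbyy_char_to_glyph` is hypothetical. Two assumptions are made that the thread does not confirm: that `charConfidence` is on a 0 to 100 scale (so dividing by 100 gives a value between 0 and 1), and that the ABBYY coordinates can be copied without unit conversion.

```python
import xml.etree.ElementTree as ET

# Shortened copy of one charParams element from the ABBYY output above.
ABBYY_CHAR = """
<charParams l="1977" t="197" r="1994" b="237" charConfidence="88" serifProbability="75">
  <charRecVariants>
    <charRecVariant charConfidence="88" serifProbability="75">8</charRecVariant>
    <charRecVariant charConfidence="19" serifProbability="73">S</charRecVariant>
    <charRecVariant charConfidence="16" serifProbability="43">B</charRecVariant>
  </charRecVariants>8</charParams>
"""


def abbyy_char_to_glyph(char_params, glyph_id):
    """Build a Glyph element (with Variant children) from one charParams element."""
    variants_el = char_params.find("charRecVariants")
    # The chosen character is the text that follows </charRecVariants>.
    chosen = (variants_el.tail or "").strip()
    left, top = int(char_params.get("l")), int(char_params.get("t"))
    right, bottom = int(char_params.get("r")), int(char_params.get("b"))
    glyph = ET.Element("Glyph", {
        "ID": glyph_id,
        "CONTENT": chosen,
        "HPOS": str(left),
        "VPOS": str(top),
        "WIDTH": str(right - left),
        "HEIGHT": str(bottom - top),
        # Assumption: charConfidence is a 0-100 scale.
        "GC": str(int(char_params.get("charConfidence")) / 100),
    })
    for rec in variants_el.findall("charRecVariant"):
        if rec.text == chosen:
            continue  # the chosen reading is already stored in CONTENT
        confidence = int(rec.get("charConfidence")) / 100
        variant = ET.SubElement(glyph, "Variant", {"VC": str(confidence)})
        variant.text = rec.text
    return glyph


glyph = abbyy_char_to_glyph(ET.fromstring(ABBYY_CHAR), "G1")
print(ET.tostring(glyph, encoding="unicode"))
```

Note that in the full output the chosen character is not always the variant with the highest `charConfidence` (for the first character, "i" scores 25 while the chosen "1" scores 21), so a converter should not assume the two coincide.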
If we agree to focus on how to record variants for ONE OCR engine (as discussed during the meeting), case no. 5 doesn't exist. Let's suppose an engine has to segment "my" and splits the word into 2 glyphs. It will eventually output variants 'm' / 'n' / 'h' for the first glyph, but never a 2-glyph variant. For the second glyph: 'y' / 'j'. And at the word level, this engine will propose word variants: "my" / "ny" / "nj" / "ny"... Another engine will segment "my" into 3 glyphs, and maybe the variants for the first one will be: 'i' / 'r' |
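The relationship between per-glyph variants and word-level variants in this example can be sketched in Python as a plain cross product. This is only an illustration of the argument; `word_candidates` is a hypothetical helper, and a real engine would rank and prune the combinations (e.g. with a dictionary) rather than list them all.

```python
from itertools import product

# Per-glyph candidates for an engine that segments "my" into two glyphs.
glyph_variants = [
    ["m", "n", "h"],  # candidates for the first glyph
    ["y", "j"],       # candidates for the second glyph
]


def word_candidates(per_glyph):
    """Combine one candidate per glyph position into candidate words."""
    return ["".join(combo) for combo in product(*per_glyph)]


print(word_candidates(glyph_variants))
# ['my', 'mj', 'ny', 'nj', 'hy', 'hj']
```

Every candidate word has exactly two characters, one per glyph, which is the point made above: within a single segmentation, a variant never spans more than one glyph.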
#26 , further adapted annotation for String/ALTERNATIVE for clarification of difference
The adaptations have been applied to the version 3.2 draft schema. I ask everyone to review and accept or reject the change. |
In principle okay, but our main developer had the following comments: Glyph variants: The main glyphs are restricted to length 1 but variants to length 3. This could be a bit inconvenient when dealing with OCR results. Say FineReader returns 5 options, some with length 1 and some longer. What happens if the first one is not of length 1? Does the ALTO exporter tool then check if there is one with length 1 among the other options and change the order? And why three? For Latin that would probably cover most cases, but for other scripts there might be longer ones. HYP: Should variants also be considered for hyphens? |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
accept |
accept |
Accept (comment 27 Oct 2016 to be raised as a new issue) |
Stephan, |
Included in v4.0. |
Submitter: Impact
Submitted: 2013-02
use case
Modern OCR software stores information at glyph level. A glyph is essentially a character or ligature. Each character has its own coordinate information and must be separately addressable as a distinct object. Correction and verification processes can be carried out for individual characters. Post-OCR analysis of the text as well as adaptive OCR algorithms must be able to record information at character level.
In order to reproduce the decision of the OCR software, the alternative characters it considered must be recorded. These are called variants. The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. The confidence score expresses how confident the OCR software is that a single glyph has been recognized correctly.
implementation
Glyphs are recorded in the <Glyph> element. This element is optional and a child element of <String>. The glyph element may have a <Shape> element (see above). The (recognized) character of the glyph is stored in the CONTENT attribute.
The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. Due to post-processing steps such as correction, the values of both attributes may be inconsistent.
Each <Glyph> element may have an optional VALID attribute. This attribute may only have one of the following three values:
•“s” - expresses that the glyph is a suspicious character. The OCR software is not confident that it has recognized the glyph correctly.
•“r” – the character has been rejected; the OCR is confident that this character is not the glyph.
•“c” - the character is considered correct; the OCR software is confident that it has recognized the glyph correctly.
Each <Glyph> may have one or more <Variant> elements. Each variant represents an option for the glyph that the OCR software could have chosen. The <Variant> element’s VC attribute records a float value between 0 and 1 that expresses the level of confidence for the variant, where 1 means fully confident. This attribute is optional. If it is not available, the default value for the variant is “0”. The VC attribute’s semantics are similar to those of the WC attribute for the <String> element.
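A minimal Python sketch of how a reader might apply the rules described in this section: the VALID attribute is checked against the three allowed values, and a missing VC is read as 0. The function names and the sample fragment are hypothetical, and the fragment follows this proposal's wording rather than any released schema version.

```python
import xml.etree.ElementTree as ET

ALLOWED_VALID = {"s", "r", "c"}  # the three values listed in the proposal

# Hypothetical sample: the second variant has no VC attribute.
SAMPLE = """
<Glyph ID="G1" CONTENT="m" GC="0.7" VALID="s">
  <Variant VC="0.2">rn</Variant>
  <Variant>iii</Variant>
</Glyph>
"""


def check_valid(glyph):
    """Return the VALID value (or None if absent); reject unknown values."""
    value = glyph.get("VALID")
    if value is not None and value not in ALLOWED_VALID:
        raise ValueError(f"unexpected VALID value: {value!r}")
    return value


def read_variants(glyph):
    """Return (text, confidence) pairs; a missing VC defaults to 0.0."""
    variants = []
    for variant in glyph.findall("Variant"):
        confidence = float(variant.get("VC", "0"))
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"VC out of range: {confidence}")
        variants.append((variant.text, confidence))
    return variants


glyph = ET.fromstring(SAMPLE)
print(check_valid(glyph), read_variants(glyph))
```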
example
Proposed change (initial draft):