ALTO renderer: move to v4, add Glyphs #2815
The second commit adds |
Which software uses the additional information in the ALTO file? |
As for current tools, I don't know. (v4 is still pretty new.) In principle, any software that wants to have precise coordinates of individual characters, and with the second commit, notably, post-processing software. With visualizers like PageViewer (which will very likely also show Glyphs for ALTO soon) this is also a nice option for debugging. |
@stweil do you want me to make glyph and/or variant output optional, let's say with a variable |
My primary focus regarding ALTO is best support for those tools which require ALTO data, especially for the DFG viewer. Storing ALTO and delivering it via HTTP works best with small files, so I would not want to add information which is not useful. Issue #2700 is an example of important information which is currently missing. Of course it should be possible to record all OCR results including glyphs and alternate choices. But isn't hOCR sufficient for that? |
That's but one of many equally valid use-cases. Besides, why not compress the xml when sending? (The impact of these extra annotations on compressed file size should be marginal.) Regardless, the proposed extra option should completely take care of this – whether or not this is to be enabled by default would be debatable.
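The compression argument above can be checked with a short script. This is only a minimal sketch: the helper name `size_report` and the synthetic markup are assumptions made for illustration, not Tesseract output, and highly repetitive synthetic data compresses far better than a real page would.

```python
import gzip


def size_report(xml_bytes: bytes, level: int = 6) -> dict:
    """Return raw and gzip-compressed sizes (in bytes) for an XML payload."""
    raw = len(xml_bytes)
    packed = len(gzip.compress(xml_bytes, compresslevel=level))
    return {"raw": raw, "gzip": packed, "ratio": packed / raw}


# Synthetic stand-ins for ALTO markup (hand-written, NOT real Tesseract output).
WORD = b'<String CONTENT="ab" HPOS="10" VPOS="20" WIDTH="30" HEIGHT="12"/>\n'
WORD_WITH_GLYPHS = (
    b'<String CONTENT="ab" HPOS="10" VPOS="20" WIDTH="30" HEIGHT="12">\n'
    b'  <Glyph CONTENT="a" HPOS="10" VPOS="20" WIDTH="15" HEIGHT="12"/>\n'
    b'  <Glyph CONTENT="b" HPOS="25" VPOS="20" WIDTH="15" HEIGHT="12"/>\n'
    b'</String>\n'
)

without_glyphs = size_report(WORD * 500)
with_glyphs = size_report(WORD_WITH_GLYPHS * 500)

if __name__ == "__main__":
    print("without glyphs:", without_glyphs)
    print("with glyphs:   ", with_glyphs)
```

To judge the real impact, the same function would have to be run on actual ALTO files produced with and without glyph output, for example `size_report(open(path, "rb").read())`.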
I didn't start off to solve everything ALTO here or prioritize. Besides, your #2705 already solves that, doesn't it?
hOCR is inadequate for many reasons. Besides, ALTO does have that representation itself – why ignore it? Different output renderers should not compete with each other IMO – they should each try to provide the best they can. |
It already does now! |
Partial CI failure (on macos) is unrelated (it cannot find |
@bertsky, I just tested the new code. The size of the ALTO output for a single page increased from 40091 to 223133 bytes, mainly because each glyph now gets its own XML element. I think the default should be close to the old output, that is, no glyphs and compatible with the DFG viewer. Do we require a new parameter to enable more detailed output, or would it be sufficient to use |
@stweil you still did not elaborate on why exactly the output is incompatible now. Is it really file size? (I cannot believe this.) Or rather the v4 namespace? (Then we need an option for/against that.)
This is completely unrelated I am afraid. If the issue is really size, not namespace, then it should be something like |
@bertsky, @zdenop, that was caused by a software update of the Travis build infrastructure which also replaced cmake by a newer version. Tesseract's build cache still used the old cmake link which was no longer valid. I cleaned the cache, so it passes now. |
@bertsky The ALTO glyph data is not relevant for presentation in viewers. Increasing the file size by such a magnitude is undesirable when the added data has no value for web users. As you mentioned, it can be a handy debugging feature, but that is only relevant in a development context, so having a flag that is disabled by default makes sense IMHO. |
@M3ssman thanks for reviving the discussion!
You mean for document presentation scenarios like DFG-Viewer I guess. But viewers could also be targeting evaluation (e.g. showing errors/differences between different versions visually) and GT production. I'll gladly add the extra parameter. But before I do, can someone please confirm the issue with DFG-Viewer is not the v4 namespace? (Because if it is, then it would make more sense to make the parameter about v3 vs v4 instead of glyphs or not.) |
- use TextBlock, Illustration, GraphicalElement (not just TextBlock), as appropriate for the internal block types
- do not enter RIL_TEXTLINE, RIL_WORD, RIL_SYMBOL and ChoiceIterator on anything other than TextBlocks
- refactor loop to make it more readable
Force-pushed: 1dab4e0 to 7d9437f
Sorry, I just rebased to current master.
|
I agree that glyph information should not be in the ALTO output by default. Concerning the v3-vs.-v4 question, I will try to reach out to the DFG viewer team. |
The DFG Viewer completely ignores the namespace and parses ALTO files only down to the textline level, so it should handle v4 well. Do you have some example files I can use to verify? (Side note: While displaying the fulltext in the DFG-Viewer should work for v4, indexing in Kitodo.Presentation is broken because of a currently fixed namespace URI. We'll have to make this more flexible: kitodo/kitodo-presentation#488) |
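A consumer that ignores the namespace and stops at the text line level could look roughly like the following. This is a hedged sketch, not the DFG Viewer's actual code: the function names and the sample markup are assumptions, and it simply joins the CONTENT of the String children of each TextLine while never descending into Glyph elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the namespace URI is filled in per ALTO version.
SAMPLE = """<alto xmlns="{ns}"><Layout><Page><PrintSpace><TextBlock>
<TextLine HPOS="10" VPOS="20" WIDTH="100" HEIGHT="12">
<String CONTENT="Hello"><Glyph CONTENT="H"/></String><SP/><String CONTENT="ALTO"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml: str) -> list:
    """Return the text of every TextLine, regardless of the ALTO namespace."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only direct String children are read; Glyph elements are ignored.
        words = [child.get("CONTENT", "") for child in elem
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


v3_lines = text_lines(SAMPLE.format(ns="http://www.loc.gov/standards/alto/ns-v3#"))
v4_lines = text_lines(SAMPLE.format(ns="http://www.loc.gov/standards/alto/ns-v4#"))
```

Because only the local element names are compared, the v3 and v4 variants of the sample yield the same lines, and the extra Glyph element makes no difference to the result.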
Ok, so assuming the namespace URI will be more flexible in Kitodo.Presentation, does that already warrant producing v4 here, or does backwards-compatibility (for supposedly lots of old viewer installations) still triumph? (In the former case, I would add an option |
I'd prefer backwards compatibility! So +1 for |
What about using |
I will. Sorry about the delay! |
@bertsky Is this PR ready to merge? |
@Shreeshrii no, I still have to add |
@bertsky, do you remember this PR? |
@amitdo, yes I do. Apologies for keeping you all waiting for so long. I first have to bisect lots of uncommitted changes, which include an implementation of PSM_AUTO_ONLY and fixes to the page/result iterator functions (to avoid missing segments or stopping short under certain rare conditions). |
About the implementation of PSM_AUTO_ONLY. Please consider doing it in another PR. |
Of course! I'll factor all said changes into separate PRs and test them thoroughly before publishing. |
Last chance before the final release of 5.0.0. |
Thanks. IIUC this board contains conflicting statements about this PR (which contains the feature of producing image and table regions in ALTO): Anyway, I will try to get the above discussed blockers out of the way quickly. |
@bertsky: we would like to release 5.3.0 in mid-December. Can you finish this PR for it? |
@zdenop I'll revisit soon, yes. |
This adds RIL_SYMBOL bboxes and text to the ALTO output via Glyph, which was introduced with v4, hence the namespace update. Looking at the changelog of the schema XSD, I don't see any problems in terms of backwards incompatibility (at least for the features we have been using so far here). The output seems to always validate on what I have seen so far. But there might be surprises. @jakesebright or @stweil maybe you want to take a look?
I don't know of any tools that can actually visualize ALTO Glyphs yet. PageViewer does not render them yet AFAICT. I have no Aletheia though.
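For readers who want to inspect the per-character boxes without a visualizer, the Glyph elements can be read with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the sample below is hand-written to follow the ALTO v4 schema (Glyph under String, with an optional Variant child) and is not actual output of this PR, and `glyph_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ALTO_V4 = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"alto": ALTO_V4}

# Hand-written sample following the ALTO v4 schema, NOT real Tesseract output.
SAMPLE = f"""<alto xmlns="{ALTO_V4}">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hi" HPOS="10" VPOS="20" WIDTH="20" HEIGHT="12">
      <Glyph CONTENT="H" GC="0.98" HPOS="10" VPOS="20" WIDTH="11" HEIGHT="12">
        <Variant CONTENT="N" VC="0.01"/>
      </Glyph>
      <Glyph CONTENT="i" GC="0.95" HPOS="22" VPOS="20" WIDTH="8" HEIGHT="12"/>
    </String>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def glyph_boxes(alto_xml: str) -> list:
    """Return (content, hpos, vpos, width, height) for every Glyph element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for glyph in root.iterfind(".//alto:Glyph", NS):
        boxes.append((
            glyph.get("CONTENT"),
            float(glyph.get("HPOS")),
            float(glyph.get("VPOS")),
            float(glyph.get("WIDTH")),
            float(glyph.get("HEIGHT")),
        ))
    return boxes


boxes = glyph_boxes(SAMPLE)

if __name__ == "__main__":
    for box in boxes:
        print(box)
```

Unlike a namespace-agnostic reader, this sketch matches the v4 namespace explicitly, so it would find nothing in a file that declares a different ALTO namespace.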