Processing history #39
Comments
A minimal common vocabulary is needed for processingStepType. |
To reference processing IDs from the elements, I propose adding a space-separated list of IDs, as METS does for DMDIDs: <xsd:attribute name="DMDID" type="xsd:IDREFS" use="optional"/>. I would also recommend making at least the type and description of the processing node mandatory. The datetime could be mandatory as well, since it carries important information about what was done and when. |
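A minimal sketch of how a consumer might check such a space-separated reference list, assuming a hypothetical ALTO-like fragment: the `processingStep` element, the `PROCESSINGREFS` attribute name, and the sample IDs are illustrative assumptions, not the final schema. It reports references that point at no declared processing step, which is the check an `xsd:IDREFS` type would otherwise delegate to a schema validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment; element and attribute names are illustrative only.
SAMPLE = """
<alto>
  <Processing>
    <processingStep ID="P1"/>
    <processingStep ID="P2"/>
  </Processing>
  <Layout>
    <TextLine ID="L1" PROCESSINGREFS="P1 P2"/>
    <TextLine ID="L2" PROCESSINGREFS="P2 P9"/>
  </Layout>
</alto>
"""


def dangling_refs(xml_text, attr="PROCESSINGREFS"):
    """Map each element ID to the referenced IDs that no processingStep declares."""
    root = ET.fromstring(xml_text)
    declared = {step.get("ID") for step in root.iter("processingStep")}
    problems = {}
    for elem in root.iter():
        # An IDREFS-style value is a whitespace-separated list of IDs.
        refs = elem.get(attr, "").split()
        missing = [ref for ref in refs if ref not in declared]
        if missing:
            problems[elem.get("ID")] = missing
    return problems


print(dangling_refs(SAMPLE))
```

Here `L2` refers to `P9`, which is never declared, so it is the only element reported.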
Minutes of the technical call 27-07-2016
I. Change OcrProcessing to Processing, and replace preProcessingStep, ocrProcessingStep, and postProcessingStep with a generic processingStep that has a ProcessingStepType and a required ID attribute.
Instead of ProcessingDateTime, there should be ProcessingStartDateTime and ProcessingEndDateTime (mandatory). Duration can then be inferred if needed.
Follow the example of METS with space-separated IDs, e.g. <TextLine ID="ID069" [...] PROCESSINGREFS="ID001 ID002 ID003 ID004 ID005">
II. Add optional attributes COR (CORRECTEDBY) and VER (VERIFIEDBY) for all elements.
Further discussion and examples are required.
III. Common vocabulary for processingStepType.
Look at other examples (interoperability). Keep it practical. |
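As a rough illustration of the point in item I that duration can be inferred from the two timestamps, here is a small sketch. It assumes the start and end values are stored as ISO 8601 strings; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime


def step_duration(start, end):
    """Return the elapsed time between two ISO 8601 timestamps as a timedelta."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        raise ValueError("processing end precedes processing start")
    return ended - started


# Hypothetical ProcessingStartDateTime / ProcessingEndDateTime values.
elapsed = step_duration("2016-07-27T10:00:00", "2016-07-27T10:02:30")
print(elapsed.total_seconds())
```

Storing both timestamps rather than a duration keeps the absolute times available, and the elapsed time stays a one-line computation.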
Practical use cases I recently encountered:
As I wrote in another comment, we should advise people to store this information in the document manifest (METS, etc.), at the highest possible level, if the same processing is applied to all the ALTO files of a specific document. But some kinds of processing are worth describing locally (e.g. which text blocks have been corrected). |
Minutes of the technical call 30-03-2017
|
Hi, I'm interested in participating in discussions on this topic. I'm new to data provenance, aside from having used systems in the past that provided some form of it. |
Initial draft for changes listed above: https://github.com/altoxml/schema/tree/master/v4 |
Possibly of interest for more sophisticated provenance/processing history tracking: |
Some processing histories may not be simple sequential pipelines and may require a more general graph structure. As mentioned in today's board call, some OCR post-correction schemes provide examples of such processing:
- merging results from multiple OCR engines
- post-correction using multiple information sources
- coalescing information from multiple page images and their OCR results
If it is desired that the results of such processing be represented in ALTO, then a more general provenance scheme capable of representing graph-structured dependencies might be required, such as that referred to by Clemens in his Aug 2 comment. |
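To make the graph-structured case concrete, here is a minimal sketch that models a processing history as a dependency graph instead of a linear list. The step names and the `DEPENDS_ON` mapping are hypothetical; this is not a proposal for how ALTO should serialize such a graph, only an illustration of what a sequential list of steps cannot express.

```python
from graphlib import TopologicalSorter

# Hypothetical step IDs; each maps to the set of steps whose output it consumes.
DEPENDS_ON = {
    "scan": set(),
    "ocr_engine_a": {"scan"},
    "ocr_engine_b": {"scan"},
    "merge": {"ocr_engine_a", "ocr_engine_b"},
    "post_correction": {"merge"},
}


def ancestors(step, depends_on):
    """Return every step that contributed, directly or indirectly, to `step`."""
    seen = set()
    stack = list(depends_on[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen


# One valid execution order: every step appears after the steps it depends on.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
print(sorted(ancestors("merge", DEPENDS_ON)))
```

The `merge` step has two parents, so a single "previous step" pointer is not enough; each step needs a set of inputs.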
Included in v4.0. |
Recently, several feature requests were submitted that relate to the recording of processing information in ALTO (see #13, #27, #36, #35 for in-depth information). In an attempt to consolidate and harmonize the requests, this issue shall serve as the main point of discussion from now on.
Features requested:
Example: Use Tesseract's page segmentation with Ocropus's recognition, so that TextLine elements are sourced from one ProcessingStep (Tesseract), but their text content from another one (Ocropus).
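A mixed-engine case like this could be recorded along the following lines. This is only a sketch under stated assumptions: the `processingStep` element, its `software` attribute, the `PROCESSINGREFS` attribute, and the placeholder engine names are all hypothetical. The line element points at the segmentation step and its content element points at the recognition step.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the line geometry and the recognized text cite different steps.
SAMPLE = """
<alto>
  <Processing>
    <processingStep ID="SEG" software="engine-A"/>
    <processingStep ID="REC" software="engine-B"/>
  </Processing>
  <TextLine ID="TL1" PROCESSINGREFS="SEG">
    <String ID="S1" CONTENT="Hello" PROCESSINGREFS="REC"/>
  </TextLine>
</alto>
"""


def software_by_element(xml_text):
    """Map each referencing element ID to the software names of the steps it cites."""
    root = ET.fromstring(xml_text)
    software = {
        step.get("ID"): step.get("software")
        for step in root.iter("processingStep")
    }
    result = {}
    for elem in root.iter():
        refs = elem.get("PROCESSINGREFS")
        if refs:
            result[elem.get("ID")] = [software[ref] for ref in refs.split()]
    return result


print(software_by_element(SAMPLE))
```

Attaching references at the element level is what lets the line and its content carry different provenance; a single document-level processing record could not distinguish them.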