Processing history #39

cneud · 2016-06-16T16:16:26Z

Recently, several feature requests were submitted that relate to the recording of processing information in ALTO (see #13, #27, #36, #35 for in-depth information). In an attempt to consolidate and harmonize the requests, this issue shall serve as the main point of discussion from now on.

Features requested:

Change OCRProcessing to generic Processing (Add Processing to replace OCRProcessing #13, Provenance for OCRProcessing/Processing and Content #35).
Change preProcessingStep, ocrProcessingStep, postProcessingStep to generic processingStep with processingStepType element to record the type of processing performed (Add Processing to replace OCRProcessing #13).
Add required attribute ID to ProcessingStepType (Add Processing to replace OCRProcessing #13, Process Result tracking (IMPACT) #27, Provenance for OCRProcessing/Processing and Content #35).
Add optional attributes COR (CORRECTEDBY), VER (VERIFIEDBY) for all elements. The attributes hold a list of references (using the ID attribute) to all processingStepType entries which have changed the original value (Process Result tracking (IMPACT) #27).
Allow linking elements to a particular processingStep (Provenance for OCRProcessing/Processing and Content #35).
Example: Use Tesseract's page segmentation with Ocropus's recognition, so that TextLine elements are sourced from one ProcessingStep (Tesseract), but their text content from another one (Ocropus).
Common vocabulary of processingStepType attribute values to increase interoperability (Vocabulary for ProcessingStepDescriptions #36)

jukervin · 2016-07-21T12:47:57Z

At a minimum, a common vocabulary is needed for processingStepType.

Jo-CCS · 2016-07-27T13:26:52Z

To reference the processing IDs on the elements, I propose adding a space-separated list of IDs, as METS does for DMDID:

<xsd:attribute name="DMDID" type="xsd:IDREFS" use="optional"/>

I would also recommend making at least the type and description of the processing node mandatory. The datetime could be mandatory as well, since it records important information about what was done and when.

cneud · 2016-07-28T12:45:12Z

Minutes of the technical call 27-07-2016

I. Change OcrProcessing to Processing and preProcessingStep, ocrProcessingStep, postProcessingStep to generic processingStep with ProcessingStepType and required ID attribute.

Is there a need to explicitly define order/sequence of processing?
Does ProcessingDateTime define start or end of processing?
Possibility to derive duration of processing?

Instead of ProcessingDateTime, there should be ProcessingStartDateTime and ProcessingEndDateTime (mandatory). Duration can then be inferred if needed.
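
The duration inference mentioned above could look roughly like this in Python. This is only a sketch: the helper name `processing_duration` and the sample timestamps are made up for illustration, and it assumes both values are stored as ISO 8601 strings without a timezone suffix.

```python
from datetime import datetime, timedelta


def processing_duration(start: str, end: str) -> timedelta:
    """Return end - start for two ISO 8601 timestamps (hypothetical helper)."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        # An end before the start indicates inconsistent metadata.
        raise ValueError("processing end precedes processing start")
    return ended - started


# Made-up sample values for ProcessingStartDateTime / ProcessingEndDateTime.
duration = processing_duration("2016-07-27T13:00:00", "2016-07-27T13:02:30")
print(duration.total_seconds())
```

With both timestamps mandatory, a consumer never has to guess whether a single datetime marks the start or the end of a step.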

How to represent sequence of processingStep?

Follow example of METS with space separated IDs - e.g.

<TextLine ID="ID069" [...] PROCESSINGREFS="ID001 ID002 ID003 ID004 ID005">
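
To illustrate how a consumer might resolve such a space-separated reference list, here is a minimal Python sketch. The XML fragment is hypothetical and namespace-free: the nesting of processingStep under Processing, the sample type values, and the helper name `resolve_processing_refs` are assumptions for illustration, not the agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; layout and type values are assumed, not taken from a schema.
SAMPLE = """\
<alto>
  <Processing>
    <processingStep ID="ID001"><processingStepType>binarization</processingStepType></processingStep>
    <processingStep ID="ID002"><processingStepType>ocr</processingStepType></processingStep>
  </Processing>
  <TextLine ID="ID069" PROCESSINGREFS="ID001 ID002"/>
</alto>
"""


def resolve_processing_refs(root, element):
    """Return the processingStep elements referenced by element, in listed order."""
    steps = {step.get("ID"): step for step in root.iter("processingStep")}
    refs = element.get("PROCESSINGREFS", "").split()  # IDREFS are whitespace separated
    missing = [ref for ref in refs if ref not in steps]
    if missing:
        raise KeyError(f"unresolved processing references: {missing}")
    return [steps[ref] for ref in refs]


root = ET.fromstring(SAMPLE)
line = root.find("TextLine")
step_types = [s.findtext("processingStepType") for s in resolve_processing_refs(root, line)]
print(step_types)
```

The order of IDs in the attribute is preserved here, so it could double as the processing sequence if that convention were adopted.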

II. Add optional attributes COR (CORRECTEDBY), VER (VERIFIEDBY) for all elements.

Example newspaper digitisation: only headlines manually corrected
Rather just change CS (correction status) attribute?
This would result in mixing of metadata & content
Difficulty of inheritance - can child elements inherit processingStep from parent? For which use cases? E.g. Binarization (yes) vs. Rotation (no).

Further discussion and examples are required.

III. Common vocabulary for processingStepType

E.g. PREMIS uses generic eventType or METS agent
Huge diversity of processingStep
Few, very generic categories, e.g. relating to image/text/annotations(tags)?
Need better understanding of use cases.

Look at other examples (interoperability). Keep it practical.

jpmoreux · 2016-09-02T09:29:06Z

Practical use cases I recently encountered:

correction of rotation on images
correction of curvature on images
localized text correction (newspaper headlines)
semantic annotation: named entity recognition, signature recognition (authors, illustrators)
structure extraction: ALTO -> EPUB -> ALTO with some logical structure embedded

As I wrote in another comment, we should advise people to store this information in the document manifest (METS, etc.), at the highest possible level, if the same processing is applied to all the ALTO files of a specific document. But some kinds of processing are worth describing locally (e.g. which text blocks have been corrected).

cneud · 2017-03-30T14:55:09Z

Minutes of the technical call 30-03-2017

omit optional attributes COR (CORRECTEDBY), VER (VERIFIEDBY) and rather express this via processingStepType vocabulary instead
introduce renaming of OcrProcessing to Processing and preProcessingStep, ocrProcessingStep, postProcessingStep to generic processingStep with ProcessingStepType and add required ID attribute concurrently to current history tracking, which will be declared deprecated
add common vocabulary comprising the types ContentGeneration, ContentModification, PreOperation, PostOperation, Other and referencing a list of IDs as in METS DMDID's
@cneud to create a new draft schema with these changes before next board call
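
As a rough illustration of how a tool could check values against the five vocabulary types listed above, here is a Python sketch. The enum and helper names are hypothetical, and the exact spelling and casing of the values in the final schema may differ from what is written in these minutes.

```python
from enum import Enum


class ProcessingStepType(Enum):
    """The five types named in the minutes (spelling as written there)."""
    CONTENT_GENERATION = "ContentGeneration"
    CONTENT_MODIFICATION = "ContentModification"
    PRE_OPERATION = "PreOperation"
    POST_OPERATION = "PostOperation"
    OTHER = "Other"


def parse_step_type(value: str) -> ProcessingStepType:
    """Return the matching vocabulary entry or raise ValueError (hypothetical helper)."""
    try:
        return ProcessingStepType(value)
    except ValueError:
        allowed = ", ".join(t.value for t in ProcessingStepType)
        raise ValueError(f"unknown step type {value!r}; expected one of: {allowed}") from None


print(parse_step_type("ContentGeneration").name)
```

Rejecting unknown values rather than silently mapping them to Other is a design choice of this sketch; a more lenient tool might prefer the opposite.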

acpopat · 2017-08-02T00:19:35Z

Hi, I'm interested in participating in discussions on this topic. I'm new to the topic of data provenance, aside from having used systems in the past that provided some form of it.

cneud · 2017-08-02T12:19:15Z

Initial draft for changes listed above: https://github.com/altoxml/schema/tree/master/v4

cneud · 2017-08-02T12:20:32Z

Possibly of interest for more sophisticated provenance/processing history tracking:
https://www.w3.org/TR/prov-overview/

acpopat · 2017-10-19T23:51:26Z

Some processing histories may not be simple sequential pipelines and may require a more general graph structure. As mentioned in today's board call, some OCR post-correction schemes provide examples of such processing:

merging results from multiple OCR engines

post-correction using multiple information sources

coalescing information from multiple page images and their OCR results

If it is desired that the results of such processing be represented in ALTO, then a more general provenance scheme capable of representing graph-structured dependencies might be required, such as that referred to by Clemens in his Aug 2 comment.
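
To make the difference from a linear pipeline concrete, here is a small Python sketch of the first case (merging results from multiple OCR engines) as a dependency graph. The step names and the `ancestors` helper are hypothetical; this illustrates the data structure only and is not a proposal for how ALTO should encode it. It requires Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each hypothetical step maps to the set of steps whose output it consumes.
DEPENDS_ON = {
    "binarize": set(),
    "ocr_engine_a": {"binarize"},
    "ocr_engine_b": {"binarize"},
    "merge_results": {"ocr_engine_a", "ocr_engine_b"},
}


def ancestors(step, graph):
    """Return every step that directly or indirectly contributed to the given step."""
    seen = set()
    stack = list(graph[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


# A valid execution order always lists a step after everything it depends on.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order[0], order[-1])
print(sorted(ancestors("merge_results", DEPENDS_ON)))
```

A flat, ordered list of step IDs cannot express that the two OCR runs are independent of each other while the merge depends on both, which is the kind of information a graph-based provenance model would capture.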

cneud · 2018-04-24T14:58:36Z

Included in v4.0.

cneud added 1 submitted processing history labels Jun 16, 2016

cneud self-assigned this Jun 16, 2016

cneud added 2 discussion and removed 1 submitted labels Jun 16, 2016

This was referenced Jun 16, 2016

Vocabulary for ProcessingStepDescriptions #36

Closed

Provenance for OCRProcessing/Processing and Content #35

Closed

Process Result tracking (IMPACT) #27

Closed

Add Processing to replace OCRProcessing #13

Closed

cneud mentioned this issue Aug 2, 2017

Glyphs (IMPACT) #26

Closed

cneud added 3 review 6 voting and removed 2 discussion 3 review labels Nov 29, 2017

Jo-CCS added 7 public comment and removed 6 voting labels Jan 22, 2018

cneud mentioned this issue Apr 24, 2018

Capturing complex workflow provenance #47

Open

cneud closed this as completed Apr 24, 2018

cneud added 8 published and removed 7 public comment labels Apr 24, 2018

cneud mentioned this issue Sep 25, 2020

Display document page metadata qurator-spk/dinglehopper#16

Open
