
Ingest workflow

hochla-simon edited this page May 10, 2019 · 55 revisions

This page describes how a SIP package is transformed into an AIP package by the BPM process called Ingest workflow. The Ingest workflow consists of compulsory and optional BPM tasks that represent the elementary parts of the transformation. Compulsory tasks are grouped in the subprocess labeled finalize ingest; the other tasks (from format identification tool to fixity generator) are optional. SIP processing starts in the init event and finishes in the ingest success event on success, or in the ingest error event on failure.

The picture shows the BPM process definition of resources/bpmn/ingestWorkflow.bpmn opened in Camunda BPMN software.

Preprocessing

Just before the execution of the BPM process starts, the SIP package is preprocessed. This includes the following steps:

    1. verification of the hash of the incoming SIP package against the hash value supplied in the *.sums file
    2. copying of the SIP package content to the workspace
    3. computation of fixity for files in the root folder (the METS file of a SIP typically does not store fixities for files such as info.xml, METS.xml or nameOfPackage.md5 located in the root folder, so they need to be computed additionally)
    4. computation of the file size for every file of the SIP
    5. creation of a new authorial package and SIP package, or assignment of existing ones, according to the determined level of versioning
    6. initialization of the Ingest workflow and the BPM process variables

Versioning

ARCLib automatically determines when SIP or XML versioning needs to be performed. If the extracted authorial id in combination with the producer profile id matches an existing authorial package in the database, versioning is triggered; otherwise a new authorial package is created and no versioning is performed.

a) XML versioning is performed if the SIP with the highest version number among the SIPs belonging to the authorial package has the same checksum as the incoming SIP.

b) SIP versioning is performed if the SIP with the highest version number among the SIPs belonging to the authorial package has a different checksum from the incoming SIP.
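
The decision rules above can be sketched as a small function. This is only an illustration in Python, not ARCLib's implementation: the names `StoredSip`, `Versioning` and `decide_versioning` are hypothetical, and passing `None` stands for "no matching authorial package was found".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Versioning(Enum):
    NONE = "new authorial package, no versioning"
    XML = "XML versioning"
    SIP = "SIP versioning"


@dataclass(frozen=True)
class StoredSip:
    """Hypothetical view of a SIP already stored for an authorial package."""
    version_number: int
    checksum: str


def decide_versioning(existing_sips: Optional[Sequence[StoredSip]],
                      incoming_checksum: str) -> Versioning:
    # No matching authorial package (or, as an assumption, one without SIPs).
    if not existing_sips:
        return Versioning.NONE
    # Compare against the SIP with the highest version number.
    latest = max(existing_sips, key=lambda sip: sip.version_number)
    if latest.checksum == incoming_checksum:
        return Versioning.XML
    return Versioning.SIP


stored = [StoredSip(1, "aaa"), StoredSip(2, "bbb")]
```

With the sample data, an incoming checksum of `"bbb"` matches the latest SIP and yields XML versioning, while `"aaa"` only matches an older version and therefore yields SIP versioning.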

BPM tasks

Compulsory tasks

ArclibXML extractor

This task extracts metadata from the original SIP using the XSLT specified in a SIP profile.

ARCLib extracts the specified metadata from the XML files of the SIP package and produces the primary version of ARCLibXml. The extraction is defined by an XSLT template that is stored in the SIP profile together with the path to the SIP META XML (e.g. the main METS file).

The sample XSLT template located at resources/sipProfiles/comprehensiveSipProfile.xsl provides the following mapping (it can be used as the starting point when defining custom templates).

Identity mapping:

| XPath | Value |
| --- | --- |
| /METS:mets/@LABEL | /METS:mets/@LABEL |
| /METS:mets/@TYPE | /METS:mets/@TYPE |
| /METS:mets/METS:metsHdr/METS:agent | /METS:mets/METS:metsHdr/METS:agent |
| /METS:mets/METS:dmdSec | /METS:mets/METS:dmdSec |

Aggregated mapping:

The elements ARCLIB:formats, ARCLIB:devices, ARCLIB:eventAgents, ARCLIB:ImageCaptureMetadata and ARCLIB:creatingApplications are computed using aggregation. The source values are extracted from files located in the folder amdSec whose filenames match the pattern amd*.xml. For details see the template.

ArclibXML generator

This task generates additional parts of ARCLibXML using the extracted metadata and the values computed during the ingest workflow process. It appends these parts to the extracted metadata from the task ARCLibXML extractor.

The ArclibXML generation consists of the following phases:

  1. if necessary, changing the mets namespace prefix to upper case METS
  2. adding METS:OBJID
  3. filling METS:metsHdr (the element METS:metsHdr must already exist in the XML produced by the XSLT metadata extraction)
  4. adding SIP and XML versions and related SIP and XML
  5. adding premis:agents and respective premis:events
  6. adding premis:object for whole package
  7. adding METS:fileSec
  8. adding METS:structMap

The generated ARCLibXML is validated using ArclibXmlValidator. The validation consists of six parts:

  1. XML schema validation - METS, ARCLIB_XSD, PREMIS
  2. Checking existence of required nodes - src/main/resources/arclibXmlValidationChecks.txt
  3. Checking content of the node with AIP id - consistency with the value in the entity saved in the database
  4. Checking content of the node with authorial id - consistency with the value in the entity saved in the database
  5. Checking content of the node with physicalVersionNumber - consistency with the value in the entity saved in the database
  6. Checking content of the node with physicalVersionOf - consistency with the value in the entity saved in the database

If the validation succeeds, the ARCLibXml is stored to the index with the state PROCESSED.
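
As an illustration of the second validation part (existence of required nodes), the following minimal Python sketch checks a list of paths against an XML document. It is not the ArclibXmlValidator: the function name, the sample document and the list of paths are assumptions, and the format of arclibXmlValidationChecks.txt is not reproduced here. Paths are written relative to the root element because the standard library's ElementTree does not accept absolute paths.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace URI of the METS schema.
NAMESPACES: Dict[str, str] = {"METS": "http://www.loc.gov/METS/"}

# Hypothetical, heavily shortened document used only for the demonstration.
SAMPLE_XML = """\
<METS:mets xmlns:METS="http://www.loc.gov/METS/" OBJID="sample-id">
  <METS:metsHdr ID="ARCLIB_1"/>
</METS:mets>
"""


def missing_nodes(xml_text: str, required_paths: List[str]) -> List[str]:
    """Return the required paths that match no element in the document."""
    root = ET.fromstring(xml_text)
    return [path for path in required_paths
            if root.find(path, NAMESPACES) is None]


MISSING = missing_nodes(
    SAMPLE_XML,
    ["./METS:metsHdr", "./METS:metsHdr[@ID]", "./METS:fileSec"],
)
```

For the sample document only `./METS:fileSec` is reported as missing.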

This table describes the detailed mapping of generated values to the elements of ARCLibXml.

| XPath | Value |
| --- | --- |
| /METS:mets/@OBJID | id of SIP package - UUID |
| /METS:mets/METS:metsHdr/@CREATEDATE | date and time of start of ingest workflow process |
| /METS:mets/METS:metsHdr/@LASTMODDATE | date and time of finish of ingest workflow process |
| /METS:mets/METS:metsHdr/METS:altRecordID[@TYPE='original SIP identifier'] | authorial id |
| /METS:mets/METS:metsHdr/@ID | XML ID, e.g. ARCLIIB_000000004 |
| /METS:mets/METS:metsHdr/METS:agent/@ROLE | CREATOR |
| /METS:mets/METS:metsHdr/METS:agent/@TYPE | ORGANIZATION |
| /METS:mets/METS:metsHdr/METS:agent/METS:name | name of the producer |
| /METS:mets/METS:dmdSec/METS:mdWrap[@MDTYPE='DC']/METS:xmlData/dcterms:sipVersionNumber | number of the SIP version |
| /METS:mets/METS:dmdSec/METS:mdWrap[@MDTYPE='DC']/METS:xmlData/dcterms:sipVersionOf | ID of the previous version SIP |
| /METS:mets/METS:dmdSec/METS:mdWrap[@MDTYPE='DC']/METS:xmlData/dcterms:xmlVersionNumber | number of the XML version |
| /METS:mets/METS:dmdSec/METS:mdWrap[@MDTYPE='DC']/METS:xmlData/dcterms:xmlVersionOf | ID of the previous version XML |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectIdentifier/premis:objectIdentifierType | local |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectIdentifier/premis:objectIdentifierValue | obj-package |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:compositionLevel | 0 |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:fixity/premis:messageDigestAlgorithm | MD5 / CRC32 / SHA-512 |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:fixity/premis:messageDigest | value of the fixity |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:size | size of the SIP in bytes |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:format/premis:formatDesignation/premis:formatName | application/zip |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:format/premis:formatRegistry/premis:formatRegistryName | PRONOM |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:objectCharacteristics/premis:format/premis:formatRegistry/premis:formatRegistryKey | x-fmt/263 |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:linkingEventIdentifier/premis:linkingEventIdentifierType | EventId |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:object/premis:linkingEventIdentifier/premis:linkingEventIdentifierValue | one of: ingestion_event, validation_event, quarantine_event, message_digest_calculation_event, metadata_extraction_event, fixity_check_event, format_check_event, format_identifiation_event, virus_check_event, metadata_modification_event |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:formatRegistryKey | x-fmt/263 |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:formatRegistryName | PRONOM |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:creatingApplicationName | DROID format identification tool |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:creatingApplicationVersion | application version, e.g. DROID: version: 6.4, Signature files: 1. Type: Container Version: 20171130 File name: container-signature-20171130.xml 2. Type: Binary Version: 93 File name: DROID_SignatureFile_V93.xml |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:dateCreatedByApplication | date of identification, e.g. 2018-07-30 |
| /METS:mets/METS:amdSec/METS:techMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/ARCLIB:formats/ARCLIB:format/ARCLIB:fileCount | count of files with the format |
| /METS:mets/METS:amdSec/METS:digiprovMD/@ID | ID of the agent, e.g. AGENT_003 |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:agent/premis:agentIdentifierType | AgentID |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:agent/premis:agentIdentifierValue | unique identifier of the agent, e.g. agent_DROID |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:agent/premis:agentName | name of the agent, e.g. DROID, Clam Av |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:agent/premis:agentType | type of agent, e.g. software |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:agent/premis:agentNote | other information about the agent, e.g. the version number |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventIdentifier/premis:eventIdentifierValue | unique identifier of the event, e.g. EVENT_7 |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventIdentifier/premis:eventIdentifierType | EventId |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:linkingAgentIdentifier/premis:linkingAgentIdentifierType | AgentId |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:linkingAgentIdentifier/premis:linkingAgentIdentifierValue | unique identifier of the agent, e.g. agent_DROID, agent_ClamAv |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventDetail | optional additional information, e.g. XML was modified from the reason: XYZ |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventType | one of: ingestion, validation, fixity check, format identification, metadata extraction, quarantine, message digest calculation, virus check, metadata modification |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventDateTime | e.g. 2018-04-08T10:14:41Z |
| /METS:mets/METS:amdSec/METS:digiprovMD/METS:mdWrap[@MDTYPE='PREMIS']/METS:xmlData/premis:event/premis:eventOutcomeInformation/premis:eventOutcome | successful or unsuccessful |
| /METS:mets/METS:fileSec/METS:fileGrp/@USE | file |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/@ID | object identifier, e.g. obj-003 |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/METS:FLocat/@LOCTYPE | OTHER |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/METS:FLocat/@xlink:href | file path in scope of the SIP |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/@CHECKSUMTYPE | checksum type, e.g. MD5 |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/@CHECKSUM | checksum value |
| /METS:mets/METS:fileSec/METS:fileGrp/METS:file/@SIZE | size in bytes |
| /METS:mets/METS:structMap/@ID | Physical_Structure |
| /METS:mets/METS:structMap/@TYPE | PHYSICAL |
| /METS:mets/METS:structMap/METS:div/@TYPE | Archival Information Package |
| /METS:mets/METS:structMap/*{METS:div}/METS:div/@LABEL | path to the directory |
| /METS:mets/METS:structMap/*{METS:div}/METS:fptr/@FILEID | id of the object, e.g. obj-021 |

Archival storage

This task stores the AIP to archival storage. The archival storage is accessed through a REST interface; depending on the level of versioning of the SIP, either the endpoint for XML update or the endpoint for storing an AIP is called. If debugging mode is active, an internal debugging version of archival storage is used instead of the real archival storage. The Workflow definition file specifies a failed job retry time cycle on this task so that ARCLib repeatedly tries to store the AIP package to archival storage. This handles temporary breakdowns of the connection to Archival storage. The retry time cycle is specified in the format R5/PT1M, where the part before the slash denotes the number of retries and the part after the slash states the time to wait between the retries.
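
To make the R5/PT1M notation concrete, here is a small Python sketch that splits such a value into the retry count and the waiting time. It is a simplified reading of the notation and not the parser used by the BPM engine: the function name is hypothetical and only hour, minute and second components are handled.

```python
import re
from datetime import timedelta
from typing import Tuple

# R<count>/PT<hours>H<minutes>M<seconds>S, each time component optional.
_CYCLE = re.compile(r"^R(\d+)/PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_retry_cycle(cycle: str) -> Tuple[int, timedelta]:
    """Split e.g. 'R5/PT1M' into (number of retries, wait between retries)."""
    match = _CYCLE.match(cycle)
    # Reject values that do not fit the pattern or carry no time component.
    if match is None or not any(match.groups()[1:]):
        raise ValueError(f"unsupported retry time cycle: {cycle!r}")
    retries = int(match.group(1))
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups()[1:])
    return retries, timedelta(hours=hours, minutes=minutes, seconds=seconds)


RETRIES, WAIT = parse_retry_cycle("R5/PT1M")
```

Here `R5/PT1M` is read as five retries with one minute between them.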

Storage success verifier

This task verifies that archival storage has succeeded in persisting the SIP (or in updating the XML, respectively). Archival storage is asked for the state of the AIP. Depending on the state:

a) ARCHIVED:

  1. the ingest workflow state and the indexed ARCLibXml document state are set to PERSISTED (ingest workflow states are described at the end of this page, ARCLibXml document states are described in the wiki page Usage/Aip search)
  2. a JMS message is sent to Coordinator to inform the batch that the ingest workflow process has finished
  3. SIP content is deleted from workspace
  4. SIP content is deleted from transfer area

b) PROCESSING or PRE_PROCESSING: the AIP has been successfully transferred to Archival storage, but Archival storage has not yet saved the AIP to all the storage services (ZFS, CEPH etc.). The BPM variable aipSavedCheckRetries is decremented and the Storage success verifier task is repeated after waiting for the time specified in the variable aipSavedCheckTimeout.

c) ARCHIVAL_FAILURE or ROLLED_BACK: the AIP has not been successfully transferred to Archival storage. The BPM variable aipStoreRetries is decremented and the BPM process execution returns to the Archival storage task to repeat the AIP storage after waiting for the time specified in the variable aipStoreTimeout.
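
The three branches can be summarized in a short decision function. The Python sketch below is only a model of the description above: `NextStep` and `next_step` are hypothetical names, the waiting times are left out, and handing over to the Storage error handler once a counter reaches zero is an assumption about how retry exhaustion is detected.

```python
from enum import Enum
from typing import Dict


class NextStep(Enum):
    FINISH = "set PERSISTED, notify Coordinator, clean up"
    RECHECK = "repeat Storage success verifier"
    STORE_AGAIN = "return to Archival storage task"
    STORAGE_ERROR = "hand over to Storage error handler"


def next_step(aip_state: str, variables: Dict[str, int]) -> NextStep:
    """Pick the next step and decrement the matching retry counter."""
    if aip_state == "ARCHIVED":
        return NextStep.FINISH
    if aip_state in ("PROCESSING", "PRE_PROCESSING"):
        counter, step = "aipSavedCheckRetries", NextStep.RECHECK
    elif aip_state in ("ARCHIVAL_FAILURE", "ROLLED_BACK"):
        counter, step = "aipStoreRetries", NextStep.STORE_AGAIN
    else:
        raise ValueError(f"unexpected AIP state: {aip_state!r}")
    # Assumption: no retries left means the storage error path is taken.
    if variables[counter] <= 0:
        return NextStep.STORAGE_ERROR
    variables[counter] -= 1
    return step


bpm_variables = {"aipSavedCheckRetries": 1, "aipStoreRetries": 0}
```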

Non compulsory tasks

Each of these tasks is optional (although the resulting ARCLibXml is not guaranteed to be valid if some of them are left out) and can be placed at an arbitrary position within the Ingest workflow pipeline.

Fixity generator

This task generates three types of fixity for the whole SIP (the .zip archive with the SIP content) using the algorithms MD5, SHA-512 and CRC32. The result is written in ARCLibXml to the premis:objectCharacteristics element of premis:object, and the respective event is recorded as a premis:event of type message digest calculation.
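
A minimal Python sketch of computing the three fixities in a single pass over the archive is shown below. It uses only the standard library; the function name and the shape of the returned dictionary are illustrative and do not mirror ARCLib's code.

```python
import hashlib
import io
import zlib
from typing import BinaryIO, Dict


def compute_fixities(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Dict[str, str]:
    """Read the stream once and return MD5, SHA-512 and CRC32 as hex strings."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    crc = 0
    # Reading in chunks keeps memory usage flat for large .zip archives.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha512.update(chunk)
        crc = zlib.crc32(chunk, crc)
    return {
        "MD5": md5.hexdigest(),
        "SHA-512": sha512.hexdigest(),
        "CRC32": format(crc & 0xFFFFFFFF, "08x"),
    }


# In-memory stand-in for an opened SIP archive.
EXAMPLE = compute_fixities(io.BytesIO(b"123456789"))
```

For a real package the stream would come from opening the SIP archive in binary mode, e.g. `open(path, "rb")`.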

Format identification tool

This task performs the format analysis of the files in the SIP. For every file of the SIP it aims to determine a single file format. The result is written in ARCLibXml to the element ARCLIB:formats as the aggregated formats, and the tool used in the identification is written as premis:agent (tool name, tool version). In the case of DROID the version also contains the signature files. Moreover, the associated events are written as premis:events. This includes the event of a successful identification and also the events when the identification tool failed to uniquely determine the format.

In some cases the identification tool identifies a file with multiple formats. The format ambiguity can be resolved with the predefined values specified in the Workflow config. If the predefined values resolve the ambiguity, the config used is written to the respective premis:event created specially for the ambiguity problem. If there are files whose format cannot be determined using the config, a new incident is created that contains information about the problematic files. ARCLib then waits for the user to provide a new config that manually resolves the format conflict.
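
The resolution of an ambiguous identification can be sketched as follows. This is a hedged Python illustration: `resolve_format` is a hypothetical helper, the entries follow the pathsAndFormats structure of the Workflow config, and matching the filePath regex against the whole file path is an assumption. The second filePath pattern in the sample is made up for the demonstration.

```python
import re
from typing import Dict, List, Optional, Set

PATHS_AND_FORMATS: List[Dict[str, str]] = [
    {"filePath": "this/is/a/filepath", "format": "fmt/101"},
    {"filePath": r"data/.*\.xml", "format": "fmt/993"},  # hypothetical pattern
]


def resolve_format(file_path: str,
                   candidates: Set[str],
                   paths_and_formats: List[Dict[str, str]]) -> Optional[str]:
    """Return the single format of a file, or None if it stays unresolved."""
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous or unidentified: fall back to the predefined values.
    for entry in paths_and_formats:
        if re.fullmatch(entry["filePath"], file_path):
            return entry["format"]
    return None  # the caller would report this file in an incident


RESOLVED = resolve_format("data/meta/info.xml", {"fmt/101", "fmt/993"}, PATHS_AND_FORMATS)
UNRESOLVED = resolve_format("other/file.bin", {"fmt/101", "fmt/993"}, PATHS_AND_FORMATS)
```

A `None` result corresponds to the case where an incident is created and a new config is awaited.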

The type of format identification tool is initialized based on the Workflow config. Currently there is only a single format identification tool: DROID.

Format identification with DROID consists of these stages:

  1. run profile: DROID is invoked with the options -R to search recursively, -a to create a profile from the SIP and -p to save the result to the specified file

  2. export profile: DROID is called to export the result of the specified profile to a CSV file with one row per format per profiled file (if a file has multiple format identifications, a separate row is written for each identification)

  3. parse CSV: ARCLib parses the values of the specified column (e.g. PUID) from the CSV file with the exported profile
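
The third stage can be illustrated with Python's csv module. The sketch below groups the values of the parsed column by file, so that files with more than one identification become visible. The sample export is a shortened, made-up stand-in for a DROID export; the column names FILE_PATH and PUID used in it, and the function name, are assumptions.

```python
import csv
import io
from typing import Dict, Set

# Made-up excerpt of an exported profile: one row per format per file.
SAMPLE_EXPORT = '''"FILE_PATH","PUID"
"/sip/METS.xml","fmt/101"
"/sip/data/page1.bin","fmt/101"
"/sip/data/page1.bin","fmt/993"
"/sip/data/unknown.dat",""
'''


def parse_column(csv_text: str,
                 parsed_column: str = "PUID",
                 path_column: str = "FILE_PATH") -> Dict[str, Set[str]]:
    """Map every file path to the set of values found in the parsed column."""
    result: Dict[str, Set[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = result.setdefault(row[path_column], set())
        if row[parsed_column]:  # empty cell: nothing was identified
            values.add(row[parsed_column])
    return result


FORMATS = parse_column(SAMPLE_EXPORT)
AMBIGUOUS = sorted(path for path, puids in FORMATS.items() if len(puids) != 1)
```

In this sample, `/sip/data/page1.bin` has two identifications and `/sip/data/unknown.dat` has none, so both would need the predefined values from the Workflow config.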

Fixity checker

This task verifies the fixity of the files specified in the SIP META XML (e.g. the main METS). Three types of errors can occur during the verification:

  1. a file has an invalid checksum
  2. a file is missing
  3. an unsupported checksum type is specified in the SIP META XML

Any of these three types of errors can be set to be ignored using the Workflow config. The error and its subsequent solution are later written to the ARCLibXml as the respective premis:event. The fixity checker supports two SIP package types: METS and BAGIT.
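
The interplay of the three error types with the Workflow config flags can be sketched like this. The Python code is an illustration only: file contents are passed in memory instead of being read from the workspace, the set of supported algorithms is an assumption, and `check_fixity`, `ExpectedFixity` and `FixityError` are hypothetical names. The three `continueOn...` keys are the ones from the Workflow config.

```python
import hashlib
from typing import List, Mapping, NamedTuple, Optional


class ExpectedFixity(NamedTuple):
    checksum_type: str
    checksum: str


class FixityError(Exception):
    """Raised when an error type is not configured to be ignored."""


SUPPORTED = {"MD5": hashlib.md5, "SHA-512": hashlib.sha512}  # assumed set


def check_fixity(files: Mapping[str, Optional[bytes]],
                 expected: Mapping[str, ExpectedFixity],
                 config: Mapping[str, bool]) -> List[str]:
    """Return the ignored problems; raise FixityError for the others."""
    problems: List[str] = []

    def report(flag: str, message: str) -> None:
        if not config.get(flag, False):
            raise FixityError(message)
        problems.append(message)

    for path, fixity in expected.items():
        content = files.get(path)
        if content is None:
            report("continueOnMissingFiles", f"missing file: {path}")
            continue
        algorithm = SUPPORTED.get(fixity.checksum_type.upper())
        if algorithm is None:
            report("continueOnUnsupportedChecksumType",
                   f"unsupported checksum type {fixity.checksum_type}: {path}")
            continue
        if algorithm(content).hexdigest() != fixity.checksum.lower():
            report("continueOnInvalidChecksums", f"invalid checksum: {path}")
    return problems


FILES = {"a.txt": b"hello", "c.txt": b"data"}
EXPECTED = {
    "a.txt": ExpectedFixity("MD5", hashlib.md5(b"hello").hexdigest()),
    "b.txt": ExpectedFixity("MD5", "00"),
    "c.txt": ExpectedFixity("WHIRLPOOL", "00"),
}
ALL_IGNORED = {"continueOnMissingFiles": True,
               "continueOnUnsupportedChecksumType": True,
               "continueOnInvalidChecksums": True}
```

With every flag set to true the check completes and returns the list of ignored problems; with a flag set to false the first error of that type stops the check.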

Antivirus

This task scans the SIP package for viruses. The type of antivirus software is set in the Workflow config. Currently the only supported tool is ClamAV. If a virus is found, one of the following actions is performed, depending on the Workflow config:

  1. IGNORE: the infected files are ignored and Ingest workflow process continues
  2. QUARANTINE: the infected files are moved to the quarantine and the Ingest workflow process is stopped
  3. CANCEL: the Ingest workflow process is stopped

The error and its subsequent solution are later written to the ARCLibXml as the respective premis:event.
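
The following Python sketch shows one way to turn scanner output into the three actions. It is not ARCLib's antivirus task: the helper names are hypothetical, and the sample output is a made-up excerpt in the `path: signature FOUND` line format that clamscan prints for infected files.

```python
from typing import List, Tuple

# Made-up scanner output used only for the demonstration.
SAMPLE_OUTPUT = """\
/workspace/sip/METS.xml: OK
/workspace/sip/data/eicar.com: Eicar-Test-Signature FOUND
"""


def infected_files(scanner_output: str) -> List[str]:
    """Collect the paths of all lines reported as FOUND."""
    paths = []
    for line in scanner_output.splitlines():
        if line.endswith(" FOUND"):
            path, _, _signature = line.rpartition(": ")
            paths.append(path)
    return paths


def apply_action(action: str, infected: List[str]) -> Tuple[bool, List[str]]:
    """Return (workflow continues, files to move to quarantine)."""
    if not infected or action == "IGNORE":
        return True, []
    if action == "QUARANTINE":
        return False, list(infected)
    if action == "CANCEL":
        return False, []
    raise ValueError(f"unknown infectedSipAction: {action!r}")


INFECTED = infected_files(SAMPLE_OUTPUT)
```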

Validator

This task validates the SIP using the given validation profile. If the validation fails, a corresponding error is thrown with the reason of the failure and the Ingest workflow process is stopped. There are three types of checks in a validation profile:

  1. check for existence of specified files
  2. validation of specified XML files against an XSD schema
  3. checks of the values of nodes specified by XPath in the XML file at a specified file path

After a validation error the ingest workflow is canceled. To proceed, either the validation profile must be changed or an altered SIP package must be ingested.

Error handling tasks

Bpm error handler

Bpm error handler is a task that executes the specified routines (relative to the given Ingest workflow) after an error occurs that cannot be resolved by an altered Workflow config.

Storage error handler

Similar to Bpm error handler, but executed specifically if AIP storage has failed too many times or Archival storage takes too long to process the AIP.

Workflow config

The ingest workflow process can be configured with a JSON config that specifies the parameters for the particular BPM tasks.

Sample JSON config (contains all possible configuration parameters):

    {
      "fixityCheck": [
        {
          "continueOnMissingFiles": true,
          "continueOnUnsupportedChecksumType": true,
          "continueOnInvalidChecksums": true
        }
      ],
      "antivirus": [
        {
          "type": "CLAMAV",
          "cmd": ["clamscan", "-r"],
          "infectedSipAction": "QUARANTINE"
        }
      ],
      "formatIdentification": [
        {
          "type": "DROID",
          "parsedColumn": "PUID",
          "pathsAndFormats": [
            {"filePath": "this/is/a/filepath", "format": "fmt/101"},
            {"filePath": "this/is/another/filepath", "format": "fmt/993"}
          ]
        }
      ]
    }

If the Workflow definition contains more instances of the same task, e.g. multiple Antivirus instances, a separate configuration must be written for every instance. The configurations are mapped to the instances in the order in which they appear in the JSON.

E.g. {"antivirus":[ {"type":"CLAMAV","cmd":["clamscan","-r"],"infectedSipAction":"QUARANTINE"}, {"type":"CLAMAV","cmd":["clamscan","-r"],"infectedSipAction":"IGNORE"}]} means that the first CLAMAV instance in the Workflow definition is configured to QUARANTINE infected files, while the second one is configured to IGNORE them.
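
The positional mapping can be sketched in a few lines of Python. `task_config` is a hypothetical helper, and the zero-based `instance_index` stands for the position of the task instance in the Workflow definition.

```python
import json
from typing import Any, Dict

WORKFLOW_CONFIG = """
{"antivirus": [
  {"type": "CLAMAV", "cmd": ["clamscan", "-r"], "infectedSipAction": "QUARANTINE"},
  {"type": "CLAMAV", "cmd": ["clamscan", "-r"], "infectedSipAction": "IGNORE"}
]}
"""


def task_config(config_json: str, task_key: str, instance_index: int) -> Dict[str, Any]:
    """Return the configuration of the n-th instance of a task (0-based)."""
    instances = json.loads(config_json).get(task_key, [])
    if not 0 <= instance_index < len(instances):
        raise LookupError(
            f"no configuration for instance {instance_index} of {task_key!r}")
    return instances[instance_index]


FIRST = task_config(WORKFLOW_CONFIG, "antivirus", 0)
SECOND = task_config(WORKFLOW_CONFIG, "antivirus", 1)
```

Here the first Antivirus instance receives `QUARANTINE` and the second `IGNORE`; asking for a third instance raises an error because no configuration was written for it.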

Currently three tasks are configurable by the Workflow config: Format identification tool, Fixity checker and Antivirus.

Format identification tool

| JSON path | Description | Supported values |
| --- | --- | --- |
| /formatIdentification/type | type of format identification tool | currently only one supported type: DROID |
| /formatIdentification/parsedColumn | column parsed from the CSV file with results of the identification created by DROID | one of the values specified in cz.cas.lib.arclib.service.formatIdentification.droid.CsvResultColumn |
| /formatIdentification/pathsAndFormats | file paths to files and respective predefined formats | - |
| /formatIdentification/pathsAndFormats/filePath | regex specifying file paths to files | e.g. this/is/a/filepath |
| /formatIdentification/pathsAndFormats/format | PUID of the predefined format | e.g. fmt/993 |

Fixity checker

| JSON path | Description | Supported values |
| --- | --- | --- |
| /fixityCheck/continueOnInvalidChecksums | whether the Ingest workflow process should continue if a file with invalid checksum is found | true / false |
| /fixityCheck/continueOnUnsupportedChecksumType | whether the Ingest workflow process should continue if an unsupported checksum type is found in the SIP META XML | true / false |
| /fixityCheck/continueOnMissingFiles | whether the Ingest workflow process should continue if some file specified in SIP META XML is missing | true / false |

Antivirus

| JSON path | Description | Supported values |
| --- | --- | --- |
| /antivirus/infectedSipAction | action to perform if an infected file is found | one of QUARANTINE, IGNORE, CANCEL |
| /antivirus/type | type of antivirus tool | currently only single one: CLAMAV |
| /antivirus/cmd | antivirus executable, with full path if not in $PATH variable, with switches | e.g. ["clamscan","-r"] |

Error handling

The following types of exceptions are distinguished:

  1. IncidentException: an exception that can be solved by a change of the Workflow config. When thrown, it is caught by CustomIncidentHandler, which creates a new Incident associated with the Ingest workflow; the Incident waits to be resolved by a change of the Workflow config

  2. ConfigParserException: a special type of IncidentException indicating that the Workflow config has a corrupted format and cannot be parsed properly

  3. RuntimeException: exceptions that cannot be solved by a change of the Workflow config, mostly unexpected errors (breakdown of the DB, filesystem etc.) that require the intervention of an administrator

  4. BpmError: a special type of RuntimeException for business errors that cause cancellation of the ingest workflow process. They are thrown from the source code, e.g. to finish the Ingest workflow process when a virus is found

These exceptions are handled differently:

  1. BpmErrorHandlerDelegate: handles BpmErrors, triggers IngestErrorHandler

  2. CustomIncidentHandler: handler for exceptions not caught by BpmErrorHandlerDelegate. It differentiates between reconfigurable errors, which can be solved by a change of the Workflow config (IncidentException and its subclasses), and non-reconfigurable errors. Reconfigurable errors cause the creation of an Incident; non-reconfigurable errors trigger the execution of IngestErrorHandler.

  3. IngestErrorHandler: not triggered by an exception directly but called explicitly from the code. It does the following:

    1. assigns failure info to the Ingest workflow (it can be retrieved later) and sets the Ingest workflow processing state to FAILED
    2. kills the Ingest workflow process
    3. deactivates the AIP update lock

Lifecycle of an Ingest process

Throughout the processing of a SIP, the Ingest workflow transitions between the following states: NEW, PROCESSING, PROCESSED / FAILED, PERSISTED.

| Ingest workflow state | Moment when the Ingest workflow enters the state |
| --- | --- |
| NEW | when processing of SIP located at transfer area is initiated |
| PROCESSING | when a free instance of Worker is assigned to process the SIP |
| FAILED | when an unrecoverable error occurs (set in IngestErrorHandler) |
| PROCESSED | at the end of execution of ARCLibXml generator task, when the ARCLibXml has been stored to index |
| PERSISTED | when the AIP has been persisted to Archival storage |
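
The lifecycle can be modelled as a small state machine. In the Python sketch below the state names come from the table, while the set of allowed transitions is an assumption inferred from the descriptions (for example, that FAILED can be reached from PROCESSING and PROCESSED but not from NEW); ARCLib itself may permit other transitions.

```python
from enum import Enum
from typing import Dict, Set


class IngestWorkflowState(Enum):
    NEW = "NEW"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    PERSISTED = "PERSISTED"
    FAILED = "FAILED"


S = IngestWorkflowState

# Assumed transition table, inferred from the state descriptions.
ALLOWED: Dict[IngestWorkflowState, Set[IngestWorkflowState]] = {
    S.NEW: {S.PROCESSING},
    S.PROCESSING: {S.PROCESSED, S.FAILED},
    S.PROCESSED: {S.PERSISTED, S.FAILED},
    S.PERSISTED: set(),
    S.FAILED: set(),
}


def transition(current: IngestWorkflowState,
               target: IngestWorkflowState) -> IngestWorkflowState:
    """Return the target state or raise if the step is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = S.NEW
for step in (S.PROCESSING, S.PROCESSED, S.PERSISTED):
    state = transition(state, step)
```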