
Ingest workflow

Jan Tomášek edited this page May 18, 2020 · 55 revisions


Introduction

This page describes how a SIP package is transformed into an AIP package by the BPM process called Ingest workflow. The Ingest workflow consists of compulsory and optional BPM tasks that represent the elementary parts of the transformation process. Compulsory tasks are grouped together in the subprocess labeled finalize ingest; the other tasks (from format identification tool to validator) are optional. SIP processing starts in the init event and finishes in the ingest success event on success, or in the ingest error event on failure.

The picture shows the BPM process definition of ingestWorkflow.bpmn opened in Camunda BPMN software.

Preprocessing

Just before the execution of the BPM process starts, the SIP package is preprocessed. This includes the following steps:

    1. verification of the hash of the incoming SIP package against the hash value supplied in the *.sums file
    1. copying of the SIP package content to the workspace
    1. creation of a new authorial package and SIP package, or assignment of existing ones, according to the determined level of versioning
    1. initialization of the Ingest workflow and the BPM process variables

Versioning

ARCLib automatically determines whether SIP or XML versioning needs to be performed. If the extracted authorial id in combination with the producer profile id matches an existing authorial package in the database, versioning is triggered; otherwise a new authorial package is created and no versioning is performed.

a) XML versioning is performed if the SIP with the highest version number among the SIPs belonging to the authorial package has the same checksum as the incoming SIP.

b) SIP versioning is performed if the SIP with the highest version number among the SIPs belonging to the authorial package has a different checksum from the incoming SIP.
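
The decision described above can be sketched as a small function. This is only an illustration: the function name and the returned labels are hypothetical and do not come from the ARCLib source code.

```python
from typing import Optional


def decide_versioning(latest_sip_checksum: Optional[str], incoming_checksum: str) -> str:
    """Return the versioning level for an incoming SIP.

    latest_sip_checksum is the checksum of the highest-version SIP of the
    matching authorial package, or None when no authorial package matched
    the authorial id and producer profile id.
    """
    if latest_sip_checksum is None:
        # No matching authorial package: a new one is created.
        return "NO_VERSIONING"
    if latest_sip_checksum == incoming_checksum:
        # Checksums match: case a) above.
        return "XML_VERSIONING"
    # Checksums differ: case b) above.
    return "SIP_VERSIONING"
```

For example, `decide_versioning("abc", "abc")` yields `"XML_VERSIONING"`, while `decide_versioning("abc", "def")` yields `"SIP_VERSIONING"`.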

BPM tasks

Compulsory tasks

Fixity generator

This task generates three types of fixity for the whole SIP (.zip archive with the SIP content) using algorithms MD5, Sha512 and Crc32. The result is written in ARCLibXml to the premis:objectCharacteristics element of premis:object and the respective event is recorded in the premis:event of type message digest calculation.

This task also generates a Sha512 checksum of every SIP file; the result is written to the mets:fileSec element of ARCLibXml.
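
As a rough illustration of the three fixity types, the following sketch computes them with the Python standard library. The function name and the keys of the returned dictionary are hypothetical; ARCLib itself is not implemented this way, and a real implementation would read a large archive in chunks rather than from memory.

```python
import hashlib
import zlib


def compute_fixities(data: bytes) -> dict:
    """Compute MD5, SHA-512 and CRC32 fixity values as lowercase hex strings."""
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA512": hashlib.sha512(data).hexdigest(),
        # zlib.crc32 returns an unsigned 32-bit integer; format it as 8 hex digits.
        "CRC32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    }
```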

ArclibXML extractor

This task extracts metadata from the original SIP using the XSLT specified in a SIP profile.

From the XML files of the SIP package, ARCLib extracts the specified metadata and produces the primary version of ARCLibXml. The extraction is defined by an XSLT template that is stored in the SIP profile together with the path to the SIP META XML of the SIP (e.g. the main METS file).

The sample XSLT template comprehensiveSipProfile.xsl provides the following mapping (it can be used as the starting point when defining custom templates).

Identity mapping:

| XPath | Value |
| --- | --- |
| /METS:mets/@LABEL | /METS:mets/@LABEL |
| /METS:mets/@TYPE | /METS:mets/@TYPE |
| /METS:mets/METS:metsHdr/METS:agent | /METS:mets/METS:metsHdr/METS:agent |
| /METS:mets/METS:dmdSec | /METS:mets/METS:dmdSec |

Aggregated mapping:

The elements ARCLIB:formats, ARCLIB:devices, ARCLIB:eventAgents, ARCLIB:ImageCaptureMetadata and ARCLIB:creatingApplications are computed using aggregation. The source values are extracted from files located in the folder amdSec with filenames matching the pattern amd*.xml. For details see the template.

ArclibXML generator

This task generates additional parts of ARCLibXML using the extracted metadata and the values computed during the ingest workflow process. It appends these parts to the extracted metadata from the task ARCLibXML extractor.

The ArclibXML generation consists of the following phases:

  1. if necessary, changing of the mets namespace prefix to upper case METS
  2. adding METS:OBJID
  3. filling METS:metsHdr (the element METS:metsHdr must already exist in the XML created during the metadata extraction using XSLT in the first phase)
  4. adding SIP and XML versions and related SIP and XML
  5. adding premis:agents and respective premis:events
  6. adding premis:object for whole package
  7. adding METS:fileSec
  8. adding METS:structMap

The generated ARCLibXML is validated using ArclibXmlValidator. The validation process consists of six parts:

  1. XML schema validation - METS, ARCLIB_XSD, PREMIS
  2. Checking existence of required nodes - see this link for the most recent config (the config is also part of the source code as arclibXmlDefinition.csv, but that copy may not be the most recent)
  3. Checking content of node with AIP id - checking the consistency with the value in the entity saved in the database
  4. Checking content of node with authorial id - checking the consistency with the value in the entity saved in the database
  5. Checking content of node with physicalVersionNumber - checking the consistency with the value in the entity saved in the database
  6. Checking content of node with physicalVersionOf - checking the consistency with the value in the entity saved in the database

In case of success, the ARCLibXml is stored in the index with the state PROCESSED.

Archival storage

This task stores the AIP to archival storage. The archival storage is accessed through a REST interface; depending on the level of versioning of the SIP, either the endpoint for XML update or the endpoint for storing an AIP is called. If the debugging mode is active, an internal debugging version of archival storage is used instead of the real archival storage. The Workflow definition file specifies a failed job retry time cycle on this task so that ARCLib repeatedly tries to store the AIP package to archival storage. This handles temporary breakdowns of the connection to archival storage. The retry time cycle is specified in a format such as R5/PT1M, where the part before the slash denotes the number of retries and the part after the slash states the time to wait between retries.
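
To make the retry time cycle format concrete, here is a minimal parser sketch. It is not part of ARCLib or Camunda; the function name is hypothetical, and it deliberately handles only the simple time-only form (hours, minutes, seconds) used in the example R5/PT1M.

```python
import re
from datetime import timedelta

# R<count>/PT[<n>H][<n>M][<n>S], e.g. R5/PT1M (simplified subset of ISO 8601).
_CYCLE = re.compile(r"^R(\d+)/PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_retry_cycle(cycle: str):
    """Return (number_of_retries, wait_between_retries) for a retry time cycle."""
    match = _CYCLE.match(cycle)
    # Reject strings that do not match or that carry no duration component.
    if match is None or not any(match.group(i) for i in (2, 3, 4)):
        raise ValueError(f"unsupported retry time cycle: {cycle!r}")
    retries = int(match.group(1))
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups()[1:])
    return retries, timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

With this sketch, `parse_retry_cycle("R5/PT1M")` describes five retries with a one-minute wait between them.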

Storage success verifier

This task verifies that archival storage has succeeded in persisting the SIP (or in updating the XML, respectively). Archival storage is asked for the state of the AIP. In case the state is:

a) ARCHIVED:

  1. ingest workflow state and indexed ARCLibXml document state are set to PERSISTED (ingest workflow states are further described at the end of this page, ARCLibXml document states are described in the wiki page Usage/Aip search)
  2. JMS message is sent to Coordinator to inform the batch that the ingest workflow process has finished
  3. SIP content is deleted from workspace
  4. SIP content is deleted from transfer area

b) PROCESSING or PRE_PROCESSING: this state indicates that the AIP has been successfully transferred to Archival storage, but Archival storage has not yet saved the AIP to all the storage services (ZFS, CEPH etc.). The BPM variable aipSavedCheckRetries is decremented and the Storage success verifier task is repeated after waiting for the time specified in the variable aipSavedCheckTimeout.

c) ARCHIVAL_FAILURE or ROLLED_BACK: this state indicates that the AIP has not been successfully transferred to Archival storage. The BPM variable aipStoreRetries is decremented and the BPM process execution returns to the Archival storage task to repeat the AIP storage after waiting for the time specified in the variable aipStoreTimeout.
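
The branching on the AIP state can be sketched as follows. The function name and the returned task labels are hypothetical. The sketch assumes that the Storage error handler takes over once the respective retry counter is exhausted; the exact threshold used by ARCLib is not stated on this page. Waiting for the timeouts is not modeled.

```python
def next_step(aip_state: str, saved_check_retries: int, store_retries: int):
    """Return (next_task, saved_check_retries, store_retries) for an AIP state.

    The two counters stand for the BPM variables aipSavedCheckRetries and
    aipStoreRetries.
    """
    if aip_state == "ARCHIVED":
        return ("FINISH", saved_check_retries, store_retries)
    if aip_state in ("PROCESSING", "PRE_PROCESSING"):
        if saved_check_retries <= 0:
            # Assumption: no checks left, so the error handler takes over.
            return ("STORAGE_ERROR_HANDLER", saved_check_retries, store_retries)
        return ("STORAGE_SUCCESS_VERIFIER", saved_check_retries - 1, store_retries)
    if aip_state in ("ARCHIVAL_FAILURE", "ROLLED_BACK"):
        if store_retries <= 0:
            # Assumption: no store attempts left, so the error handler takes over.
            return ("STORAGE_ERROR_HANDLER", saved_check_retries, store_retries)
        return ("ARCHIVAL_STORAGE", saved_check_retries, store_retries - 1)
    raise ValueError(f"unknown AIP state: {aip_state}")
```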

Non compulsory tasks

Each of these tasks is optional. However, if a task is omitted, the result it generates must not be part of the existence check done by the AIP XML generator. For example, the output of the format identifier is currently required by arclibXmlDefinition.csv (see AIP XML generator).

A non-compulsory task may be placed at an arbitrary place within the Ingest workflow pipeline. Placing the format identifier at the front is preferred, because if an ingest issue related to a particular file occurs in a subsequent task, the identified format will be linked with the issue.

Format identification tool

This task performs the format analysis of the files in the SIP. For every file of the SIP it determines a single file format. The result is written in ARCLibXml to the element ARCLIB:formats as the aggregated formats, and the tool used in the identification is written as premis:agent (tool name, tool version). In the case of DROID, the version also contains the signature files. Moreover, the associated events are written as premis:events. This includes the event of a successful identification and also the events when the identification tool failed to uniquely determine the format.

In some cases the identification tool identifies a file with multiple formats. The format ambiguity can be resolved with the predefined values specified in the Workflow config. If the predefined values resolve the ambiguity, the config used is written to the respective premis:event created specially for the ambiguity problem. If there are any files whose format cannot be determined by the config, a new incident is created that contains information about the problematic files. ARCLib then waits for the user to provide a new config that manually resolves the format conflict.
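
A minimal sketch of this resolution step is shown below. The function name is hypothetical. It assumes that the filePath regex must match the whole file path and that the first matching entry wins; neither detail is confirmed by this page. A return value of None stands for the case in which an incident would be created.

```python
import re
from typing import Optional


def resolve_format(file_path: str, candidate_puids: list, paths_and_formats: dict) -> Optional[str]:
    """Return a single PUID for the file, or None if the ambiguity remains.

    paths_and_formats has the shape of the pathsAndFormats object from the
    Workflow config: {"0": {"filePath": <regex>, "format": <PUID>}, ...}.
    """
    if len(candidate_puids) == 1:
        # The identification tool was unambiguous.
        return candidate_puids[0]
    for entry in paths_and_formats.values():
        # Assumption: the regex has to match the complete path.
        if re.fullmatch(entry["filePath"], file_path):
            return entry["format"]
    return None
```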

The type of format identification tool is initialized based on the Workflow config. Currently there is only a single format identification tool: DROID.

Format identification with DROID consists of these stages:

  1. run profile: DROID is run with the switch -R to search recursively, -a to create a profile from the SIP, and -p to save the result to the specified file

  2. export profile: DROID is called to export the result of the specified profile to a CSV file with one row for each format for each file profiled (if a file has multiple format identifications, then a separate row will be written out for each identification made)

  3. parse CSV: from the CSV file with the exported profile ARCLib parses out the PUID values
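
The last stage can be illustrated with a short parsing sketch. The function name is hypothetical, and the column names FILE_PATH and PUID are an assumption about the header of the exported CSV; check them against an actual DROID export.

```python
import csv
import io
from collections import defaultdict


def puids_by_file(csv_text: str) -> dict:
    """Group the PUID values of an exported profile by file path.

    Files with more than one PUID are the ambiguous ones.
    """
    result = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows without a PUID (e.g. folders) are skipped.
        if row.get("PUID"):
            result[row["FILE_PATH"]].append(row["PUID"])
    return dict(result)
```

Files for which the returned list holds more than one PUID are the ones that need the ambiguity resolution from the Workflow config.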

Fixity checker

This task verifies fixity of files specified in SIP META XML (e.g. main METS). There are three types of errors that can occur during the verification:

  1. some file has invalid checksum
  2. some file is missing
  3. there is an unsupported checksum type specified in the SIP META XML

Any of these three types of errors can be set to be ignored using the Workflow config. The error and its subsequent resolution are later written to the ARCLibXml as a respective premis:event. The fixity checker supports two SIP package types: METS and BAGIT.
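
The interplay between the three error types and the Workflow config flags can be sketched as follows. The function names, the error labels and the set of supported algorithms are hypothetical; only the three config keys are taken from the sample Workflow config on this page.

```python
import hashlib

# Maps each error type to the Workflow config flag that allows continuing.
CONTINUE_FLAGS = {
    "MISSING_FILE": "continueOnMissingFiles",
    "UNSUPPORTED_CHECKSUM_TYPE": "continueOnUnsupportedChecksumType",
    "INVALID_CHECKSUM": "continueOnInvalidChecksums",
}

# Assumption: the algorithms supported here are chosen for illustration only.
SUPPORTED_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}


def check_file(content, checksum_type: str, expected_checksum: str):
    """Return an error label for one file, or None if its fixity is valid.

    content is the file content as bytes, or None if the file is missing.
    """
    if content is None:
        return "MISSING_FILE"
    algorithm = checksum_type.lower().replace("-", "")
    if algorithm not in SUPPORTED_ALGORITHMS:
        return "UNSUPPORTED_CHECKSUM_TYPE"
    actual = hashlib.new(algorithm, content).hexdigest()
    if actual != expected_checksum.lower():
        return "INVALID_CHECKSUM"
    return None


def may_continue(errors: list, fixity_config: dict) -> bool:
    """True if every error that occurred is set to be ignored in the config."""
    return all(fixity_config.get(CONTINUE_FLAGS[e], False) for e in errors)
```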

Antivirus

Scans the SIP package for viruses. The type of antivirus software is initialized from the Workflow config. Currently the only supported tool is ClamAV. If a virus is found, one of the following actions is performed, depending on the configuration in the Workflow config:

  1. IGNORE: the infected files are ignored and Ingest workflow process continues
  2. QUARANTINE: the infected files are moved to the quarantine and the Ingest workflow process is stopped
  3. CANCEL: the Ingest workflow process is stopped

The error and its subsequent resolution are later written to the ARCLibXml as a respective premis:event.
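
The three actions can be summarized in a small dispatch sketch. The function name and the shape of the return value are hypothetical and serve only to show how the infectedSipAction value steers the workflow.

```python
def handle_infection(action: str, infected_files: list):
    """Return (continue_workflow, files_to_quarantine) for an infected SIP."""
    if action == "IGNORE":
        # Infected files are ignored and the workflow continues.
        return True, []
    if action == "QUARANTINE":
        # Infected files are moved to the quarantine and the workflow stops.
        return False, list(infected_files)
    if action == "CANCEL":
        # The workflow stops without moving any files.
        return False, []
    raise ValueError(f"unknown infectedSipAction: {action}")
```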

Validator

Validates the SIP using the given validation profile. If the validation fails, a corresponding error is thrown with the reason of the validation failure and the Ingest workflow process is stopped. There are three types of checks in a validation profile:

  1. check for existence of specified files
  2. validation against XSD schema of specified XML files
  3. checks of values of some nodes specified by XPath in the XML files on a specified file path

In case of a validation error the ingest workflow is canceled. Either the validation profile must be changed or an altered SIP package must be ingested.

Error handling tasks

Bpm error handler

Bpm error handler is a task that executes the specified routines (relative to the given Ingest workflow) after an error occurs that cannot be resolved using an altered Workflow config.

Storage error handler

Similar to Bpm error handler, but executed specifically if the AIP storage has failed too many times or Archival storage takes too long to process the AIP.

Workflow config

The ingest workflow process can be configured with a JSON config that specifies the parameters for the particular BPM tasks.

Sample JSON config (contains all possible configuration parameters):

{
  "fixityCheck": {
    "0": {
      "continueOnMissingFiles": true,
      "continueOnUnsupportedChecksumType": true,
      "continueOnInvalidChecksums": true
    }
  },
  "antivirus": {
    "0": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "QUARANTINE"
    }
  },
  "formatIdentification": {
    "0": {
      "type": "DROID",
      "pathsAndFormats": {
        "0": {
          "filePath": "this/is/a/filepath",
          "format": "fmt/101"
        },
        "1": {
          "filePath": "this/is/another/filepath",
          "format": "fmt/993"
        }
      }
    }
  }
}

Workflow config must not contain JSON arrays (no current task supports them, and no new task should support them in the future, because the config merge feature does not support arrays). If a list of values needs to be passed, a special type of object should be used: the object keys are the ordinal numbers of the list items and the values are the list items. The creator of the config must ensure that the keys are sorted in the JSON by the ordinal number, otherwise the system may silently shuffle the order and behave incorrectly. Valid object list: {"0":"first item","1":"other item"}, invalid object list: {"1":"other item","0":"first item"}.
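
The object list convention can be illustrated with a few helper functions. The names are hypothetical, and the sketch assumes that the ordinal keys start at 0 and are contiguous, as in the examples above.

```python
import json


def to_object_list(items: list) -> dict:
    """Convert a list into an object list keyed by ordinal numbers."""
    return {str(i): item for i, item in enumerate(items)}


def is_valid_object_list(obj: dict) -> bool:
    """Check that the keys are "0", "1", ... in exactly this order."""
    return list(obj.keys()) == [str(i) for i in range(len(obj))]


def from_object_list(obj: dict) -> list:
    """Convert a valid object list back into a list."""
    if not is_valid_object_list(obj):
        raise ValueError("keys must be ordinal numbers in ascending order")
    return list(obj.values())
```

Because json.loads keeps the key order of the document, the two examples from the text can be checked directly: the first one passes is_valid_object_list, the second one does not.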

If there are multiple instances of the same task in the Workflow definition, e.g. multiple Antivirus instances, a separate configuration must be written for every instance. The configurations are mapped to the instances in the order in which they appear in the JSON.

For example, the following config means that the first CLAMAV instance in the Workflow definition is configured to QUARANTINE infected files, while the second one is configured to IGNORE them.

{
  "antivirus": {
    "0": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "QUARANTINE"
    },
    "1": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "IGNORE"
    }
  }
}
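
The mapping of configurations to task instances can be sketched as a lookup by position. The function name is hypothetical, and the sketch relies on the JSON parser preserving the key order of the document.

```python
import json


def config_for_instance(workflow_config: dict, task_key: str, instance_index: int) -> dict:
    """Return the config of the n-th instance of a task (0-based).

    Instances are matched by the order of their entries in the JSON.
    """
    instances = list(workflow_config[task_key].values())
    return instances[instance_index]


workflow_config = json.loads("""
{
  "antivirus": {
    "0": {"type": "CLAMAV", "cmd": {"0": "clamscan", "1": "-r"}, "infectedSipAction": "QUARANTINE"},
    "1": {"type": "CLAMAV", "cmd": {"0": "clamscan", "1": "-r"}, "infectedSipAction": "IGNORE"}
  }
}
""")
second_action = config_for_instance(workflow_config, "antivirus", 1)["infectedSipAction"]
```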

Currently there are three tasks that are configurable by the Workflow config: Format identification tool, Fixity checker and Antivirus.

Format identification tool

| JSON path | Description | Supported values |
| --- | --- | --- |
| /formatIdentification/type | type of format identification tool | currently only one supported type: DROID |
| /formatIdentification/pathsAndFormats | file paths to files and respective predefined formats | - |
| /formatIdentification/pathsAndFormats/filePath | regex specifying file paths to files | e.g. this/is/a/filepath |
| /formatIdentification/pathsAndFormats/format | PUID of the predefined format | e.g. fmt/993 |

Fixity checker

| JSON path | Description | Supported values |
| --- | --- | --- |
| /fixityCheck/continueOnInvalidChecksums | whether the Ingest workflow process should continue if a file with invalid checksum is found | true / false |
| /fixityCheck/continueOnUnsupportedChecksumType | whether the Ingest workflow process should continue if an unsupported checksum type is found in the SIP METADATA file | true / false |
| /fixityCheck/continueOnMissingFiles | whether the Ingest workflow process should continue if some file specified in SIP META XML is missing | true / false |

Antivirus

| JSON path | Description | Supported values |
| --- | --- | --- |
| /antivirus/infectedSipAction | action to perform if an infected file is found | one of QUARANTINE, IGNORE, CANCEL |
| /antivirus/type | type of antivirus tool | currently only single one: CLAMAV |
| /antivirus/cmd | antivirus executable, with full path if not in $PATH variable, with switches | - |

Error handling

There are the following types of exceptions:

  1. IncidentException: an exception that can be resolved by a change of the Workflow config. If thrown, it is caught by CustomIncidentHandler, which creates a new Incident associated with the Ingest workflow; the Incident waits to be resolved by a change of the Workflow config

  2. ConfigParserException: a special type of IncidentException that indicates that the Workflow config has a corrupted format and cannot be parsed properly

  3. RuntimeException: exceptions that cannot be resolved by a change of the Workflow config, mostly unexpected errors (breakdown of DB, filesystem etc.) that require the intervention of an administrator

  4. BpmError: a special type of RuntimeException for business errors that cause a cancellation of the ingest workflow process. They are thrown from the source code, e.g. when the Ingest workflow process should be finished because a virus was found

These exceptions are handled differently:

  1. BpmErrorHandlerDelegate: handles BpmErrors, triggers IngestErrorHandler

  2. CustomIncidentHandler: handler for exceptions not caught by BpmErrorHandlerDelegate. It differentiates between reconfigurable errors (those that can be resolved by the Workflow config, i.e. IncidentException and its subclasses) and non-reconfigurable errors. Reconfigurable errors cause the creation of an Incident, and non-reconfigurable errors trigger the execution of IngestErrorHandler.

  3. IngestErrorHandler: not triggered by any exception directly, but called explicitly from the code; it does the following:

    1. assigns failure info to Ingest workflow (that can be later retrieved) and sets Ingest workflow processing state to FAILED
    2. kills Ingest workflow process
    3. deactivates AIP update lock
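
The routing of exceptions to handlers can be sketched as follows. This is a Python illustration of classes that exist in a Java code base: the class bodies are empty stand-ins, Python's RuntimeError stands in for Java's RuntimeException, and the function route with its returned labels is hypothetical.

```python
class IncidentException(Exception):
    """Resolvable by a change of the Workflow config."""


class ConfigParserException(IncidentException):
    """The Workflow config is corrupted and cannot be parsed."""


class BpmError(RuntimeError):
    """Business error that cancels the ingest workflow process."""


def route(exc: Exception) -> list:
    """Return the chain of handlers and outcomes for an exception."""
    if isinstance(exc, BpmError):
        return ["BpmErrorHandlerDelegate", "IngestErrorHandler"]
    if isinstance(exc, IncidentException):
        # Reconfigurable error: an Incident is created and awaits a new config.
        return ["CustomIncidentHandler", "Incident"]
    # Non-reconfigurable error, e.g. a breakdown of the DB or filesystem.
    return ["CustomIncidentHandler", "IngestErrorHandler"]
```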

Lifecycle of an Ingest process

Throughout the processing of a SIP, the Ingest workflow transitions between the following states: NEW, PROCESSING, PROCESSED / FAILED, PERSISTED.

| Ingest workflow state | Moment when IW comes to the state |
| --- | --- |
| NEW | when processing of SIP located at transfer area is initiated |
| PROCESSING | when a free instance of Worker is assigned to process the SIP |
| FAILED | when an unrecoverable error occurs (set in IngestErrorHandler) |
| PROCESSED | at the end of execution of ARCLibXml generator task, when the ARCLibXml has been stored to index |
| PERSISTED | when the AIP has been persisted to Archival storage |
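
The lifecycle can be pictured as a small state machine. The sketch below is an interpretation, not ARCLib code: the enum and function names are hypothetical, and the set of allowed transitions is an assumption inferred from the state descriptions, in particular that FAILED can be reached from both PROCESSING and PROCESSED.

```python
from enum import Enum


class IngestWorkflowState(Enum):
    NEW = "NEW"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"
    PERSISTED = "PERSISTED"


S = IngestWorkflowState

# Assumed transitions, inferred from the descriptions of the states.
ALLOWED_TRANSITIONS = {
    S.NEW: {S.PROCESSING},
    S.PROCESSING: {S.PROCESSED, S.FAILED},
    S.PROCESSED: {S.PERSISTED, S.FAILED},
    S.FAILED: set(),
    S.PERSISTED: set(),
}


def can_transition(current: IngestWorkflowState, target: IngestWorkflowState) -> bool:
    """True if the assumed lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```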