Ingest workflow

Jan Tomášek edited this page Jul 26, 2024 · 54 revisions

Introduction

This page describes how a SIP package is transformed into an AIP package by the BPM process called Ingest workflow. Ingest workflow consists of mandatory and optional BPM tasks that represent the elementary parts of the transformation process. All tasks grouped together in the subprocess labeled finalize ingest and format identifier are mandatory; the other tasks (from duplicate SIP check to validator) are optional. SIP processing starts in the init event and finishes in the ingest success event on success or in the ingest error event on failure.

The picture shows the BPM process definition of ingestWorkflow.bpmn opened in Camunda BPMN software.

Preprocessing

Just before the execution of the BPM process starts, the SIP package is preprocessed. This includes the following steps:

    1. verification of the hash of the incoming SIP package against the hash value supplied in the *.sums file
    2. copying of the SIP package content to the workspace
    3. creation of a new authorial package and SIP package, or assignment of existing ones, according to the determined level of versioning
    4. initialization of the Ingest workflow and the BPM process variables

Versioning

There are two levels of versioning and two ways of resolving the linkage between related AIPs. ARCLib automatically determines the linkage and the versioning type.

XML versioning

This versioning is performed if implicit linkage is applied and the SIP with the highest version number among the SIPs belonging to the same authorial package has the same checksum as the incoming SIP.

The SIP is ingested to produce new AIP XML which is stored to the Archival Storage next to the previous SIP. The minor version is incremented.

SIP versioning

This versioning is performed if explicit linkage is applied, or if implicit linkage is applied and the SIP with the highest version number among the SIPs belonging to the authorial package has a different checksum than the incoming SIP.

The SIP is ingested to produce new AIP XML and is stored together with the AIP XML to the Archival Storage. The major version is incremented.

If the JSON config of the ingest workflow has the deletePreviousSipVersion option set to true, a deletion request for the previous SIP version is automatically created at the end of the ingest. If the option is not set in the JSON config, it defaults to false.

Explicit linkage

Explicit linkage may apply if the ARCLib_export_info.csv file is present in the SIP root. The file is parsed and deleted during preprocessing, and if it contains the authorial_package_uuid property, the authorial package with the corresponding UUID is linked.

  • if the file is found but property is not included, implicit linkage is applied
  • if the property is included but the authorial package with provided uuid is not found, process fails
  • explicit linkage always results in SIP versioning

Even if explicit linkage is applied, the incoming SIP is still scanned for the authorial ID. If the authorial ID is found on the configured path and it differs from the one recorded in the DB, the authorial ID of the linked authorial package is updated at the end of the ingest.

Implicit linkage

Implicit linkage is applied if the explicit linkage is not applied.

If the extracted authorial ID in combination with the producer ID matches an existing authorial package in the database, SIP or XML versioning is triggered; otherwise a new authorial package is created and no versioning is performed.

See Path to XML file with authorial ID and XPath to node with authorial ID at Usage@Sip profiles
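
The versioning rules described in this section can be summarised in a short sketch. The enum and function names below are hypothetical and are not taken from the ARCLib code base; the sketch only mirrors the rules stated above and does not model the failure case where an explicitly linked authorial package is not found.

```python
from enum import Enum
from typing import Optional


class Versioning(Enum):
    NONE = "no versioning, new authorial package is created"
    XML = "XML versioning, minor version is incremented"
    SIP = "SIP versioning, major version is incremented"


def decide_versioning(
    explicit_linkage: bool,
    latest_sip_checksum: Optional[str],
    incoming_checksum: str,
) -> Versioning:
    """Mirror the versioning rules from the text.

    latest_sip_checksum is the checksum of the highest-version SIP of the
    matched authorial package, or None when no authorial package matched.
    """
    if explicit_linkage:
        # Explicit linkage always results in SIP versioning.
        return Versioning.SIP
    if latest_sip_checksum is None:
        # Implicit linkage found no existing authorial package.
        return Versioning.NONE
    if latest_sip_checksum == incoming_checksum:
        # Same content as the latest SIP: only a new AIP XML is produced.
        return Versioning.XML
    return Versioning.SIP
```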

BPM tasks

Compulsory tasks

Fixity generator

This task generates three types of fixity for the whole SIP (.zip archive with the SIP content) using algorithms MD5, Sha512 and Crc32. The result is written in ARCLibXml to the premis:objectCharacteristics element of premis:object and the respective event is recorded in the premis:event of type message digest calculation.

This task also generates the Sha512 of every SIP file; the result is written to the METS:fileSec element of ARCLibXml.
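
As an illustration of the three digest types, the following minimal sketch computes MD5, SHA-512 and CRC32 of a single file in one pass using only the Python standard library. The function name and the returned dictionary keys are hypothetical; ARCLib itself is not written in Python and its actual implementation may differ.

```python
import hashlib
import zlib
from pathlib import Path


def compute_fixities(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return MD5, SHA-512 and CRC32 digests of a file as hex strings."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    crc = 0
    with path.open("rb") as handle:
        # Read in chunks so that a large SIP archive need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {
        "MD5": md5.hexdigest(),
        "SHA-512": sha512.hexdigest(),
        "CRC32": format(crc & 0xFFFFFFFF, "08x"),
    }
```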

ARCLibXML extractor

This task extracts metadata from original SIP using XSLT specified in a SIP profile.

From the XML files of the SIP package, ARCLib extracts the specified metadata and produces the primary version of ARCLibXml. The extraction is defined by an XSLT template that is stored in the SIP profile together with the path to the SIP META XML of the SIP (e.g. main METS file).

The sample XSLT template comprehensiveSipProfile.xsl provides the following mapping (it can be used as the starting point when defining custom templates).

Identity mapping:

| XPath | Value |
| --- | --- |
| /METS:mets/@LABEL | /METS:mets/@LABEL |
| /METS:mets/@TYPE | /METS:mets/@TYPE |
| /METS:mets/METS:metsHdr/METS:agent | /METS:mets/METS:metsHdr/METS:agent |
| /METS:mets/METS:dmdSec | /METS:mets/METS:dmdSec |

Aggregated mapping:

Elements ARCLIB:formats, ARCLIB:devices, ARCLIB:eventAgents, ARCLIB:ImageCaptureMetadata and ARCLIB:creatingApplications are computed using aggregation. The source values are extracted from files located in folder amdSec with filenames matching regex amd*.xml. For the details see the template.

System validates the result of the extraction against XSDs (METS, PREMIS, ARCLIB and other XSDs referenced from the resulting XML).

The system also checks the existence of required nodes documented in arclibXmlSystemWideValidationConfig.csv (those where Data source starts with SIP profile XSLT). If a node is missing and no JSON config at systemWideValidation declares how to handle this, an incident is raised.

ARCLibXML generator

This task generates additional parts of ARCLibXML using the extracted metadata and the values computed during the ingest workflow process. It appends these parts to the extracted metadata from the task ARCLibXML extractor.

The ARCLibXML generation consists of the following phases:

  1. if necessary, changing of the mets namespace prefix to upper case METS
  2. adding METS:OBJID
  3. filling METS:metsHdr, the element METS:metsHdr must exist in the XML created during the metadata extraction using XSLT (in the first phase)
  4. adding SIP and XML versions and related SIP and XML
  5. adding premis:agents and respective premis:events
  6. adding premis:object for whole package
  7. adding METS:fileSec
  8. adding METS:structMap

The generated ARCLibXML is validated using ARCLibXmlValidator. Validation process consists of three parts:

  1. XML schema validation - METS, ARCLIB_XSD, PREMIS
  2. Checking existence of required nodes - checks the existence of required nodes documented in arclibXmlSystemWideValidationConfig.csv (those where Data source does not start with SIP profile XSLT). If a node is missing and no JSON config at systemWideValidation declares how to handle this, an incident is raised.
  3. Checking content of some nodes against database - checks the consistency of some XML nodes against the database (AIP ID, XML ID, profiles, timestamps, related packages)

Archival storage

This task stores the AIP to archival storage. The archival storage is accessed through a REST interface; depending on the level of versioning of the SIP, either the endpoint for XML update or the endpoint for storing an AIP is called. When debug mode is active, an internal debugging version of archival storage is used instead of the real archival storage. If the Archival Storage is unreachable or returns a non-standard result, ARCLib repeats the task. The number of attempts and the interval between attempts are configurable.

Storage success verifier

This task verifies that archival storage has succeeded in persisting the SIP (or in updating the XML, respectively). Archival storage is asked for the state of the AIP. In case the state is:

a) ARCHIVED:

  1. ingest workflow state is set to PERSISTED and ARCLib XML document is indexed with ARCHIVED state (ingest workflow states are further described at the end of this page, ARCLibXml document states are described in WIKI page Usage/Aip search)
  2. JMS message is sent to Coordinator to inform the batch that the ingest workflow process has finished
  3. SIP content is deleted from workspace
  4. SIP content is deleted from transfer area

b) PROCESSING or PRE_PROCESSING: These states indicate that the AIP has been successfully transferred to Archival Storage, but Archival Storage has not yet saved the AIP to all the storage services (ZFS, CEPH etc.). The BPM variable aipSavedCheckAttempts is decremented and the Storage success verifier task is repeated after waiting for the time specified in the variable aipSavedCheckAttemptsInterval.

c) ARCHIVAL_FAILURE, ROLLED_BACK or ROLLBACK_FAILURE: These states indicate that the AIP has been transferred to Archival Storage but the storage failed afterwards (e.g. one of the logical storages has failed). The BPM variable aipStoreAttempts is decremented and the BPM process execution returns to the Archival storage task to repeat the AIP storage after waiting for the time specified in the variable aipStoreAttemptsInterval.
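
The branching on the AIP state can be pictured as a small decision function. This is a sketch under stated assumptions, not the actual BPM definition: the enum, the function name and the "attempts left > 0" threshold are hypothetical, and handing over to the Storage error handler once attempts are exhausted is inferred from the description of that handler later on this page.

```python
from enum import Enum


class NextStep(Enum):
    FINISH = "set PERSISTED, notify Coordinator, clean up workspace and transfer area"
    RECHECK = "wait, then repeat Storage success verifier"
    STORE_AGAIN = "wait, then return to Archival storage task"
    STORAGE_ERROR = "hand over to Storage error handler"


def next_step(aip_state: str, check_attempts_left: int, store_attempts_left: int) -> NextStep:
    """Choose the next step from the AIP state reported by Archival Storage."""
    if aip_state == "ARCHIVED":
        return NextStep.FINISH
    if aip_state in ("PROCESSING", "PRE_PROCESSING"):
        # Assumption: give up once aipSavedCheckAttempts is used up.
        return NextStep.RECHECK if check_attempts_left > 0 else NextStep.STORAGE_ERROR
    if aip_state in ("ARCHIVAL_FAILURE", "ROLLED_BACK", "ROLLBACK_FAILURE"):
        # Assumption: give up once aipStoreAttempts is used up.
        return NextStep.STORE_AGAIN if store_attempts_left > 0 else NextStep.STORAGE_ERROR
    raise ValueError(f"unexpected AIP state: {aip_state}")
```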

Non compulsory tasks

Non-compulsory tasks may be placed at an arbitrary place within the Ingest workflow pipeline before the finalize ingest group.

Duplicate SIP check

If automatic XML versioning is not suitable, this task may block ingest workflow (let it fail) instead of creating new XML version of the package.

SIP merger

If SIP versioning is applied, this task downloads the previous version of the package from the Archival Storage to the workspace and overwrites its data with the data of the incoming SIP, or changes its content according to the provided JSON config.

  • if the incoming SIP does not contain a file contained in the previous version, the file from the previous version is kept
  • if the incoming SIP contains a file not contained in the previous version, the file from the incoming SIP is added
  • if both the incoming SIP and the previous version contain a file, the file from the incoming SIP overwrites the file from the previous version

If SIP versioning is not applied, this task passes with no action.
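
The three merge rules amount to a dictionary update keyed by relative file path. The sketch below uses a hypothetical representation (a mapping from relative path to file content) purely to illustrate the rules; it does not touch the file system and is not ARCLib's implementation.

```python
from typing import Dict


def merge_sip_versions(previous: Dict[str, bytes], incoming: Dict[str, bytes]) -> Dict[str, bytes]:
    """Merge two SIP file listings keyed by relative path."""
    merged = dict(previous)  # files only in the previous version are kept
    merged.update(incoming)  # new files are added, shared files are overwritten
    return merged
```

For example, merging a previous version containing `a.xml` and `b.xml` with an incoming SIP containing `b.xml` and `c.xml` yields `a.xml` from the previous version and `b.xml` and `c.xml` from the incoming SIP.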

JSON config allows modification of the older SIP version before it is merged with the new incoming SIP, see JSON config doc.

Format identifier

Even though Format identifier is not compulsory, the output of the format identifier is by default required in the system-wide validation done at ARCLibXML generator, see arclibXmlSystemWideValidationConfig.csv. Even though this validation may be skipped by JSON config, the methodological recommendation is to include this task in your workflow.

This task performs the format analysis of files in the SIP. For every file of the SIP it uniquely determines a single file format. The result is written in ARCLibXml to the element ARCLIB:formats as the aggregated formats, and the tool used in the identification is written as premis:agent (tool name, tool version). In the case of DROID, the version also contains the signature files. Moreover, the associated events are written as premis:events. This includes the event of a successful identification and also the events when the identification tool failed to uniquely determine the format.

In some cases the identification tool identifies a file with multiple formats. The format ambiguity can be resolved with predefined values specified in the Workflow config. If the predefined values resolve the ambiguity, the config used is written to the respective premis:event created specifically for the ambiguity problem. If there are any files whose format cannot be determined with the config, a new incident is created that contains information about the problematic files. ARCLib then waits for the user to provide a new config that manually resolves the format conflict.

The type of format identification tool is initialized based on the Workflow config. Currently there is only single format identification tool: DROID.

Format identification with DROID consists of these stages:

  1. run profile: DROID is called with the switches -R to perform a recursive search, -a to create a profile from the SIP, and -p to save the result to the specified file

  2. export profile: DROID is called to export the result of the specified profile to a CSV file with one row for each format for each file profiled (if a file has multiple format identifications, then a separate row will be written out for each identification made)

  3. parse CSV: from the CSV file with the exported profile ARCLib parses out the PUID values
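
The last stage can be sketched as follows. The column names FILE_PATH and PUID are assumed to match the header of the DROID CSV export and should be verified against the export produced by your DROID version; the function name is hypothetical. Files that end up with more than one PUID are the ambiguous cases discussed above.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def puids_by_file(csv_text: str) -> Dict[str, List[str]]:
    """Group PUID values by file path from a DROID-style CSV export."""
    result = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One row is expected per identification; skip rows without a PUID
        # (for example folders or unidentified files).
        if row.get("PUID"):
            result[row["FILE_PATH"]].append(row["PUID"])
    return dict(result)
```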

Placing the format identifier somewhere at the front is preferred, because if any ingest issue related to particular file occurs in subsequent task, the identified format will be linked with the issue.

Fixity checker

Fixity checker task verifies checksums of SIP files. Fixity checker supports various methods of verification depending on the format of the SIP. At least one of the methods described below must be set in JSON config.

Additionally, the task may be also configured with packageType (METS/BAGIT).

  • COMMON - fixity checker scans whole SIP for all files with .md5/.sha1/.sha256/.sha512 extensions and verifies fixities of all files specified in those checksum files
    • checker uses following regex: (\w+)[*\s]+(\S+) to parse the checksum (group 1) and path to the file (2)
    • If the path starts with slash (forward or backward), system resolves the path with the SIP root folder and checks for file existence. If no file exists on that path, system falls back to the relative path resolution. Relative path resolution is done against the parent of the checksum file.
  • BAGIT - system looks for BAGIT manifest files (at the root of the package as per BAGIT specification) and verifies provided fixities
  • METS - system expects the main metadata file configured in Sip Profile to be a file containing METS metadata and verifies the fixities provided in that METS file
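
The COMMON method's line parsing and path resolution can be sketched with the regex quoted above. The function names are hypothetical, and two details are assumptions rather than documented behaviour: the regex is applied from the start of each trimmed line, and the leading slash is stripped before the fallback resolution against the parent of the checksum file.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Regex quoted in the text: checksum in group 1, file path in group 2.
CHECKSUM_LINE = re.compile(r"(\w+)[*\s]+(\S+)")


def parse_checksum_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (checksum, path) for one line of a checksum file, or None."""
    match = CHECKSUM_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def resolve_path(raw_path: str, sip_root: Path, checksum_file: Path) -> Path:
    """Resolve a path from a checksum file as described for the COMMON method."""
    stripped = raw_path.lstrip("/\\")
    if raw_path.startswith(("/", "\\")):
        candidate = sip_root / stripped
        if candidate.exists():
            return candidate
    # Relative resolution (also the fallback) uses the checksum file's parent.
    return checksum_file.parent / stripped
```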

There are three types of errors that can occur during the verification:

  1. some file has invalid checksum
  2. some file is missing
  3. there is an unsupported checksum type specified in the SIP META XML

It is possible to set any of these three types of errors to be ignored using the Workflow config. The error and its subsequent solution is later written to the ARCLibXml as a respective premis:event.

Antivirus

Scans the SIP package for viruses. The type of antivirus software is initialized from the Workflow config. Currently the only supported tool is ClamAV. If a virus is found, one of the following actions is performed, depending on the configuration in the Workflow config:

  1. IGNORE: the infected files are ignored and Ingest workflow process continues
  2. QUARANTINE: the infected files are moved to the quarantine and the Ingest workflow process is stopped
  3. CANCEL: the Ingest workflow process is stopped

The error and its subsequent solution is later written to the ARCLibXml as a respective premis:event.

Validator

Validates SIP using the given validation profile. If the validation has failed, corresponding error is thrown with the reason of the validation failure and the Ingest workflow process is stopped. There are three types of checks in a validation profile:

  1. check for existence of specified files
  2. validation against XSD schema of specified XML files
  3. checks of values of particular nodes specified by XPath 3.1 (without namespace prefixes) in the XML files on a specified file path

In case of a validation error the ingest workflow is canceled. The validation profile must then be changed, or an altered SIP package must be ingested.

Error handling tasks

Bpm error handler

Bpm error handler is a task for executing the specified routines (relative to the given Ingest workflow) after an error occurs that cannot be resolved using an altered Workflow config.

Storage error handler

Similar to Bpm error handler, executed specifically if AIP storage has failed too many times or Archival storage takes too long to process the AIP.

Workflow config

The ingest workflow process can be configured with a JSON config that specifies the parameters for the particular BPM tasks.

Sample JSON config (contains all possible configuration parameters):

{
  "sipProfile": "1",
  "validationProfile": "1",
  "continueOnDuplicateSip": false,
  "deletePreviousSipVersion": false,
  "systemWideValidation": {
    "missingNodesAfterXsltAction": "IGNORE",
    "missingNodesAfterFinalValidationAction": "CANCEL"
  },
  "fixityCheck": {
    "0": {
      "methods": "COMMON, METS",
      "continueOnMissingFiles": true,
      "continueOnUnsupportedChecksumType": true,
      "continueOnInvalidChecksums": true
    }
  },
  "antivirus": {
    "0": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "QUARANTINE"
    }
  },
  "formatIdentification": {
    "0": {
      "type": "DROID",
      "pathsAndFormats": {
        "0": {
          "filePath": "this/is/a/filepath",
          "format": "fmt/101"
        },
        "1": {
          "filePath": "this/is/another/filepath",
          "format": "fmt/993"
        }
      }
    }
  },
  "sipmerger": {
    "move": [
      {
        "regex": "amdsec/amd_mets_(.+).xml",
        "replacement": "amdsec/renamed_$1.xml"
      },
      {
        "regex": "(info)_7033d800-0935-11e4-beed-5ef3fc9ae867(.xml)",
        "replacement": "moved/$1$2"
      }
    ],
    "reduce": {
      "regexes": [
        "alto/alto_.+_000\\d.xml",
        "txt"
      ],
      "mode": "DELETE"
    }
  }
}

Workflow config must not contain JSON arrays (no current task supports them and no new task should support them in the future, as the config merge feature does not support arrays). If a list of values needs to be passed, a special type of object should be used: the object keys are the ordinal numbers of the list items and the values are the list items. The creator of the config must ensure that the keys are sorted in the JSON by ordinal number, otherwise the system may silently shuffle the order and behave incorrectly. Valid object list: {"0":"first item","1":"other item"}, invalid object list: {"1":"other item","0":"first item"}.
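
A config author can check an object list before submitting it. The following sketch, with a hypothetical function name, parses such an object and rejects keys that are not in ascending ordinal order; it relies on Python's `json` module preserving the key order of the JSON text and says nothing about how ARCLib itself parses the config.

```python
import json
from typing import Any, List


def object_list_to_list(raw_json: str) -> List[Any]:
    """Convert an ordinal-keyed JSON object into a list, checking key order."""
    obj = json.loads(raw_json)  # the resulting dict keeps the key order of the text
    ordinals = [int(key) for key in obj]  # non-numeric keys raise ValueError
    if ordinals != sorted(ordinals):
        raise ValueError(f"object list keys are not sorted: {list(obj)}")
    return list(obj.values())
```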

If there are multiple instances of the same task in the Workflow definition, e.g. multiple Antivirus instances, a separate configuration must be written for every instance. The configuration is mapped to the instances in the order in which they are mentioned in the JSON.

E.g. following would mean that the first CLAMAV instance in the Workflow definition is configured to QUARANTINE infected files while the second one to IGNORE them.

{
  "antivirus": {
    "0": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "QUARANTINE"
    },
    "1": {
      "type": "CLAMAV",
      "cmd": {
        "0": "clamscan",
        "1": "-r"
      },
      "infectedSipAction": "IGNORE"
    }
  }
}

Documentation of all config options follows:
$n should be replaced with the ordinal number of the configured task (for example, if there are 2 antivirus tasks in the BPM workflow definition, 0 is the ordinal number of the first one and 1 of the second one, as in the samples above)

Duplicate SIP check

| JSON path | Description | Supported values |
| --- | --- | --- |
| /continueOnDuplicateSip | whether the Ingest workflow process should continue if the XML versioning is detected | true / false |

SIP merger

| JSON path | Description | Supported values |
| --- | --- | --- |
| /sipmerger/move | list of {regex,replacement} pairs specifying files/folders which should be moved/renamed; all moves are made before any reduction (see reduction config below) | - |
| /sipmerger/reduce/regexes | list of regular expressions specifying which files/folders of the previous SIP version should be deleted/kept | - |
| /sipmerger/reduce/mode | whether the files/folders matching the regexes should be kept or deleted | one of KEEP, DELETE |
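
The effect of the move and reduce options on a list of relative paths can be sketched as below, using the values from the sample config. Several points are assumptions made only for illustration: a move regex must match the whole path, a reduce regex is matched from the start of the path, KEEP retains only matching paths, and the `$1`-style group references of the sample are translated to Python syntax. All function names are hypothetical.

```python
import re
from typing import Dict, List


def to_python_replacement(replacement: str) -> str:
    """Translate $1-style group references into Python's group syntax."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def apply_moves(paths: List[str], moves: List[Dict[str, str]]) -> List[str]:
    """Rename every path that fully matches a move regex."""
    result = []
    for path in paths:
        for move in moves:
            pattern = re.compile(move["regex"])
            if pattern.fullmatch(path):
                path = pattern.sub(to_python_replacement(move["replacement"]), path)
        result.append(path)
    return result


def apply_reduce(paths: List[str], regexes: List[str], mode: str) -> List[str]:
    """Delete or keep the paths matching any of the reduce regexes."""
    compiled = [re.compile(regex) for regex in regexes]

    def matches(path: str) -> bool:
        return any(pattern.match(path) for pattern in compiled)

    if mode == "DELETE":
        return [path for path in paths if not matches(path)]
    if mode == "KEEP":
        return [path for path in paths if matches(path)]
    raise ValueError(f"unsupported mode: {mode}")
```

Moves are applied first and the reduction second, in line with the note in the table that all moves are made before any reduction.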

There is currently no configurable option for the SIP merger task.

Format identifier

| JSON path | Description | Supported values |
| --- | --- | --- |
| /formatIdentification/$n/type | type of format identification tool | currently only one supported type: DROID |
| /formatIdentification/$n/pathsAndFormats | file paths to files and respective predefined formats | - |
| /formatIdentification/$n/pathsAndFormats/filePath | regex specifying file paths to files | e.g. this/is/a/filepath |
| /formatIdentification/$n/pathsAndFormats/format | PUID of the predefined format | e.g. fmt/993 |

Fixity checker

| JSON path | Description | Supported values |
| --- | --- | --- |
| /fixityCheck/$n/continueOnInvalidChecksums | whether the Ingest workflow process should continue if a file with invalid checksum is found | true / false |
| /fixityCheck/$n/continueOnUnsupportedChecksumType | whether the Ingest workflow process should continue if an unsupported checksum type is found in the SIP METADATA file | true / false |
| /fixityCheck/$n/continueOnMissingFiles | whether the Ingest workflow process should continue if some file specified in SIP META XML is missing | true / false |
| /fixityCheck/$n/methods | configuration of fixity check methods | string containing comma separated list of enum values, at least 1 is required: METS, BAGIT, COMMON |

Antivirus

| JSON path | Description | Supported values |
| --- | --- | --- |
| /antivirus/$n/infectedSipAction | action to perform if an infected file is found | one of QUARANTINE, IGNORE, CANCEL |
| /antivirus/$n/type | type of antivirus tool | currently only single one: CLAMAV |
| /antivirus/$n/cmd | antivirus executable, with full path if not in $PATH variable, with switches | |

Validator

| JSON path | Description | Supported values |
| --- | --- | --- |
| /validationProfile | external ID of the validation profile to be used during validation task | |

Note that this value is taken into account only in overriding configs (configs of single ingest / routine / incident solution). Specification of the value at the Producer Profile level config is not required and is ignored. At the Producer Profile level, system will always use the validation profile linked with the particular Producer Profile.

ARCLibXML extractor

| JSON path | Description | Supported values |
| --- | --- | --- |
| /sipProfile | external ID of the SIP profile to be used during ARCLibXML extractor task | |
| /systemWideValidation/missingNodesAfterXsltAction | rule to decide whether to cancel the ingest process or ignore missing nodes and continue | one of IGNORE, CANCEL |

Note that this value is taken into account only in overriding configs (configs of single ingest / routine / incident solution). Specification of the value at the Producer Profile level config is not required and is ignored. At the Producer Profile level, system will always use the SIP profile linked with the particular Producer Profile.

ARCLibXML generator

| JSON path | Description | Supported values |
| --- | --- | --- |
| /systemWideValidation/missingNodesAfterFinalValidationAction | rule to decide whether to cancel the ingest process or ignore missing nodes and continue | one of IGNORE, CANCEL |

Error handling

There are the following types of exceptions:

  1. IncidentException: an exception that can be solved by a change of Workflow config (or by an admin side-effect action followed by reuse of the same config). If thrown, it is caught by CustomIncidentHandler, which creates a new Incident associated with the Ingest workflow; the Incident waits to be resolved from the GUI.
    1. ConfigParserException: a special type of IncidentException indicating that the Workflow config has a corrupted format and cannot be parsed properly.
    2. CommandLineProcessException: a special type of IncidentException indicating that an external process started from Java ProcessBuilder (e.g. the clamav or droid binary) has failed.
  2. RuntimeException: exceptions that cannot be solved by a change of Workflow config or an admin side action, mostly unexpected errors (breakdown of DB, filesystem etc.). In newer versions of ARCLib, the possibility of such errors is reduced: in many cases IncidentException is thrown instead, even if the reason is not known to the system and it is unclear whether an admin can solve the incident (see the catch block of the ARCLibDelegate#execute method).
  3. BpmError: a special type of RuntimeException representing business errors that cause cancellation of the ingest workflow process. They are thrown from the source code, e.g. to finish the Ingest workflow process if a virus is found or in case of a duplicate SIP.

These exceptions are handled differently:

  1. BpmErrorHandlerDelegate: handles BpmErrors, triggers IngestErrorHandler
  2. CustomIncidentHandler: handler for exceptions not caught by BpmErrorHandlerDelegate. It differentiates between reconfigurable errors (IncidentException and its subclasses, which can be solved by the Workflow config or an admin side action) and non-reconfigurable errors. Reconfigurable errors cause creation of an Incident and non-reconfigurable errors trigger execution of IngestErrorHandler.
  3. IngestErrorHandler: not triggered by any exception, but by the programmer from the code, does the following:
    1. assigns failure info to Ingest workflow (that can be later retrieved) and sets Ingest workflow processing state to FAILED
    2. kills Ingest workflow process
    3. deactivates AIP update lock

Lifecycle of an Ingest process

Throughout the processing of a SIP, the Ingest workflow transitions between the following states: NEW, PROCESSING, PROCESSED / FAILED, PERSISTED.

| Ingest workflow state | Moment when IW comes to the state |
| --- | --- |
| NEW | when processing of SIP located at transfer area is initiated |
| PROCESSING | when a free instance of Worker is assigned to process the SIP |
| FAILED | when an unrecoverable error occurs (set in IngestErrorHandler) |
| PROCESSED | at the end of execution of ARCLibXml generator task, when the ARCLibXml has been stored to workspace |
| PERSISTED | when the AIP has been persisted to Archival storage |
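
The lifecycle can be modelled as a small state machine. The state names come from the table above, while the transition map is an inference from the order in which the states are listed; in particular, which states FAILED can be reached from is an assumption, and all identifiers are hypothetical.

```python
from enum import Enum


class IngestWorkflowState(Enum):
    NEW = "NEW"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    PERSISTED = "PERSISTED"
    FAILED = "FAILED"


S = IngestWorkflowState

# Assumed transitions; PERSISTED and FAILED are treated as terminal states.
ALLOWED_TRANSITIONS = {
    S.NEW: {S.PROCESSING},
    S.PROCESSING: {S.PROCESSED, S.FAILED},
    S.PROCESSED: {S.PERSISTED, S.FAILED},
    S.PERSISTED: set(),
    S.FAILED: set(),
}


def can_transition(current: IngestWorkflowState, target: IngestWorkflowState) -> bool:
    """Return True if the assumed lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```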