Ingest workflow
This page describes how a SIP package is transformed into an AIP package by the BPM process called Ingest workflow.
Ingest workflow consists of mandatory and optional BPM tasks that represent the elementary parts of the transformation process. All tasks grouped together in the subprocess labeled finalize ingest and format identifier are mandatory, other tasks (from duplicate SIP check to validator) are optional. The SIP processing starts in the init event and finishes in the ingest success event in case of success or in the ingest error event in case of failure.
The picture shows the BPM process definition of ingestWorkflow.bpmn opened in Camunda BPMN software.
Just before the execution of the BPM process starts, the SIP package is preprocessed. This includes the following steps:
- verification of the hash of the incoming SIP package to the hash value supplied in the *.sums file
- copying of SIP package content to workspace
- creation or assignment of existing authorial package and SIP package according to the determined level of versioning
- initialization of Ingest workflow and the BPM process variables
There are two levels of versioning (XML versioning and SIP versioning) and two types of related AIPs linkage resolution (explicit and implicit). ARCLib automatically determines the linkage and the versioning type.
XML versioning is performed if the implicit linkage is applied and the SIP with the highest version number among the SIPs belonging to the same authorial package has the same checksum as the incoming SIP.
The SIP is ingested to produce a new AIP XML, which is stored to the Archival Storage next to the previous SIP. The minor version is incremented.
SIP versioning is performed if the explicit linkage is applied, or if the implicit linkage is applied and the SIP with the highest version number among the SIPs belonging to the authorial package has a different checksum than the incoming SIP.
The SIP is ingested to produce a new AIP XML and is stored together with the AIP XML to the Archival Storage. The major version is incremented.
If the JSON config of the ingest workflow has deletePreviousSipVersion option set to true, then deletion request of the previous SIP version is automatically created at the end of the ingest. If the option is not set in JSON config, default is false.
Explicit linkage may apply if the ARCLib_export_info.csv file is present in the SIP root. The file is parsed and deleted during preprocessing, and if it contains the authorial_package_uuid property, the authorial package with the corresponding UUID is linked.
- if the file is found but the property is not included, implicit linkage is applied
- if the property is included but the authorial package with the provided UUID is not found, the process fails
- explicit linkage always results in SIP versioning (since the presence of ARCLib_export_info.csv in the SIP ZIP implies that checksum of the SIP ZIP != checksum of the AIP ZIP)
Even if explicit linkage is applied, the incoming SIP is still scanned for the authorial ID. If the authorial ID is found on the configured path and it differs from the one recorded in DB, then the authorial ID of the linked authorial package is updated at the end of the ingest.
Implicit linkage is applied if the explicit linkage is not applied.
In case the extracted authorial id in combination with the producer id matches an existing authorial package in database, the SIP/XML versioning is triggered, otherwise new authorial package is created and no versioning is performed.
See Path to XML file with authorial ID and XPath to node with authorial ID at Usage@Sip profiles
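The linkage and versioning rules described above can be summarized in a short Python sketch. The function name, its parameters and the returned labels are illustrative assumptions and are not taken from the ARCLib code base.

```python
from typing import Optional


def decide_versioning(explicit_linkage: bool,
                      latest_sip_checksum: Optional[str],
                      incoming_checksum: str) -> str:
    """Return "SIP", "XML" or "NONE" following the rules described above.

    latest_sip_checksum is the checksum of the highest-version SIP of the
    matched authorial package, or None when no authorial package matched
    (hypothetical convention used only in this sketch).
    """
    if explicit_linkage:
        # Explicit linkage always results in SIP versioning.
        return "SIP"
    if latest_sip_checksum is None:
        # Implicit linkage without a match: a new authorial package is
        # created and no versioning is performed.
        return "NONE"
    if latest_sip_checksum == incoming_checksum:
        # Same SIP content as the latest version: only the XML is versioned.
        return "XML"
    # Different SIP content: the whole SIP is versioned.
    return "SIP"


if __name__ == "__main__":
    print(decide_versioning(False, "abc", "abc"))
    print(decide_versioning(False, "abc", "def"))
```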
This task generates three types of fixity for the whole SIP (.zip archive with the SIP content) using algorithms MD5, Sha512 and Crc32. The result is written in ARCLibXml to the premis:objectCharacteristics element of premis:object and the respective event is recorded in the premis:event of type message digest calculation.
This task also generates Sha512 of every SIP file and the result is written in ARCLibXml mets:filesec.
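As an illustration of the three fixity types, the following Python sketch computes MD5, SHA-512 and CRC32 of a file in a single pass. It uses only the Python standard library; the function name and the shape of the returned dictionary are assumptions made for this example, and ARCLib itself performs this step in its own code.

```python
import hashlib
import zlib
from pathlib import Path


def compute_fixities(path: Path, chunk_size: int = 1024 * 1024) -> dict:
    """Compute MD5, SHA-512 and CRC32 of one file in a single pass."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    crc = 0
    with path.open("rb") as handle:
        # Read in chunks so that large SIP archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {
        "MD5": md5.hexdigest(),
        "SHA-512": sha512.hexdigest(),
        # CRC32 rendered as 8 lowercase hex digits (a formatting assumption).
        "CRC32": format(crc & 0xFFFFFFFF, "08x"),
    }
```

The same function can be applied to every file of the SIP to obtain the per-file SHA-512 values.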
This task extracts metadata from original SIP using XSLT specified in a SIP profile.
From the XML files of the SIP package ARCLib extracts the specified metadata and produces the primary version of ARClibXml. The process of the extraction is defined using XSLT template that is stored in the SIP profile together with the path to the SIP META XML of the SIP (e.g. main METS file).
The sample XSLT template comprehensiveSipProfile.xsl provides the following mapping (it can be used as the starting point when defining custom templates).
Identity mapping:
XPath | Value |
---|---|
/METS:mets/@LABEL | /METS:mets/@LABEL |
/METS:mets/@TYPE | /METS:mets/@TYPE |
/METS:mets/METS:metsHdr/METS:agent | /METS:mets/METS:metsHdr/METS:agent |
/METS:mets/METS:dmdSec | /METS:mets/METS:dmdSec |
Aggregated mapping:
Elements ARCLIB:formats, ARCLIB:devices, ARCLIB:eventAgents, ARCLIB:ImageCaptureMetadata and ARCLIB:creatingApplications are computed using aggregation. The source values are extracted from files located in folder amdSec with filenames matching regex amd*.xml. For the details see the template.
System validates the result of the extraction against XSDs (METS, PREMIS, ARCLIB and other XSDs referenced from the resulting XML).
System also checks existence of required nodes documented in arclibXmlSystemWideValidationConfig.csv (those where Data source starts with SIP profile XSLT). If some node is missing and no JSON config at systemWideValidation declaring how to handle this is present, incident is thrown.
This task generates additional parts of ARCLibXML using the extracted metadata and the values computed during the ingest workflow process. It appends these parts to the extracted metadata from the task ARCLibXML extractor.
The ARCLibXML generation consists of the following phases:
- if necessary, changing of the mets namespace prefix to upper case METS
- adding METS:OBJID
- filling METS:metsHdr, the element METS:metsHdr must exist in the XML created during the metadata extraction using XSLT (in the first phase)
- adding SIP and XML versions and related SIP and XML
- adding premis:agents and respective premis:events
- adding premis:object for whole package
- adding METS:fileSec
- adding METS:structMap
The generated ARCLibXML is validated using ARCLibXmlValidator. Validation process consists of three parts:
- XML schema validation - METS, ARCLIB_XSD, PREMIS
- Checking existence of required nodes - Checks existence of required nodes documented in arclibXmlSystemWideValidationConfig.csv (those where Data source does not start with SIP profile XSLT). If some node is missing and no JSON config at systemWideValidation declaring how to handle this is present, incident is thrown.
- Checking content of some nodes against database - checking the consistency of some XML nodes against database (AIP ID, XML ID, profiles, timestamps, related packages)
This task stores the AIP to archival storage. The archival storage is accessed through a REST interface and, depending on the level of versioning of the SIP, the task calls either the endpoint for XML update or the endpoint for storing an AIP. When debug mode is active, an internal debugging version of archival storage is used instead of the real archival storage. If the Archival Storage is unreachable or returns a non-standard result, ARCLib tries to repeat the task. The number of attempts and the interval between attempts are configurable.
This task verifies that archival storage has succeeded in persisting the SIP (or in updating the XML, respectively). Archival storage is asked for the state of the AIP. In case the state is:
a) ARCHIVED:
- ingest workflow state is set to PERSISTED and ARCLib XML document is indexed with ARCHIVED state (ingest workflow states are further described at the end of this page, ARCLibXml document states are described in WIKI page Usage/Aip search)
- JMS message is sent to Coordinator to inform the batch that the ingest workflow process has finished
- SIP content is deleted from workspace
- SIP content is deleted from transfer area
b) PROCESSING or PRE_PROCESSING:
These states indicate that the AIP has been successfully transferred to Archival Storage, but Archival Storage has not yet saved the AIP to all the storage services (ZFS, CEPH etc.). BPM variable aipSavedCheckAttempts is decremented and the Storage success verifier task is repeated after waiting for the time specified in the variable aipSavedCheckAttemptsInterval.
c) ARCHIVAL_FAILURE, ROLLED_BACK or ROLLBACK_FAILURE:
These states indicate that the AIP has been transferred to Archival Storage but its storage failed afterwards (e.g. one of the logical storages has failed). BPM variable aipStoreAttempts is decremented and the BPM process execution returns to the Archival storage task to repeat the AIP storage after waiting for the time specified in the variable aipStoreAttemptsInterval.
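The state handling described above can be sketched as a small Python function. The action labels, the function name and the rule that the storage error handler takes over once a counter reaches zero are assumptions made for illustration; the real behaviour is defined by the BPM process definition.

```python
from typing import Tuple

WAITING_STATES = {"PROCESSING", "PRE_PROCESSING"}
FAILURE_STATES = {"ARCHIVAL_FAILURE", "ROLLED_BACK", "ROLLBACK_FAILURE"}


def next_step(aip_state: str,
              aip_saved_check_attempts: int,
              aip_store_attempts: int) -> Tuple[str, int, int]:
    """Return (action, remaining check attempts, remaining store attempts)."""
    if aip_state == "ARCHIVED":
        # Workflow becomes PERSISTED, workspace and transfer area are cleaned.
        return ("FINISH", aip_saved_check_attempts, aip_store_attempts)
    if aip_state in WAITING_STATES:
        aip_saved_check_attempts -= 1
        # Assumption: when no attempts remain, the storage error handler runs.
        action = "RECHECK" if aip_saved_check_attempts > 0 else "STORAGE_ERROR_HANDLER"
        return (action, aip_saved_check_attempts, aip_store_attempts)
    if aip_state in FAILURE_STATES:
        aip_store_attempts -= 1
        action = "STORE_AGAIN" if aip_store_attempts > 0 else "STORAGE_ERROR_HANDLER"
        return (action, aip_saved_check_attempts, aip_store_attempts)
    raise ValueError(f"unexpected AIP state: {aip_state}")


if __name__ == "__main__":
    print(next_step("PROCESSING", 3, 2))
```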
Non-compulsory tasks may be placed at an arbitrary place within the Ingest workflow pipeline before the finalize ingest group.
If automatic XML versioning is not suitable, this task may block ingest workflow (let it fail) instead of creating new XML version of the package.
If SIP versioning is applied, this task downloads the previous version of the package from the Archival Storage to workspace and rewrites its data with the data of the incoming SIP or changes its content according to the provided JSON config.
- if incoming SIP does not contain file contained in previous version, file from previous version is kept
- if incoming SIP contains file not contained in previous version, file from incoming SIP is added
- if both, incoming SIP and previous version contain a file, file from incoming SIP overwrites file from previous version
If SIP versioning is not applied, this task passes with no action.
JSON config allows modification of the older SIP version before it is merged with the new incoming SIP, see JSON config doc.
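The three merge rules amount to a dictionary update in which the incoming SIP wins on conflicts. The Python sketch below models each SIP version as a mapping from relative file path to file content; this representation and the function name are assumptions chosen for brevity, and the optional move/reduce modifications of the previous version are not covered.

```python
def merge_sip_versions(previous: dict, incoming: dict) -> dict:
    """Merge two SIP versions given as {relative path: content} mappings."""
    merged = dict(previous)   # files only in the previous version are kept
    merged.update(incoming)   # new files are added, shared files are overwritten
    return merged


if __name__ == "__main__":
    previous = {"mets.xml": "old mets", "txt/page1.txt": "old text"}
    incoming = {"mets.xml": "new mets", "img/page1.jp2": "new image"}
    for path, content in sorted(merge_sip_versions(previous, incoming).items()):
        print(path, "->", content)
```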
Even though Format identifier is not compulsory, the output of the format identifier is by default required in the system-wide validation done at ARClib XML generator, see arclibXmlSystemWideValidationConfig.csv. Even though this validation may be skipped by JSON config, the methodological recommendation is to include this task in your workflow.
This task performs the format analysis of files in SIP. For every file of SIP it uniquely determines a single file format. The result is written in ARCLibXml to the element ARCLIB:formats as the aggregated formats and the tool used in the identification is written as premis:agent (tool name, tool version). The version in case of DROID contains also the signature files. Moreover, the associated events are written as premis:events. This includes the event of a successful identification and also the events when the identification tool failed to uniquely determine the format.
In some cases the identification tool identifies a file with multiple formats. It is possible to resolve the format ambiguity with the predefined values specified in Workflow config. If the predefined values helped to resolve the ambiguity, the config used is written to the respective premis:event created specially for the ambiguity problem. If there are any files whose format cannot be determined by the config, a new incident is created that contains information about the problematic files. After that ARCLib waits for the user to provide a new config that manually resolves the format conflict.
The type of format identification tool is initialized based on the Workflow config. Currently there is only a single format identification tool: DROID.
Format identification with DROID consists of these stages:
- run profile: DROID is passed with variables to perform recursive search -R, to create profile from SIP -a and -p to save result to the specified file
- export profile: DROID is called to export the result of the specified profile to a CSV file with one row for each format for each file profiled (if a file has multiple format identifications, then a separate row will be written out for each identification made)
- parse CSV: from the CSV file with the exported profile ARCLib parses out the PUID values
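The last stage, parsing the exported CSV, can be sketched in Python as follows. The sketch assumes that the export contains the columns FILE_PATH and PUID and one row per identified format; the column names should be checked against the DROID version in use, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def puids_by_file(csv_text: str) -> dict:
    """Group PUID values by file path from a one-row-per-format CSV export."""
    result = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        puid = (row.get("PUID") or "").strip()
        if puid:
            result[row["FILE_PATH"]].append(puid)
    return dict(result)


def ambiguous_files(puids: dict) -> list:
    """Return the files for which more than one format was identified."""
    return sorted(path for path, values in puids.items() if len(values) > 1)


if __name__ == "__main__":
    sample = (
        "FILE_PATH,PUID\n"
        "/sip/mets.xml,fmt/101\n"
        "/sip/data.bin,fmt/993\n"
        "/sip/data.bin,x-fmt/111\n"
    )
    formats = puids_by_file(sample)
    print(formats)
    print(ambiguous_files(formats))
```

Files reported as ambiguous are the ones that the pathsAndFormats part of the Workflow config would need to resolve.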
Placing the format identifier somewhere at the front is preferred, because if any ingest issue related to particular file occurs in subsequent task, the identified format will be linked with the issue.
Fixity checker task verifies checksums of SIP files. Fixity checker supports various methods of verification depending on the format of the SIP. At least one of the methods described below must be set in JSON config.
Additionally, the task may be also configured with packageType (METS/BAGIT).
- COMMON - fixity checker scans whole SIP for all files with .md5/.sha1/.sha256/.sha512 extensions and verifies fixities of all files specified in those checksum files
  - checker uses following regex: (\w+)[*\s]+(\S+) to parse the checksum (group 1) and path to the file (2)
  - If the path starts with slash (forward or backward), system resolves the path with the SIP root folder and checks for file existence. If no file exists on that path, system falls back to the relative path resolution. Relative path resolution is done against the parent of the checksum file.
- BAGIT - system looks for BAGIT manifest files (at the root of the package as per BAGIT specification) and verifies provided fixities
- METS - system expects the main metadata file configured in Sip Profile to be a file containing METS metadata and verifies the fixities provided in that METS file
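To show how the regex used by the COMMON method splits a line of a checksum file, here is a minimal Python sketch. The regex is taken from the text above; the function name, the use of a match anchored at the start of the stripped line, and the sample lines are assumptions of this example.

```python
import re
from typing import Optional, Tuple

# Group 1: checksum, group 2: path to the file.
CHECKSUM_LINE = re.compile(r"(\w+)[*\s]+(\S+)")


def parse_checksum_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (checksum, file path) or None if the line does not match."""
    match = CHECKSUM_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_checksum_line("d41d8cd98f00b204e9800998ecf8427e *data/file.txt"))
    print(parse_checksum_line("d41d8cd98f00b204e9800998ecf8427e  /data/file.txt"))
```

Both the binary-mode marker (an asterisk) and plain whitespace are accepted as the separator between the checksum and the path.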
There are three types of errors that can occur during the verification:
- some file has invalid checksum
- some file is missing
- there is an unsupported checksum type specified in the SIP META XML
It is possible to set any of these three types of errors to be ignored using the Workflow config. The error and its subsequent solution is later written to the ARCLibXml as a respective premis:event.
Scans SIP package for viruses. Type of antivirus software is initialized in the Workflow config. Currently the only supported tool is ClamAV. If a virus is found, one of the following actions is performed, depending on the configuration in Workflow config:
- IGNORE: the infected files are ignored and Ingest workflow process continues
- QUARANTINE: the infected files are moved to the quarantine and the Ingest workflow process is stopped
- CANCEL: the Ingest workflow process is stopped
The error and its subsequent solution is later written to the ARCLibXml as a respective premis:event.
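The mapping from a scan result to the configured action can be sketched in Python as follows. The sketch relies on the documented clamscan exit codes (0 for no virus found, 1 for virus found, 2 for an error); the function name, the returned labels and the use of RuntimeError for a failed scan are assumptions of this example.

```python
def handle_scan_result(return_code: int, infected_sip_action: str) -> str:
    """Map a clamscan exit code and the configured action to an outcome."""
    if return_code == 0:
        # No virus found: the workflow simply continues.
        return "CONTINUE"
    if return_code == 1:
        if infected_sip_action == "IGNORE":
            return "CONTINUE"
        if infected_sip_action == "QUARANTINE":
            return "QUARANTINE_AND_STOP"
        if infected_sip_action == "CANCEL":
            return "STOP"
        raise ValueError(f"unsupported infectedSipAction: {infected_sip_action}")
    # Any other exit code means that the scan itself failed.
    raise RuntimeError(f"antivirus process failed with exit code {return_code}")


if __name__ == "__main__":
    print(handle_scan_result(1, "QUARANTINE"))
```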
Validates SIP using the given validation profile. If the validation has failed, corresponding error is thrown with the reason of the validation failure and the Ingest workflow process is stopped. There are three types of checks in a validation profile:
- check for existence of specified files
- validation against XSD schema of specified XML files
- checks of values of particular nodes specified by XPath 3.1 (without namespace prefixes) in the XML files on a specified file path
In case of a validation error the ingest workflow is canceled. To proceed, either change the validation profile or ingest an altered SIP package.
Bpm error handler is a task that executes the specified routines (relative to the given Ingest workflow) after an error occurs that cannot be resolved by an altered Workflow config.
Similar to Bpm error handler, but executed specifically if AIP storage has failed too many times or Archival storage takes too long to process the AIP.
The ingest workflow process can be configured with a JSON config that specifies the parameters for the particular BPM tasks.
Sample JSON config (contains all possible configuration parameters):
{
"sipProfile": "1",
"validationProfile": "1",
"continueOnDuplicateSip": false,
"deletePreviousSipVersion": false,
"systemWideValidation": {
"missingNodesAfterXsltAction": "IGNORE",
"missingNodesAfterFinalValidationAction": "CANCEL"
},
"fixityCheck": {
"0": {
"methods": "COMMON, METS",
"continueOnMissingFiles": true,
"continueOnUnsupportedChecksumType": true,
"continueOnInvalidChecksums": true
}
},
"antivirus": {
"0": {
"type": "CLAMAV",
"cmd": {
"0": "clamscan",
"1": "-r"
},
"infectedSipAction": "QUARANTINE"
}
},
"formatIdentification": {
"0": {
"type": "DROID",
"pathsAndFormats": {
"0": {
"filePath": "this/is/a/filepath",
"format": "fmt/101"
},
"1": {
"filePath": "this/is/another/filepath",
"format": "fmt/993"
}
}
}
},
"sipmerger": {
"move": [
{
"regex": "amdsec/amd_mets_(.+).xml",
"replacement": "amdsec/renamed_$1.xml"
},
{
"regex": "(info)_7033d800-0935-11e4-beed-5ef3fc9ae867(.xml)",
"replacement": "moved/$1$2"
}
],
"reduce": {
"regexes": [
"alto/alto_.+_000\\d.xml",
"txt"
],
"mode": "DELETE"
}
}
}
Workflow config must not contain JSON arrays (no current task supports them and no new one should support them in the future, as the config merge feature does not support arrays). If there is a need to pass a list of values, a special type of object should be used. The object keys are ordinal numbers of the items in the list and the values are the list items. The creator of the config must ensure that keys are sorted in the JSON by the ordinal number, otherwise the system may silently shuffle the order and behave incorrectly. Valid object list: {"0":"first item","1":"other item"}, invalid object list: {"1":"other item","0":"first item"}.
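A config author can check an object list before submitting it. The following Python sketch rejects JSON arrays and objects whose keys do not appear in ascending ordinal order starting at 0; the function name is hypothetical, the start at 0 follows the sample configs on this page, and the check mirrors the rule stated above rather than ARCLib's own parser.

```python
import json


def object_list_to_list(raw_json: str) -> list:
    """Convert an object list such as {"0": "a", "1": "b"} to ["a", "b"]."""
    parsed = json.loads(raw_json)
    if not isinstance(parsed, dict):
        raise ValueError("object list must be a JSON object, not an array or scalar")
    # json.loads keeps the keys in the order in which they appear in the text.
    keys = list(parsed)
    expected = [str(index) for index in range(len(keys))]
    if keys != expected:
        raise ValueError(f"keys {keys} are not the ordered ordinals {expected}")
    return list(parsed.values())


if __name__ == "__main__":
    print(object_list_to_list('{"0":"first item","1":"other item"}'))
```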
If there are more instances of the same task in the Workflow definition, e.g. multiple Antivirus instances, it is needed to write separate configuration for every instance. The configuration is mapped to the instances in the order as they are mentioned in JSON.
E.g. following would mean that the first CLAMAV instance in the Workflow definition is configured to QUARANTINE infected files while the second one to IGNORE them.
{
"antivirus": {
"0": {
"type": "CLAMAV",
"cmd": {
"0": "clamscan",
"1": "-r"
},
"infectedSipAction": "QUARANTINE"
},
"1": {
"type": "CLAMAV",
"cmd": {
"0": "clamscan",
"1": "-r"
},
"infectedSipAction": "IGNORE"
}
}
}
Documentation of all config options follows:
$n should be replaced with ordinal number of the configured task (for example, if there are 2 antivirus tasks in the BPM workflow definition, 1 is ordinal number of the first one, 2 of the second one)
JSON path | Description | Supported values |
---|---|---|
/continueOnDuplicateSip | whether the Ingest workflow process should continue if the XML versioning is detected | true / false |
JSON path | Description | Supported values |
---|---|---|
/sipmerger/move | list of {regex,replacement} pairs specifying files/folders which should be moved/renamed; all moves are made before any reduction (see reduction config below) | - |
/sipmerger/reduce/regexes | list of regular expressions specifying which files/folders of the previous SIP version should be deleted/kept | - |
/sipmerger/reduce/mode | whether the files/folders of the previous SIP version matching the regexes should be kept or deleted | one of KEEP , DELETE |
There is currently no configurable option for the SIP merger task.
JSON path | Description | Supported values |
---|---|---|
/formatIdentification/$n/type | type of format identification tool | currently only one supported type: DROID |
/formatIdentification/$n/pathsAndFormats | file paths to files and respective predefined formats | - |
/formatIdentification/$n/pathsAndFormats/filePath | regex specifying file paths to files | e.g. this/is/a/filepath |
/formatIdentification/$n/pathsAndFormats/format | PUID of the predefined format | e.g. fmt/993 |
JSON path | Description | Supported values |
---|---|---|
/fixityCheck/$n/continueOnInvalidChecksums | whether the Ingest workflow process should continue if a file with invalid checksum is found | true / false |
/fixityCheck/$n/continueOnUnsupportedChecksumType | whether the Ingest workflow process should continue if an unsupported checksum type is found in the SIP METADATA file | true / false |
/fixityCheck/$n/continueOnMissingFiles | whether the Ingest workflow process should continue if some file specified in SIP META XML is missing | true / false |
/fixityCheck/$n/methods | configuration of fixity check methods | string containing comma separated list of enum values, at least 1 is required: METS , BAGIT , COMMON |
JSON path | Description | Supported values |
---|---|---|
/antivirus/$n/infectedSipAction | action to perform if an infected file is found | one of QUARANTINE , IGNORE , CANCEL |
/antivirus/$n/type | type of antivirus tool | currently only single one: CLAMAV |
/antivirus/$n/cmd | antivirus executable, with full path if not in $PATH variable, with switches | |
JSON path | Description | Supported values |
---|---|---|
/validationProfile | external ID of the validation profile to be used during validation task | |
Note that this value is taken into account only in overriding configs (configs of single ingest / routine / incident solution). Specification of the value at the Producer Profile level config is not required and is ignored. At the Producer Profile level, system will always use the validation profile linked with the particular Producer Profile.
JSON path | Description | Supported values |
---|---|---|
/sipProfile | external ID of the SIP profile to be used during ARCLibXML extractor task | |
/systemWideValidation/missingNodesAfterXsltAction | rule to decide whether to cancel the ingest process or ignore missing nodes and continue | one of IGNORE , CANCEL |
Note that this value is taken into account only in overriding configs (configs of single ingest / routine / incident solution). Specification of the value at the Producer Profile level config is not required and is ignored. At the Producer Profile level, system will always use the SIP profile linked with the particular Producer Profile.
JSON path | Description | Supported values |
---|---|---|
/systemWideValidation/missingNodesAfterFinalValidationAction | rule to decide whether to cancel the ingest process or ignore missing nodes and continue | one of IGNORE , CANCEL |
There are the following types of exceptions:
- IncidentException: exception that could be solved by a change of Workflow config (or admin side-effect action and use of the same config). If thrown, it is caught by CustomIncidentHandler, which creates a new Incident associated to the Ingest workflow that waits to be resolved from GUI.
  - ConfigParserException: special type of IncidentException that indicates that the Workflow config has a corrupted format and is unable to be parsed properly.
  - CommandLineProcessException: special type of IncidentException that indicates that external process started from Java ProcessBuilder (e.g. clamav or droid binary) has failed.
- RuntimeException: exceptions unable to be solved by a change of Workflow config or admin side action, mostly unexpected errors (breakdown of DB, filesystem etc.). In newer versions of ARCLib, the possibility of such an error is reduced; in many cases IncidentException is thrown instead, even if the reason is not known by the system and it is unsure whether admin can solve the incident (see catch block of ARCLibDelegate#execute method).
  - BpmError: special type of RuntimeException: business errors that cause a cancellation of the ingest workflow process. They are thrown from the source code, e.g. if we want to finish Ingest workflow process if a virus is found or in case of duplicate SIP.
These exceptions are handled differently:
- BpmErrorHandlerDelegate: handles BpmErrors, triggers IngestErrorHandler
- CustomIncidentHandler: handler for exceptions not caught by BpmErrorHandlerDelegate. It differentiates between reconfigurable errors, which can be solved by the Workflow config or an admin side action (IncidentException and its subclasses), and non-reconfigurable errors. Reconfigurable errors cause creation of an Incident and non-reconfigurable errors trigger execution of IngestErrorHandler.
- IngestErrorHandler: not triggered by any exception, but by the programmer from the code, does the following:
  - assigns failure info to Ingest workflow (that can be later retrieved) and sets Ingest workflow processing state to FAILED
  - kills Ingest workflow process
  - deactivates AIP update lock
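The routing of exceptions to handlers can be modelled with a small Python sketch. The class names mirror the Java names used on this page, but the Python classes, the function name and the returned tuples are illustrative assumptions and do not reproduce the actual Java implementation.

```python
class IncidentException(Exception):
    """Error that may be solved by a changed Workflow config or admin action."""


class ConfigParserException(IncidentException):
    """The Workflow config could not be parsed."""


class CommandLineProcessException(IncidentException):
    """An external process (e.g. an antivirus binary) has failed."""


class BpmError(RuntimeError):
    """Business error that cancels the ingest workflow process."""


def route_exception(error: Exception) -> tuple:
    """Return (handler that deals with the error, resulting action)."""
    if isinstance(error, BpmError):
        return ("BpmErrorHandlerDelegate", "IngestErrorHandler")
    if isinstance(error, IncidentException):
        # Reconfigurable error: an Incident waits to be resolved from GUI.
        return ("CustomIncidentHandler", "Incident")
    # Non-reconfigurable error: the ingest workflow is marked as FAILED.
    return ("CustomIncidentHandler", "IngestErrorHandler")


if __name__ == "__main__":
    print(route_exception(ConfigParserException("corrupted config")))
    print(route_exception(BpmError("virus found")))
```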
Ingest workflow throughout the processing of SIP transitions between the following states: New, Processing, Processed / Failed, Persisted.
Ingest workflow state | Moment when IW comes to the state |
---|---|
NEW | when processing of SIP located at transfer area is initiated |
PROCESSING | when a free instance of Worker is assigned to process the SIP |
FAILED | when an unrecoverable error occurs (set in IngestErrorHandler ) |
PROCESSED | at the end of execution of ARCLibXml generator task, when the ARCLibXml has been stored to workspace |
PERSISTED | when the AIP has been persisted to Archival storage |