
System Setup



Dependencies

Java

  • Java 17 OpenJDK

Database

  • download and install PostgreSQL 15 or later
  • create a user with a password (our test instance uses arclib:vuji61oilo)
  • create a database with the created user as the owner (see the sketch below)
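
A minimal sketch of these two steps in psql, using the test-instance credentials mentioned above (adjust the names and password for your environment):

sudo -u postgres psql
-- the credentials below are the test-instance values from above; replace them in production
create user arclib with password 'vuji61oilo';
create database arclib owner arclib;
\q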

ClamAV

  • deactivate your local antivirus, otherwise it will block the EICAR test file
  • download and install the ClamAV antivirus (the latest version should work; 0.100.2 is guaranteed)
  • create a database folder inside the ClamAV folder
  • copy freshclam.conf from the resources folder to the ClamAV folder
  • run freshclam.exe
  • add the clamscan command to the PATH variable
  • create a CLAMAV environment variable pointing to the ClamAV directory (see the sketch below)
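
A sketch of the steps above on a Linux-style shell; the /opt/clamav install path is an assumption, and on Windows you would run freshclam.exe and set the variables in the system settings instead:

sudo mkdir /opt/clamav/database                      # the database folder from the steps above
sudo cp resources/freshclam.conf /opt/clamav/        # freshclam.conf taken from the resources folder
freshclam --config-file=/opt/clamav/freshclam.conf   # download virus definitions
export PATH="$PATH:/opt/clamav/bin"                  # make the clamscan command available on PATH
export CLAMAV=/opt/clamav                            # CLAMAV variable pointing to the ClamAV directory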

DROID

  • download DROID (the latest version should work; 6.4 is guaranteed)
  • extract the archive with the DROID binary and add the path to the directory containing the binary to PATH, under the environment variable droid (see the sketch below)
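
A sketch of the same steps in a shell; the archive name and the /opt/droid target directory are assumptions:

unzip droid-binary-6.4-bin.zip -d /opt/droid    # extract the archive with the DROID binary
export PATH="$PATH:/opt/droid"                  # add the directory with the binary to PATH
export droid=/opt/droid                         # the droid environment variable from the step above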

Solr

  • download and install Solr 9.7.0
  • add the solr command to the PATH variable
  • port 8983 should be available for Solr
  • the GUI of the running Solr instance is available at host:8983/

To use all features of the system, Solr should run in cloud mode. Use the -c switch to start Solr in cloud mode (solr start -c; see the sketch below). In cloud mode the embedded ZooKeeper instance is automatically started with Solr, by default listening on port 9983. Cloud mode is currently utilized in the Report agenda (reports generated from the Solr datasource use the Solr JDBC interface, which is available only in cloud mode). For more details about running Solr in cloud mode see SolrCloud and Getting Started with SolrCloud.
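
A sketch of starting and checking Solr in cloud mode with the standard solr CLI:

solr start -c     # starts Solr on port 8983 with the embedded ZooKeeper on port 9983
solr status       # verify that the instance is running
# the admin GUI is then available at http://host:8983/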

LDAP

LDAP is optional. See the LDAP section of the application.yml file.

Yarn

  • Yarn is a package manager used for dependency management and building the GUI. Yarn requires Node.js to be installed. Follow the installation guide here.
  • The GUI repository is located here.
  • In the root of the cloned repository execute yarn to download dependencies and then yarn build-prod to build the GUI (see the sketch below).
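
A sketch of the build, assuming the GUI repository linked above has been cloned (the URL placeholder must be replaced with the real repository address):

git clone <gui-repo-url>    # the GUI repository linked above
cd <cloned-repository>
yarn                        # download dependencies
yarn build-prod             # build the GUI into the build folder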

Apache

Apache httpd (or another HTTP server) is required to serve the build of the system GUI. After building with yarn, move the result - the contents of the build folder - to the configured path on the server, e.g. /var/www/html (see the sketch below).
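
A sketch of the deployment step, assuming the paths from the example above:

sudo cp -r build/* /var/www/html/    # copy the contents of the build folder to the document root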

The GUI build looks for the API at localhost/api; if the API runs on a different port, a proxy mapping should be added.

Sample httpd config:

<VirtualHost *:80>
        ServerName arclib.cz
        DocumentRoot /var/www/html

        ProxyPreserveHost On
        ProxyRequests off
        ProxyPass "/api" "http://127.0.0.1:8080/api"
        ProxyPassReverse "/api" "http://127.0.0.1:8080/api"
</VirtualHost>

Archival Storage System

Setup

Building ARCLib

mvn clean package -DskipTests

Solr

Configsets and collections

ARCLib uses several self-managed collections to index DB records and one schema.xml-managed collection used for the ARCLib XML index.

Configuration is provided as predefined configsets.

The contents of the arclib-arclibXmlC-schema folder may change during development; see the arclibXmlC schema update process on how to update the schema in Solr.

The arclib-managed configset folder is considered up to date with the recommended Solr major version, as it is close to the default config of that major version. All collections initialized from this configset run in managed mode, which means their schema is updated automatically when Solr consumes a document with new attributes.

If any of the collections break, you can always start from scratch and then reindex the data from their primary datasource using the Index section in the ARCLib GUI.

Creating the collections

NOTE: This guide assumes that the Solr instance is installed as a system service named solr with its working directory at /var/solr and that there is a privileged user account for Solr. The configset directory located in solr/server/solr should also contain the predefined configsets.

sudo su solr
solr delete -c arclibDomainC
solr delete -c formatC
solr delete -c ingestIssueC
solr delete -c arclibXmlC
solr create -c arclibDomainC -d arclib-managed
solr create -c formatC -d arclib-managed
solr create -c ingestIssueC -d arclib-managed
solr create -c arclibXmlC -d arclib-arclibXmlC-schema

arclibXmlC schema update process

Once the index schema or config of ARCLib XML changes, the changes must be uploaded to Solr, for example using the process below.
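
One possible way to apply an updated configset is to recreate the collection with the commands from Creating the collections and then reindex from the Index section of the ARCLib GUI (a sketch, assuming the same Solr service setup):

sudo su solr
solr delete -c arclibXmlC
solr create -c arclibXmlC -d arclib-arclibXmlC-schema
# afterwards, reindex ARCLib XML from the Index section in the ARCLib GUI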

Prepare ARCLib service

  1. Create the ARCLib service (sample systemd config file below)
[Unit]
Description=ARCLib
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/opt/arclib

User=arclib
Group=arclib

ExecStart=/usr/bin/java -jar -Xms2g -Xmx6512m /opt/arclib/arclib.jar

StandardOutput=journal
StandardError=inherit

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=mixed

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
  2. Create required folders:
  • a logs folder should be created one level up from the working directory (/opt/arclib/../logs with respect to the sample systemd file above)

First ARCLib run

Generating DB

  1. On the ARCLib server execute the following (replace {name of user} with the name of the user):
sudo -u postgres psql postgres
drop database arclib;
create database arclib owner arclib;
\q
sudo cp /home/{name of user}/arclib.jar /opt/arclib/arclib.jar
  2. Put these two lines into your application.yml file (the one containing DB user credentials etc.). If you use application.properties, then use = instead of : to assign values.
spring.jpa.hibernate.ddl-auto: none
spring.liquibase.enabled: false
  3. Run the application (sudo systemctl start arclib) - the application should create the Camunda BPM tables and then fail, printing "relation ... does not exist" to the log.

  4. Remove the two lines added in step 2 and run the application again - the application should not fail this time; it should create the ARCLib tables and print "ARCLib instance started successfull" to the log.

During startup the system performs the following actions:

  • checks whether the transfer areas for all producers are available (the system checks this on every startup)
  • schedules the automatic updates of the format library using the PRONOM server (the system does this on every startup)
  • fills the following arclib_ tables with default system records:
    • arclib_user_role
    • arclib_role_permission
    • arclib_ingest_issue_definition
    • arclib_tool
    • arclib_tool_ingest_issue_definition

Sample data

If you want to run the sample ingest described in Instructions for sample ingest, you should also insert the test profiles and assign the inserted producer to the user running the ingest (Users section in ARCLib UI).

You should also initialize the format library. You can use the format library init script in production as well, just to set up the system faster - otherwise you will have to download the whole library (Preservation Planning GUI). If you use the init script, you only need to download the latest updates.

After these steps you should trigger a reindex of the core and the format library in the Index section of the ARCLib UI.

Creating root superadmin user

  • Replace 'admin' in the SQL below with your LDAP username if using LDAP authentication.
  • Set producer_id to null in the SQL below if you have not imported the sample data described in the previous section.
  • If using LDAP auth, the login password is your LDAP password; otherwise the password of the user imported by the SQL below is 'admin'.
INSERT INTO arclib_user
(id, username, created, updated, producer_id, "password")
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'admin', NOW(),  NOW(), 'aa7ddcc5-5b81-4747-bfeb-1850d952a359',  '$2a$10$1dHqt7rZG6a7MHi1j5nH1uhlPG8/gORfl8HrVf3Tg0bxFFXi76nnC');

INSERT INTO arclib_assigned_user_role
(arclib_user_id, arclib_role_id)
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'b7a43ad5-883f-4741-948b-08678fa38604');

Initializing file format index

When deploying the application for the first time, the file format database should be updated. An admin can do that in the Formats section of the GUI.

Custom Settings

Application.yml

Application.yml located at system/src/main/resources is the main configuration file of ARCLib. Below is a list of the most important configuration parameters together with their preconfigured values. The parameters below are grouped into sections; when configuring the application, you have to prefix each parameter with the header of its section, e.g. not just url but spring.datasource.url, and not just aipSavedCheckAttempts but arclib.aipSavedCheckAttempts.

The file is packed in the .jar file. You can override the values you want by creating another configuration file (application.yml/application.properties) and placing it in the same folder as the application's .jar file on the server. This is useful e.g. for configuring sensitive information (see the sketch below).
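
A minimal sketch of such an override file placed next to the .jar; the JDBC URL and secret are placeholders, and the parameter names follow the sections below:

spring:
  datasource:
    jdbcUrl: jdbc:postgresql://localhost:5432/arclib   # placeholder JDBC URL
    username: arclib
    password: vuji61oilo                               # test-instance password from the Database section
security:
  jwt:
    secret: change-me                                  # placeholder JWT secret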

Database

spring.datasource

  • jdbcUrl: JDBC URL
  • username: DB user
  • password: DB password

Solr

solr

  • endpoint: http://localhost:8983/solr

spring

  • spring-arclib-xml-datasource.url: JDBC URL of the arclibXmlC collection, e.g. jdbc:solr://localhost:9983/?collection=arclibXmlC

Archival storage

archivalStorage

  • api: http://localhost:8081/api

archivalStorage.authorization.basic - fill in the right values in the configuration file on the server (see the sketch after this list)

  • read: base64encode(arclib-read:password)
  • readWrite: base64encode(arclib-read-write:password)
  • admin: base64encode(admin:password)
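
A sketch of producing the base64 values on a Linux shell (the credentials are placeholders):

echo -n 'arclib-read:password' | base64          # value for read
echo -n 'arclib-read-write:password' | base64    # value for readWrite
echo -n 'admin:password' | base64                # value for admin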

Mail

spring.mail

  • host: smtp.gmail.com
  • port: 465
  • username: email
  • password: email password
  • protocol: smtp

mail.sender

  • email: email

Security

security.basic

  • authQuery: /api/user/login - endpoint for authentication

security.jwt

  • expiration: 900000 - expiration time in seconds for the JWT token
  • secret: somestring - secret key used in JWT token

security.local

  • enabled: true - enables local authentication, passwords stored in DB

security.ldap

  • enabled: false - enables LDAP authentication
  • server: ldap://x.x.x.x:port
  • startTls: true - whether STARTTLS encryption is used
  • bind.dn: cn=systemAccountUsername,dc=libj,dc=cas,dc=cz - DN used by the ARCLib system to search the LDAP server for users when they log in
  • bind.pwd: systemAccountPassword
  • user.type: filter
  • user.search-base: ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz - where to search for users of the ARCLib system
  • user.filter: (cn={0}) - filter on the attribute of the LDAP entry that is used as the short username within the user's DN

Example: A user logs in with the username bob. The ARCLib system logs in to the LDAP server with the system account and searches at ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz for an entry with cn=bob. Bob's full LDAP DN is cn=bob,ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz. A consolidated sketch of this configuration follows.
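
A consolidated sketch of the security.ldap section of application.yml, using the example values listed above; the server address, port and bind password are placeholders:

security:
  ldap:
    enabled: true
    server: ldap://x.x.x.x:389                          # placeholder address and port
    startTls: true
    bind:
      dn: cn=systemAccountUsername,dc=libj,dc=cas,dc=cz
      pwd: systemAccountPassword                        # placeholder
    user:
      type: filter
      search-base: ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz
      filter: "(cn={0})"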

Paths to folders

arclib.path

  • workspace: workspace
  • quarantine: workspace/quarantine
  • fileStorage: fileStorage - this is the transfer area for storing SIP packages before ingestion
  • reports: exportedReports - path where reports are stored in the file system

spring.servlet.multipart

  • location: ${user.dir}/multipart_tmp - directory used by Spring as temporary storage for uploaded files - must be fully accessible by the user running the ARCLib service

Paths to the XML schemas

arclib

  • arclibXmlSchema: classpath:xmlSchemas/arclibXml.xsd
  • metsSchema: classpath:xmlSchemas/mets.xsd
  • premisSchema: classpath:xmlSchemas/premis-v2-2.xsd
  • sipProfileSchema: classpath:xmlSchemas/sipProfile.xsd
  • validationProfileSchema: classpath:xmlSchemas/validationProfile.xsd

Path to scripts

arclib.script

  • ingestRoutineBatchStart: classpath:scripts/batchStart.groovy - Groovy script for starting a batch (used in the Ingest routines)
  • export: classpath:scripts/export.groovy - Groovy script for export
  • keepAliveUpdate: classpath:scripts/keepAliveUpdate.groovy - Groovy script for checking active editing of ARCLib XML at the frontend and keeping the edit lock on the authorial package alive
  • formatsRevisionNotification: classpath:scripts/formatsRevisionNotification.groovy - Groovy script for notifications about necessary revisions of formats
  • reportNotification: classpath:scripts/reportNotification.groovy - Groovy script for notifications about generated reports

Namespaces

namespaces

  • mets: http://www.loc.gov/METS/
  • arclib: http://arclib.lib.cas.cz/ARCLIB_XSD
  • premis: info:lc/xmlns/premis-v2
  • oai_dc: http://www.openarchives.org/OAI/2.0/oai_dc/
  • dc: http://purl.org/dc/elements/1.1/
  • xsi: http://www.w3.org/2001/XMLSchema-instance

Pronom

formatLibrary

  • formatDetailListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatDetailListAction.aspx - URL to fetch PRONOM format detail
  • formatListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatListAction.aspx - URL to fetch PRONOM list of formats
  • scheduleUpdate: false - turns off/on scheduled updates
  • updateCron: 0 0 0 1 * ? - periodicity of scheduled updates from PRONOM

Camunda

camunda.bpm.job-execution

  • lockTimeInMillis: 300000 - timeout of a single BPM task, see application.yml. Because of slow ClamAV execution on large SIPs we have increased the timeout on the tst server to 1800000. A message indicating the timeouts may look like this: update or delete on table "act_ge_bytearray" violates foreign key constraint "act_fk_job_exception" on table "act_ru_job". Before increasing this timeout, try to lower the load of the server (lower the JMS pool size or the Camunda job-execution pool size). If some workflow contains a task which runs for too long even when the server is not heavily loaded, consider executing it outside of the engine and notifying the engine with a message once the task is done.
  • corePoolSize: 3 - core pool size of BPM engine threads. We have reduced this value to 2 on the tst server to reduce the possibility of multiple simultaneous ClamAV processes fighting for resources.
  • maxPoolSize: 10 - max pool size of BPM engine threads. We have reduced this value to 2 (see the sketch below).
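
A sketch of overriding these values in the server-side application.yml, using the tst-server values mentioned above:

camunda:
  bpm:
    job-execution:
      lockTimeInMillis: 1800000   # increased because of slow ClamAV runs on large SIPs
      corePoolSize: 2             # reduced on the tst server
      maxPoolSize: 2              # reduced on the tst server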

Logging

logging

  • path: ../logs - path to the folder containing logs
  • level.* - allows specification of logger levels of any class or package, e.g. logging.level.org.apache.solr: WARN

Other parameters

arclib

  • arclibXmlIndexConfig: classpath:index/arclibXmlIndexConfig.csv - config file used during indexation
  • arclibXmlSystemWideValidationConfig: classpath:arclibXmlSystemWideValidationConfig.csv - config file used during validations in ARCLibXML extractor and ARCLibXML generator tasks
  • deleteSipFromTransferArea: true - if true, the SIP package is deleted from the transfer area after the ingest workflow finishes successfully
  • aipStoreAttempts: 3 - number of attempts to store AIP to archival storage
  • aipStoreAttemptsInterval: PT5M - interval to wait between attempts to store AIP to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
  • aipSavedCheckAttempts: 10 - number of attempts to check that AIP was stored successfully to archival storage
  • aipSavedCheckAttemptsInterval: PT1M - interval to wait between attempts to check that the AIP was stored successfully to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
  • keepAliveUpdateTimeout: 10 - minimal interval (in seconds) in which the frontend needs to call the keep-alive endpoint during the editing of ARCLib XML
  • keepAliveNetworkDelay: 2 - estimated network delay (in seconds) used during validation of the edit lock during the editing of ARCLib XML. If the real delay is much higher, the user may see an error message during ARCLib XML editing. The error message explicitly mentions that increasing this property may be a solution if the problem persists.
  • externalProcess.timeout.sigterm: 7200 - timeout (in seconds) for any external process run by the application, e.g. ClamAV, Droid; if this timeout is reached and the process is still running, a termination signal is fired
  • externalProcess.timeout.sigkill: 1800 - timeout (in seconds) to wait for the process to gracefully shut down after receiving the termination signal; if this timeout is reached and the process is still running, a kill signal is fired

Tests

Tests wiki section is obsolete


Tests use the system/src/test/resources/application.properties file as the configuration file for the test environment. The specified properties override the properties listed in system/src/main/resources/application.yml.

Tests require a Solr instance running in cloud mode with 2 additional collections. The names of the required test collections are specified in the application.properties file:

solr.arclibxml.corename=arclibXmlTestC
solr.test.corename=testC

Create these collections analogously to the other collections used by the system:

solr create -c testC -d arclib-managed
solr create -c arclibXmlTestC -d arclib-arclibXmlC-schema