System Setup

Jan Tomášek edited this page Jan 21, 2025 · 52 revisions

Dependencies

Java

  • Java 17 OpenJDK

Database

  • download and install PostgreSQL 15 or later
  • create a user with a password (our test instance uses arclib:vuji61oilo)
  • create database with the created user as the owner

ClamAV

  • deactivate your local antivirus, otherwise it blocks the EICAR test file
  • download and install the ClamAV antivirus (latest should be OK, 0.100.2 is guaranteed)
  • create a database folder inside the ClamAV folder
  • copy freshclam.conf from the resources folder to the ClamAV folder
  • run freshclam.exe
  • add the clamscan command to the PATH variable
  • create a CLAMAV environment variable pointing to the ClamAV directory

DROID

  • download DROID (latest should be OK, 6.4 is guaranteed)
  • extract the archive with the DROID binary and add the directory containing the binary to PATH under the environment variable droid

Solr

  • download and install Solr 9.7.0
  • add the solr command to the PATH variable
  • port 8983 should be available for Solr
  • the GUI of the running Solr instance is available at host:8983/
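
The PATH requirements from the Java, ClamAV, DROID and Solr sections above can be checked with a small shell helper. This is a minimal sketch, assuming a POSIX shell; the function name missing_tools is hypothetical, and the name of the DROID launcher depends on the platform, so pass whichever command names apply to your installation.

```shell
#!/bin/sh
# Hypothetical helper: print every given command name that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call with the commands this page asks you to put on PATH.
# Add the DROID launcher name used on your platform.
missing_tools java clamscan solr
```

An empty output means all listed commands were found; any printed name still needs to be added to PATH.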

In order to use all features of the system, Solr should run in cloud mode. Use the -c switch to start Solr in cloud mode (solr start -c). In cloud mode, the embedded Zookeeper instance is started automatically with Solr, by default listening on port 9983. Cloud mode is currently used in the Report agenda (reports generated from the Solr datasource use the Solr JDBC interface, which is available only in cloud mode). For more details about running Solr in cloud mode see SolrCloud and Getting Started with SolrCloud.

LDAP

LDAP is optional. See the LDAP section of the application.yml file.

Yarn

  • Yarn is a package manager used for dependency management and build of GUI. Yarn requires Node.js to be installed. Follow the installation guide here.
  • The GUI repository is located here.
  • In the root of the cloned repository execute yarn to download dependencies and then yarn build-prod to build the GUI.

Apache

Apache httpd (or another HTTP server) is required to serve the build of the system GUI. After building with yarn, move the result (the content of the build folder) to the configured path on the server, e.g. /var/www/html.

The GUI build looks for the API at the address localhost/api. If the API runs on a different port, a proxy mapping should be added.

Sample httpd config:

<VirtualHost *:80>
        ServerName arclib.cz
        DocumentRoot /var/www/html

        ProxyPreserveHost On
        ProxyRequests off
        ProxyPass "/api" "http://127.0.0.1:8080/api"
        ProxyPassReverse "/api" "http://127.0.0.1:8080/api"
</VirtualHost>

Archival Storage System

Setup

Building of ARCLib

mvn clean package -DskipTests

Solr

Configsets and collections

ARCLib uses several self-managed collections to index DB records and one collection managed through schema.xml for the ARCLib XML index.

Configuration is provided as predefined configsets.

Contents of arclib-arclibXmlC-schema folder may change during development, see arclibXmlC schema update process on how to update the schema in Solr.

The arclib-managed configset folder is considered to be up to date with the recommended Solr major version, as it is close to the default config of that major version. All collections initialized from this configset run in managed mode, which means their schema is updated automatically when Solr consumes a document with new attributes.

If any of the collections breaks, you can always start from scratch and then reindex the data from their primary datasource using the Index section in the ARCLib GUI.

Creating the collections

NOTE: This guide assumes that the Solr instance is installed as a system service named solr with its working directory at /var/solr and that there is a privileged user account for Solr. The configset directory located in solr/server/solr should also contain the predefined configsets.

sudo su solr
solr delete -c arclibDomainC
solr delete -c formatC
solr delete -c ingestIssueC
solr delete -c arclibXmlC
solr create -c arclibDomainC -d arclib-managed
solr create -c formatC -d arclib-managed
solr create -c ingestIssueC -d arclib-managed
solr create -c arclibXmlC -d arclib-arclibXmlC-schema

arclibXmlC schema update process

Once the index schema or config of ARCLib XML changes, the changes must be uploaded to Solr, for example using the process below.

Prepare ARCLib service

  1. Create the ARCLib service (sample systemd config file below)
[Unit]
Description=ARCLib
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/opt/arclib

User=arclib
Group=arclib

ExecStart=/usr/bin/java -Xms2g -Xmx6512m -jar /opt/arclib/arclib.jar

StandardOutput=journal
StandardError=inherit

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send SIGTERM only to the main JVM process rather than to its whole control group
KillMode=mixed

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
  2. Create the required folders:
  • the logs folder should be created one level up from the working directory (/opt/arclib/../logs with respect to the sample systemd file above)
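
The folder location described above can be derived from the working directory rather than typed by hand. The following is a minimal sketch, assuming a POSIX shell; create_logs_dir is a hypothetical helper, and the chown step in the comment is an assumption about the permissions the service user needs.

```shell
#!/bin/sh
# Hypothetical helper: create the logs folder one level above the working directory.
create_logs_dir() {
    workdir=$1
    # dirname yields the parent of the working directory, e.g. /opt for /opt/arclib
    logs_dir="$(dirname "$workdir")/logs"
    mkdir -p "$logs_dir" && printf '%s\n' "$logs_dir"
}

# With the sample unit file this would be run (as a privileged user) as:
#   create_logs_dir /opt/arclib      # creates /opt/logs
#   chown arclib:arclib /opt/logs    # assumption: the service user must be able to write logs
```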

First ARCLib run

Generating DB

  1. On the ARCLib server execute the following (replace {name of user} with the name of the user):
sudo -u postgres psql postgres
drop database arclib;
create database arclib owner arclib;
\q
sudo cp /home/{name of user}/arclib.jar /opt/arclib/arclib.jar
  2. Put these two lines into your application.yml file (the one containing DB user credentials etc.). If you use application.properties, use = instead of : to assign a value.
spring.jpa.hibernate.ddl-auto: none
spring.liquibase.enabled: false
  3. Run the application (sudo systemctl start arclib). The application should create the Camunda BPM tables and then fail, printing "relation ... does not exist" to the log.

  4. Remove the two lines added in step 2 and run the application again. This time the application should not fail; it should create the ARCLib tables and print "ARCLib instance started successfull" to the log.
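
Adding and later removing the two temporary lines can be scripted so that the rest of the configuration file stays untouched. This is a minimal sketch, assuming a POSIX shell and an external application.yml that uses the flat dotted keys shown above; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append the two temporary lines used for the very first start.
add_first_run_flags() {
    printf '%s\n' \
        'spring.jpa.hibernate.ddl-auto: none' \
        'spring.liquibase.enabled: false' >> "$1"
}

# Hypothetical helper: drop exactly those two lines again, keeping everything else.
remove_first_run_flags() {
    # grep -v exits with 1 when nothing is left to print; only a real error (>1) aborts
    grep -v -e '^spring\.jpa\.hibernate\.ddl-auto: none$' \
            -e '^spring\.liquibase\.enabled: false$' "$1" > "$1.tmp" \
        || [ "$?" -eq 1 ] || return 1
    mv "$1.tmp" "$1"
}

# Typical sequence (paths are placeholders):
#   add_first_run_flags /opt/arclib/application.yml
#   sudo systemctl start arclib      # expected to fail once, see the steps above
#   remove_first_run_flags /opt/arclib/application.yml
#   sudo systemctl start arclib
```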

During startup the system performs the following actions:

  • checks whether the transfer areas for all producers are available (the system checks this on every startup)
  • schedules the automatic updates of the format library from the PRONOM server (the system checks this on every startup)
  • fills the following arclib_ tables with default system records:
    • arclib_user_role
    • arclib_role_permission
    • arclib_ingest_issue_definition
    • arclib_tool
    • arclib_tool_ingest_issue_definition

Sample data

If you want to run the sample ingest described in Instructions for sample ingest, you should also insert the test profiles and assign the inserted producer to the user running the ingest (Users section in the ARCLib UI).

You should also initialize the format library. You can use the format library init script in production as well, to set up the system faster. Without it you have to download the whole library (Preservation Planning GUI); with the init script you only need to download the latest updates.

After these steps you should trigger a reindex of the core and the format library in the Index section of the ARCLib UI.

Creating root superadmin user

  • Replace 'admin' in the SQL below with your LDAP username if using LDAP authentication.
  • Set producer_id to null in the SQL below if you have not imported the sample data as described in the previous section.
  • If using LDAP authentication, the login password is your LDAP password; otherwise the password of the user imported by the SQL below is 'admin'.
INSERT INTO arclib_user
(id, username, created, updated, producer_id, "password")
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'admin', NOW(),  NOW(), 'aa7ddcc5-5b81-4747-bfeb-1850d952a359',  '$2a$10$1dHqt7rZG6a7MHi1j5nH1uhlPG8/gORfl8HrVf3Tg0bxFFXi76nnC');

INSERT INTO arclib_assigned_user_role
(arclib_user_id, arclib_role_id)
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'b7a43ad5-883f-4741-948b-08678fa38604');

Initializing file format index

When deploying the application for the first time, the file format database should be updated. An admin can do that in the Formats section of the GUI.

Custom Settings

Application.yml

Application.yml, located at system/src/main/resources, is the main configuration file of ARCLib. Below is a list of the most important configuration parameters together with their preconfigured values. The parameters are grouped in sections; when configuring the application, you have to prefix each parameter with the header of its section, e.g. not just url but spring.datasource.url, and not just aipSavedCheckAttempts but arclib.aipSavedCheckAttempts.

The file is packed in the .jar file. You can override any of the values by creating another configuration file (application.yml or application.properties) and placing it in the same folder as the application's .jar file on the server. This is useful e.g. for configuring sensitive information.
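
As an illustration of the prefixing rule and of the external override file, the sketch below writes an application.properties next to the jar. It assumes a POSIX shell; write_override is a hypothetical helper, and the three keys are simply parameters from this page written with their section prefixes and their preconfigured values.

```shell
#!/bin/sh
# Hypothetical helper: write an override file into the folder that holds arclib.jar.
write_override() {
    target_dir=$1
    # Every key carries the header of its section as a prefix (solr., arclib., logging.)
    cat > "$target_dir/application.properties" <<'EOF'
solr.endpoint=http://localhost:8983/solr
arclib.aipSavedCheckAttempts=10
logging.level.org.apache.solr=WARN
EOF
}

# With the sample systemd unit the jar lives in /opt/arclib:
#   write_override /opt/arclib
```

Only the values listed in the override file replace the packaged ones; everything else keeps the value from the application.yml inside the jar.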

Database

spring.datasource

  • jdbcUrl: JDBC URL
  • username: DB user
  • password: DB password

Solr

solr

  • endpoint: http://localhost:8983/solr

spring

  • spring-arclib-xml-datasource.url: JDBC URL of the arclibXmlC collection, e.g. jdbc:solr://localhost:9983/?collection=arclibXmlC
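
The example URL above is made up of the Zookeeper host and port (9983 by default in cloud mode) and the collection name. A minimal sketch of assembling it, assuming a POSIX shell; solr_jdbc_url is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: build the Solr JDBC URL from Zookeeper host, port and collection.
solr_jdbc_url() {
    zk_host=$1
    zk_port=$2
    collection=$3
    printf 'jdbc:solr://%s:%s/?collection=%s\n' "$zk_host" "$zk_port" "$collection"
}

# Reproduces the example value given on this page.
solr_jdbc_url localhost 9983 arclibXmlC
```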

Archival storage

archivalStorage

  • api: http://localhost:8081/api

archivalStorage.authorization.basic - fill the right values in the configuration file on the server

  • read: base64encode(arclib-read:password)
  • readWrite: base64encode(arclib-read-write:password)
  • admin: base64encode(admin:password)
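
The base64encode(user:password) values above can be produced on the command line. This is a minimal sketch, assuming a POSIX shell with the base64 utility available; basic_token is a hypothetical helper and 'password' is a placeholder for the real password.

```shell
#!/bin/sh
# Hypothetical helper: Base64-encode "user:password" for the authorization settings.
basic_token() {
    # printf is used instead of echo so that no trailing newline gets encoded
    printf '%s:%s' "$1" "$2" | base64
}

# Placeholder password; use the real one configured in the Archival Storage.
basic_token arclib-read 'password'
```

The printed string is what goes into the read, readWrite or admin parameter for the respective account.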

Mail

spring.mail

  • host: smtp.gmail.com
  • port: 465
  • username: email
  • password: email password
  • protocol: smtp

mail.sender

  • email: email

Security

security.basic

  • authQuery: /api/user/login - endpoint for authentication

security.jwt

  • expiration: 900000 - expiration time in seconds for the JWT token
  • secret: somestring - secret key used in JWT token

security.local

  • enabled: true - enables local authentication, passwords stored in DB

security.ldap

  • enabled: false - enables LDAP authentication
  • server: ldap://x.x.x.x:port
  • startTls: true - whether StartTLS encryption is used
  • bind.dn: cn=systemAccountUsername,dc=libj,dc=cas,dc=cz - dn used by the ARCLib system to search the LDAP server for users when they log in
  • bind.pwd: systemAccountPassword
  • user.type: filter
  • user.search-base: ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz - where to search for users of the ARCLib system
  • user.filter: (cn={0}) - attribute of the ldap entry which is used as a short username of the user's dn

Example: A user logs in with the username bob. The ARCLib system logs in to the LDAP server with the system account and searches at ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz for an entry with cn=bob. Bob's full LDAP dn is: cn=bob,ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz
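
The substitution of the username into user.filter can be illustrated as follows. This is only a sketch of the idea, assuming a POSIX shell; ldap_filter_for is a hypothetical helper and does not show how ARCLib performs the lookup internally. The commented ldapsearch call assumes the OpenLDAP client tools are installed.

```shell
#!/bin/sh
# Hypothetical helper: replace the {0} placeholder of user.filter with a username.
ldap_filter_for() {
    template=$1
    username=$2
    placeholder='{0}'
    prefix=${template%%"$placeholder"*}   # text before {0}
    suffix=${template#*"$placeholder"}    # text after {0}
    printf '%s%s%s\n' "$prefix" "$username" "$suffix"
}

ldap_filter_for '(cn={0})' bob

# Manual check against the server with the OpenLDAP client (placeholders as on this page):
#   ldapsearch -H ldap://x.x.x.x:port -ZZ \
#     -D 'cn=systemAccountUsername,dc=libj,dc=cas,dc=cz' -W \
#     -b 'ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz' "$(ldap_filter_for '(cn={0})' bob)"
```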

Reingest

arclib.reingest

  • transferAreaKeepFreeMb: workspace - minimum space to leave free in transfer area by the reingest
  • workspaceKeepFreeMb: workspace/quarantine - minimum space to leave free in workspace by the reingest
  • sharedStorage: true - true if the workspace and the transfer area share space (on the disk)
  • exportCron: "0 * * ? * *" - period of the reingest job

Paths to folders

arclib.path

  • workspace: workspace
  • quarantine: workspace/quarantine
  • fileStorage: fileStorage - the transfer area for storing SIP packages before ingestion
  • reports: exportedReports - path of reports stored in file system

spring.servlet.multipart

  • location: ${user.dir}/multipart_tmp - directory used by spring as temporary storage for uploaded files - must be fully accessible by user running the ARCLib service

Paths to the XML schemas

arclib

  • arclibXmlSchema: classpath:xmlSchemas/arclibXml.xsd
  • metsSchema: classpath:xmlSchemas/mets.xsd
  • premisSchema: classpath:xmlSchemas/premis-v2-2.xsd
  • sipProfileSchema: classpath:xmlSchemas/sipProfile.xsd
  • validationProfileSchema: classpath:xmlSchemas/validationProfile.xsd

Path to scripts

arclib.script

  • ingestRoutineBatchStart: classpath:scripts/batchStart.groovy - groovy script for starting a batch (used in the Ingest routines)
  • export: classpath:scripts/export.groovy - groovy script for export
  • keepAliveUpdate: classpath:scripts/keepAliveUpdate.groovy - groovy script that checks whether ARCLibXml is being actively edited at the frontend and keeps the edit lock on the authorial package alive
  • formatsRevisionNotification: classpath:scripts/formatsRevisionNotification.groovy - groovy script for notifications about necessary revisions of formats
  • reportNotification: classpath:scripts/reportNotification.groovy - groovy script for notifications about generated reports

Namespaces

namespaces

  • mets: http://www.loc.gov/METS/
  • arclib: http://arclib.lib.cas.cz/ARCLIB_XSD
  • premis: info:lc/xmlns/premis-v2
  • oai_dc: http://www.openarchives.org/OAI/2.0/oai_dc/
  • dc: http://purl.org/dc/elements/1.1/
  • xsi: http://www.w3.org/2001/XMLSchema-instance

Pronom

formatLibrary

  • formatDetailListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatDetailListAction.aspx - URL to fetch PRONOM format detail
  • formatListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatListAction.aspx - URL to fetch PRONOM list of formats
  • scheduleUpdate: false - turns off/on scheduled updates
  • updateCron: 0 0 0 1 * ? - periodicity of scheduled updates from PRONOM

Camunda

camunda.bpm.job-execution

  • lockTimeInMillis: 300000 - timeout of a single BPM task, see application.yml. Because ClamAV runs slowly on large SIPs, we have increased the timeout on the tst server to 1800000. A message indicating the timeout may look like this: update or delete on table "act_ge_bytearray" violates foreign key constraint "act_fk_job_exception" on table "act_ru_job". Before increasing this timeout, try lowering the load on the server (lower the JMS pool size or the Camunda job-execution pool size). If a workflow contains a task which runs for too long even when the server is not heavily loaded, consider executing it outside of the engine and notifying the engine with a message once the task is done.
  • corePoolSize: 3 - core pool size of bpm engine threads. We have reduced this value to 2 on the tst server to reduce the possibility of multiple simultaneous ClamAV processes fighting for resources.
  • maxPoolSize: 10 - max pool size of bpm engine threads. We have reduced this value to 2.

Logging

logging

  • path: ../logs - path to the folder containing logs
  • level.* - allows specification of logger levels of any class or package, e.g. logging.level.org.apache.solr: WARN

Other parameters

arclib

  • arclibXmlIndexConfig: classpath:index/arclibXmlIndexConfig.csv - config file used during indexation
  • arclibXmlSystemWideValidationConfig: classpath:arclibXmlSystemWideValidationConfig.csv - config file used during validations in ARCLibXML extractor and ARCLibXML generator tasks
  • deleteSipFromTransferArea: true - if true, the SIP package is deleted from the transfer area after the ingest workflow has finished successfully
  • aipStoreAttempts: 3 - number of attempts to store AIP to archival storage
  • aipStoreAttemptsInterval: PT5M - interval to wait between attempts to store AIP to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
  • aipSavedCheckAttempts: 10 - number of attempts to check that AIP was stored successfully to archival storage
  • aipSavedCheckAttemptsInterval: PT1M - interval to wait between attempts to check that AIP was stored successfully to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
  • keepAliveUpdateTimeout: 10 - minimal interval (in seconds) in which the frontend needs to call the keep alive endpoint during the editing of ARCLibXml
  • keepAliveNetworkDelay: 2 - estimated network delay (in seconds) used during validation of the edit lock during the editing of ARCLibXml. If the real delay is much higher, the user may see an error message while editing ARCLib XML. The error message explicitly mentions that increasing this property may be a solution if the problem persists.
  • externalProcess.timeout.sigterm: 7200 - timeout (in seconds) for any external process run by the application, e.g. ClamAV or DROID. If this timeout is reached and the process is still running, a termination signal is sent.
  • externalProcess.timeout.sigkill: 1800 - timeout (in seconds) to wait for the process to shut down gracefully after receiving the termination signal. If this timeout is reached and the process is still running, a kill signal is sent.
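
The two-stage behaviour of the externalProcess timeouts (terminate first, kill only if the process does not exit) can be tried out on the command line with the GNU coreutils timeout command. This sketch only illustrates the pattern and is not how ARCLib implements it; run_with_timeouts is a hypothetical wrapper and the clamscan arguments in the comment are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: send SIGTERM after $1 seconds and SIGKILL if the
# process is still alive $2 seconds after that. Requires GNU coreutils timeout.
run_with_timeouts() {
    term_after=$1
    kill_after=$2
    shift 2
    timeout --signal=TERM --kill-after="$kill_after" "$term_after" "$@"
}

# With the preconfigured values from the list above:
#   run_with_timeouts 7200 1800 clamscan -r /path/to/sip
```

GNU timeout reports a reached time limit through its exit status, which makes it easy to tell a slow scan from a failed one when experimenting with the values.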

Tests

Tests wiki section is obsolete


Tests use the system/src/test/resources/application.properties file as the configuration file for the test environment. Properties specified there override the properties listed in system/src/main/resources/application.yml.

Tests require a Solr instance running in cloud mode with 2 additional collections. The names of the required test collections are specified in the application.properties file:

solr.arclibxml.corename=arclibXmlTestC
solr.test.corename=testC

Create these collections analogously to the other collections used by the system:

solr create -c testC -d arclib-managed
solr create -c arclibXmlTestC -d arclib-arclibXmlC-schema