System Setup
- Java 17 OpenJDK
- download and install PostgreSQL 15 or later
- create user with password (our test instance uses arclib:vuji61oilo)
- create database with the created user as the owner
- deactivate your local antivirus, otherwise it blocks the EICAR test file
- download and install ClamAV antivirus (latest should be OK, 0.100.2 is guaranteed)
- create database folder inside ClamAV folder
- copy freshclam.conf from resources folder to ClamAV folder
- run freshclam.exe
- add clamscan command to the PATH variables
- create CLAMAV environment variable pointing to CLAMAV directory
- download DROID (latest should be OK, 6.4 is guaranteed)
- extract archive with the DROID binary and add path to the directory with the binary to PATH under environment variable droid
- download and install Solr 9.7.0
- add solr command to the PATH variables
- port 8983 should be available for Solr
- see the GUI of the running instance of Solr at host:8983/
To use all features of the system, Solr should run in cloud mode. Use the -c switch to start Solr in cloud mode (solr start -c). In cloud mode, the embedded Zookeeper instance is started automatically with Solr, listening on port 9983 by default. Cloud mode is currently used by the Report agenda (reports generated from the Solr datasource use the Solr JDBC interface, which is available only in cloud mode). For more details about running Solr in cloud mode see SolrCloud and Getting Started with SolrCloud.
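Before starting Solr in cloud mode it can help to confirm that both ports mentioned above (8983 for Solr, 9983 for the embedded Zookeeper) are not already taken. The following is a minimal Python sketch, not part of ARCLib; the function name port_is_free and the choice of 127.0.0.1 as the host are assumptions made for illustration.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 only when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    # Port numbers are the defaults named in this guide
    for name, port in (("Solr", 8983), ("embedded Zookeeper", 9983)):
        state = "free" if port_is_free(port) else "already in use"
        print(f"{name} port {port}: {state}")
```

After solr start -c has been run, the same check should report both ports as in use.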
LDAP is optional. See the LDAP section of the application.yml file.
- Yarn is a package manager used for dependency management and build of GUI. Yarn requires Node.js to be installed. Follow the installation guide here.
- The GUI repository is located here.
- In the root of the cloned repository execute yarn to download dependencies and then yarn build-prod to build the GUI.
An Apache httpd (or other HTTP server) is required to host the build of the system GUI. After the yarn build, move its result (the content of the build folder) to the configured path on the server, e.g. /var/www/html.
The GUI build looks for the API at localhost/api. If the API runs on a different port, a proxy mapping must be added.
Sample httpd config:
<VirtualHost *:80>
ServerName arclib.cz
DocumentRoot /var/www/html
ProxyPreserveHost On
ProxyRequests off
ProxyPass "/api" "http://127.0.0.1:8080/api"
ProxyPassReverse "/api" "http://127.0.0.1:8080/api"
</VirtualHost>
mvn clean package -DskipTests
ARCLib uses several self-managed collections to index DB records and one schema.xml managed collection used for ARCLib XML index.
Configuration is provided as predefined configsets
Contents of arclib-arclibXmlC-schema folder may change during development, see arclibXmlC schema update process on how to update the schema in Solr.
The arclib-managed configset folder is considered to be up to date with recommended Solr major version, as it is close to the default config of that major version and all collections initialized from this configset run in managed mode, which means their schema is updated automatically when Solr consumes document with new attributes.
If any of the collections breaks, you can always start from scratch and then reindex the data from its primary datasource using the Index section in the ARCLib GUI.
NOTE: This guide assumes that the Solr instance is installed as a system service named solr with its working directory at /var/solr, and that there is a privileged user account for Solr. The configset directory located in solr/server/solr should also contain the predefined configsets.
sudo su solr
solr delete -c arclibDomainC
solr delete -c formatC
solr delete -c ingestIssueC
solr delete -c arclibXmlC
solr create -c arclibDomainC -d arclib-managed
solr create -c formatC -d arclib-managed
solr create -c ingestIssueC -d arclib-managed
solr create -c arclibXmlC -d arclib-arclibXmlC-schema
Once the index schema or config of ARCLib XML changes, the changes must be uploaded to Solr, for example using the process below.
- update the schema.xml and solrconfig.xml in the configset folder (arclib-arclibXmlC-schema)
- solr zk -cmd upconfig -n arclibXmlC -d /opt/solr/server/solr/configsets/arclib-arclibXmlC-schema/conf -z localhost:9983 -collection arclibXmlC
- curl --location --request POST 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=arclibXmlC'
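The reload step can also be scripted. The sketch below builds the same Collections API URL as the curl command above and sends it as a POST using only the Python standard library. The helper names reload_url and reload_collection are hypothetical, and the base URL is the default endpoint used in this guide.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def reload_url(collection: str, solr_base: str = "http://localhost:8983/solr") -> str:
    """Build the Collections API URL that reloads one collection."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{solr_base}/admin/collections?{query}"


def reload_collection(collection: str) -> int:
    """POST the reload request and return the HTTP status code."""
    request = Request(reload_url(collection), method="POST")
    with urlopen(request, timeout=30) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; call reload_collection("arclibXmlC") against a running Solr
    print(reload_url("arclibXmlC"))
```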
- Create ARCLib service (sample systemd config file below)
[Unit]
Description=ARCLib
Wants=network-online.target
After=network-online.target
[Service]
WorkingDirectory=/opt/arclib
User=arclib
Group=arclib
ExecStart=/usr/bin/java -jar -Xms2g -Xmx6512m /opt/arclib/arclib.jar
StandardOutput=journal
StandardError=inherit
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM
# Send the signal only to the JVM rather than its control group
KillMode=mixed
# Java process is never killed
SendSIGKILL=no
# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143
[Install]
WantedBy=multi-user.target
- Create required folders:
- logs folder should be created one level up from the working directory (/opt/arclib/../logs with respect to sample systemd file above)
- On the ARCLib server execute the following (replace {name of user} with the name of the user):
sudo -u postgres psql postgres
drop database arclib;
create database arclib owner arclib;
\q
sudo cp /home/{name of user}/arclib.jar /opt/arclib/arclib.jar
- Put these two lines into your application.yml file (the one containing DB user credentials etc). If you use application.properties, then use = instead of : to assign a variable.
spring.jpa.hibernate.ddl-auto: none
spring.liquibase.enabled: false
- Run the application (sudo systemctl start arclib) - the application should create Camunda BPM tables and then fail, printing: "relation ... does not exist" to the log.
- Remove the two lines added in step 2 and run the application again - the application should not fail this time, it should create ARCLib tables and print: "ARCLib instance started successfull" to the log.
During startup the system performs the following actions:
- checks whether the transfer areas for all producers are available (performed on every startup)
- schedules the automatic updates of the format library from the PRONOM server (performed on every startup)
- fills the following arclib_ tables with default system records:
- arclib_user_role
- arclib_role_permission
- arclib_ingest_issue_definition
- arclib_tool
- arclib_tool_ingest_issue_definition
If you want to run the sample ingest described in Instructions for sample ingest, you should also insert the test profiles and assign the inserted producer to the user running the ingest (Users section in ARCLib UI).
You should also initialize the format library. You can also use the format library init script in production to set up the system faster. Without it you have to download the whole library (Preservation Planning GUI); with the init script you only need to download the latest updates.
After these steps you should trigger a reindex of the core and format library in the Index section of the ARCLib UI.
- Replace 'admin' in the SQL below with your LDAP username if using LDAP authentication.
- Set producer_id to null in the SQL below if you have not imported the sample data as described above.
- If using LDAP authentication, the login password is your LDAP password; otherwise the password of the user imported by the SQL below is 'admin'.
INSERT INTO arclib_user
(id, username, created, updated, producer_id, "password")
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'admin', NOW(), NOW(), 'aa7ddcc5-5b81-4747-bfeb-1850d952a359', '$2a$10$1dHqt7rZG6a7MHi1j5nH1uhlPG8/gORfl8HrVf3Tg0bxFFXi76nnC');
INSERT INTO arclib_assigned_user_role
(arclib_user_id, arclib_role_id)
VALUES('962d8cb6-556a-4396-9c4a-2af00c3a17b9', 'b7a43ad5-883f-4741-948b-08678fa38604');
When deploying the application for the first time, the file format database should be updated. Admin can do that in Formats section of the GUI.
application.yml located at system/src/main/resources is the main configuration file of ARCLib. Below is the list of the most important configuration parameters together with the preconfigured values. The parameters below are grouped in sections. When configuring the application, you have to prefix the parameter with the header of its section, e.g. not just url but spring.datasource.url, not just aipSavedCheckAttempts but arclib.aipSavedCheckAttempts.
The file is packed in the .jar file. You can override any values by creating another configuration file (application.yml/application.properties) and placing it in the same folder as the application's .jar file on the server. This is useful e.g. for configuring sensitive information.
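To make the prefixing and override rules concrete, here is a small Python sketch that flattens nested sections into fully prefixed keys and then lets an external file win over the packed defaults. It only illustrates the naming scheme and is not how Spring Boot is implemented. The flatten helper, the JDBC URL and the password value are all hypothetical.

```python
def flatten(section: dict, prefix: str = "") -> dict:
    """Turn nested sections into dotted keys such as spring.datasource.url."""
    flat = {}
    for key, value in section.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Values packed inside the .jar (illustrative subset, URL is made up)
packed = {
    "spring": {"datasource": {"url": "jdbc:postgresql://localhost:5432/arclib"}},
    "arclib": {"aipSavedCheckAttempts": 10},
}
# Values from the file placed next to the .jar on the server (made up)
external = {
    "spring": {"datasource": {"password": "change-me"}},
    "arclib": {"aipSavedCheckAttempts": 20},
}

# Later entries win, so the external file overrides the packed defaults
effective = {**flatten(packed), **flatten(external)}

if __name__ == "__main__":
    for key in sorted(effective):
        print(f"{key}={effective[key]}")
```

The printed key=value lines are also the form the same settings would take in an application.properties file.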
spring.datasource
- jdbcUrl: JDBC URL
- username: DB user
- password: DB password
solr
- endpoint: http://localhost:8983/solr
spring
- spring-arclib-xml-datasource.url: JDBC URL of the arclibXmlC collection, e.g. jdbc:solr://localhost:9983/?collection=arclibXmlC
archivalStorage
- api: http://localhost:8081/api
archivalStorage.authorization.basic - fill the right values in the configuration file on the server
- read: base64encode(arclib-read:password)
- readWrite: base64encode(arclib-read-write:password)
- admin: base64encode(admin:password)
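The base64encode(...) notation above stands for the Base64 encoding of username:password. A minimal Python sketch for producing these values follows; the helper name basic_credentials is hypothetical, and 'password' is the placeholder from the list above, to be replaced with the real passwords.

```python
import base64


def basic_credentials(username: str, password: str) -> str:
    """Base64-encode 'username:password' as used for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Account names are taken from the list above; 'password' is a placeholder
    accounts = (("read", "arclib-read"), ("readWrite", "arclib-read-write"), ("admin", "admin"))
    for key, user in accounts:
        print(f"{key}: {basic_credentials(user, 'password')}")
```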
spring.mail
- host: smtp.gmail.com
- port: 465
- username: email
- password: email password
- protocol: smtp
mail.sender
- email: email
security.basic
- authQuery: /api/user/login - endpoint for authentication
security.jwt
- expiration: 900000 - expiration time in seconds for the JWT token
- secret: somestring - secret key used in JWT token
security.local
- enabled: true - enables local authentication, passwords stored in DB
security.ldap
- enabled: false - enables LDAP authentication
- server: ldap://x.x.x.x:port
- startTls: true - whether STARTTLS encryption is used
- bind.dn: cn=systemAccountUsername,dc=libj,dc=cas,dc=cz - DN used by the ARCLib system to search through the LDAP server for users when they log in
- bind.pwd: systemAccountPassword
- user.type: filter
- user.search-base: ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz - where to search for users of the ARCLib system
- user.filter: (cn={0}) - attribute of the LDAP entry which is used as a short username of the user's DN
Example: User logs in with username bob. The ARCLib system logs in to the LDAP server with the system account and searches at ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz for an entry with cn=bob. Bob's full LDAP DN is: cn=bob,ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz
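The way the {0} placeholder in user.filter is filled with the login name can be sketched in a few lines of Python. This is an illustration only: the actual lookup is performed by the application's LDAP integration, and the helper names below are hypothetical. The escaping follows the special characters listed in RFC 4515 so that a username cannot alter the filter.

```python
def escape_filter_value(value: str) -> str:
    """Escape characters that have a special meaning in LDAP search filters."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def build_user_filter(template: str, username: str) -> str:
    """Substitute the escaped username for the {0} placeholder."""
    return template.replace("{0}", escape_filter_value(username))


if __name__ == "__main__":
    search_base = "ou=users,ou=Archive,o=KNAV,dc=libj,dc=cas,dc=cz"
    ldap_filter = build_user_filter("(cn={0})", "bob")
    print(f"search {search_base} with filter {ldap_filter}")
```

For the example user bob the resulting filter is (cn=bob), which matches the entry cn=bob under the configured search base.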
arclib.reingest
- transferAreaKeepFreeMb: workspace - minimum space to leave free in the transfer area by the reingest
- workspaceKeepFreeMb: workspace/quarantine - minimum space to leave free in the workspace by the reingest
- sharedStorage: true - true if the workspace and the transfer area share space (on the disk)
- exportCron: "0 * * ? * *" - period of the reingest job
arclib.path
- workspace: workspace
- quarantine: workspace/quarantine
- fileStorage: fileStorage - this is the transfer area for storing the SIP package before ingestion
- reports: exportedReports - path of reports stored in the file system
spring.servlet.multipart
- location: ${user.dir}/multipart_tmp - directory used by Spring as temporary storage for uploaded files - must be fully accessible by the user running the ARCLib service
arclib
- arclibXmlSchema: classpath:xmlSchemas/arclibXml.xsd
- metsSchema: classpath:xmlSchemas/mets.xsd
- premisSchema: classpath:xmlSchemas/premis-v2-2.xsd
- sipProfileSchema: classpath:xmlSchemas/sipProfile.xsd
- validationProfileSchema: classpath:xmlSchemas/validationProfile.xsd
arclib.script
- ingestRoutineBatchStart: classpath:scripts/batchStart.groovy - groovy script for starting a batch (used in the Ingest routines)
- export: classpath:scripts/export.groovy - groovy script for export
- keepAliveUpdate: classpath:scripts/keepAliveUpdate.groovy - groovy script for checking whether ARCLibXml is actively being edited at the frontend and keeping the edit lock on the authorial package alive
- formatsRevisionNotification: classpath:scripts/formatsRevisionNotification.groovy - groovy script for notifications about necessary revisions of formats
- reportNotification: classpath:scripts/reportNotification.groovy - groovy script for notification about generated reports
namespaces
- mets: http://www.loc.gov/METS/
- arclib: http://arclib.lib.cas.cz/ARCLIB_XSD
- premis: info:lc/xmlns/premis-v2
- oai_dc: http://www.openarchives.org/OAI/2.0/oai_dc/
- dc: http://purl.org/dc/elements/1.1/
- xsi: http://www.w3.org/2001/XMLSchema-instance
formatLibrary
- formatDetailListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatDetailListAction.aspx - URL to fetch PRONOM format detail
- formatListUrl: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatListAction.aspx - URL to fetch PRONOM list of formats
- scheduleUpdate: false - turns off/on scheduled updates
- updateCron: 0 0 0 1 * ? - periodicity of scheduled updates from PRONOM
camunda.bpm.job-execution
- lockTimeInMillis: 300000 - timeout of a single BPM task, see application.yml. Because of the slow ClamAV execution on large SIPs we have increased the timeout on the tst server to 1800000. A message indicating the timeouts may look like this: update or delete on table "act_ge_bytearray" violates foreign key constraint "act_fk_job_exception" on table "act_ru_job". Before increasing this timeout, try lowering the load of the server (lower JMS pool size or Camunda job-execution pool size). If a workflow contains a task which runs for too long even when the server is not heavily loaded, consider executing it outside of the engine and notifying the engine with a message once the task is done.
- corePoolSize: 3 - core pool size of BPM engine threads. We have reduced this value to 2 on the tst server to reduce the possibility of multiple simultaneous ClamAV processes fighting for resources.
- maxPoolSize: 10 - max pool size of BPM engine threads. We have reduced this value to 2.
logging
- path: ../logs - path to the folder containing logs
- level.* - allows specification of logger levels of any class or package, e.g. logging.level.org.apache.solr: WARN
arclib
- arclibXmlIndexConfig: classpath:index/arclibXmlIndexConfig.csv - config file used during indexation
- arclibXmlSystemWideValidationConfig: classpath:arclibXmlSystemWideValidationConfig.csv - config file used during validations in ARCLibXML extractor and ARCLibXML generator tasks
- deleteSipFromTransferArea: true - if true, the SIP package is deleted from the transfer area after the ingest workflow has finished successfully
- aipStoreAttempts: 3 - number of attempts to store AIP to archival storage
- aipStoreAttemptsInterval: PT5M - interval to wait between attempts to store AIP to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
- aipSavedCheckAttempts: 10 - number of attempts to check that AIP was stored successfully to archival storage
- aipSavedCheckAttemptsInterval: PT1M - interval to wait between attempts to check that AIP was stored successfully to archival storage (in PnYnMnDTnHnMnS or PnW format of ISO 8601)
- keepAliveUpdateTimeout: 10 - minimal interval (in seconds) in which the frontend needs to call the keep alive endpoint during editing of ARCLibXml
- keepAliveNetworkDelay: 2 - estimated network delay (in seconds) used during validation of the edit lock during editing of ARCLibXml. If the real delay is much higher, the user may see an error message during ARCLib XML editing. The error message explicitly mentions that increasing this property may be a solution if the problem persists.
- externalProcess.timeout.sigterm: 7200 - timeout (in seconds) for any external process run by the application, e.g. ClamAV, DROID. If this timeout is reached and the process is still running, a termination signal is sent.
- externalProcess.timeout.sigkill: 1800 - timeout (in seconds) to wait for the process to shut down gracefully after receiving the termination signal. If this timeout is reached and the process is still running, a kill signal is sent.
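The two externalProcess timeouts describe a two-stage shutdown: first a termination signal, then a kill signal if the process does not exit. The Python sketch below shows that escalation pattern on a POSIX system. It is an assumption-laden illustration, not ARCLib's implementation; the function name run_with_timeouts is hypothetical.

```python
import subprocess
import sys


def run_with_timeouts(cmd: list, sigterm_timeout: float, sigkill_timeout: float) -> int:
    """Run cmd; send SIGTERM after sigterm_timeout seconds, then SIGKILL
    if the process is still alive sigkill_timeout seconds later."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=sigterm_timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX: ask the process to stop
    try:
        return proc.wait(timeout=sigkill_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be ignored
        return proc.wait()


if __name__ == "__main__":
    # Stand-in command; the defaults from the list above are 7200 and 1800 seconds
    code = run_with_timeouts([sys.executable, "-c", "print('scan finished')"], 7200, 1800)
    print(f"exit code: {code}")
```

On POSIX a process ended by a signal is reported with a negative return code, so a process stopped in the first stage returns the negated SIGTERM number.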
Tests wiki section is obsolete
Tests use the system/src/test/resources/application.properties file as the configuration file for the test environment. Properties specified there override the properties listed in system/src/main/resources/application.yml.
Tests require a Solr instance running in cloud mode with 2 additional collections. The names of the required test collections are specified in the application.properties file:
solr.arclibxml.corename=arclibXmlTestC
solr.test.corename=testC
Create these collections analogously to the other collections used by the system:
solr create -c testC -d arclib-managed
solr create -c arclibXmlTestC -d arclib-arclibXmlC-schema