The ArchivesSpace Export Service provides a framework for scheduled EAD export and publication. It consists of an ArchivesSpace plug-in and an application that uses the ArchivesSpace API but which is otherwise independent of ArchivesSpace and can be deployed anywhere so long as it has access to the ArchivesSpace backend.
The ArchivesSpace Export Service was developed by Hudson Molonglo for Yale University.
This git repository contains both the ArchivesSpace plug-in and the Export Service Application. Download a release from:
https://github.com/hudmol/archivesspace_export_service/releases
To install the plug-in, unzip the release into:
/path/to/archivesspace/plugins
Like this:
$ cd /path/to/archivesspace/plugins
$ unzip archivesspace_export_service-vX.X.zip
Then enable the plugin by editing the file in config/config.rb
:
AppConfig[:plugins] = ['some_plugin', 'archivesspace_export_service']
(Make sure you uncomment this line (i.e., remove the leading '#' if present))
Then restart ArchivesSpace.
To install the Exporter Application unzip the release to wherever you want to deploy it. This could be where the plug-in is deployed, in which case the unzipped release under plugins can be used.
The Exporter Application is entirely contained within the
exporter_app
directory.
The only external dependency is java. Make sure you have installed the correct version of java:
java -version
It should be version 1.6 or newer.
Run the application like this:
$ cd /path/to/archivesspace_export_service/exporter_app
$ bin/startup.sh
And shut it down like this:
$ cd /path/to/archivesspace_export_service/exporter_app
$ bin/shutdown.sh
The exporter application now uses gems. If running from source you will need to dowload the required gems like this:
$ cd /path/to/archivesspace_export_service/exporter_app
$ bin/bootstrap.sh
This step is not required if running from a distributed release.
UPGRADE NOTE: If upgrading from v1.0, you will need to remove the ead export work queue database before starting the application, like this:
$ cd /path/to/archivesspace_export_service/exporter_app
$ rm workspace/ead/db/ead_export.sqlite
See below for configuration options.
The configuration for the Exporter Application is set in:
/path/to/archivesspace_export_service/exporter_app/config/config.rb
Edit this file to set the url, username and password to access the ArchivesSpace API, like this:
{
aspace_username: 'a_user',
aspace_password: 'secret_password',
aspace_backend_url: 'http://localhost:8089/'
}
Note that the ArchivesSpace user (a_user
, here) will need
view_repository
permission on any ArchivesSpace repository from
which resources will be exported.
Also note that the url is to the ArchivesSpace backend, not the frontend web UI. If the Exporter Application is deployed on a different machine from ArchivesSpace you may need to configure your firewall to open the backend port.
If you intend to use the Handle creation feature, you will also need to set some handle related configuration options, like this:
{
handle_wsdl_url: 'http://link.its.yale.edu/ypls-ws/PersistentLinking?wsdl',
handle_user: '10079.1/FA',
handle_credential: '[YOUR CREDENTIAL]',
handle_prefix: '10079.1/fa',
handle_group: '10079.1/FA',
handle_base: 'http://archives.yale.edu',
}
And if using Handles, the configured ArchivesSpace user (a_user
above)
will need permissions to update resources on any ArchivesSpace repository
from which resources will be exported, in addition to the permission
discussed above. This is because generated Handles are written back to
the resource (in the ead_location field).
The plug-in creates a new endpoint on the ArchivesSpace backend that
provides lists of resources that have changed since a specified
time. It provides a list of adds
or resources that should be
exported (or re-exported) for publication, and a list of removes
that have been deleted, or unpublished or suppressed, which should be
removed from the export pipeline. Note, however, that removing a
record does not delete it from git's history: once a record has been
published to GitHub, it remains in the repository's history.
The Exporter Application runs as a daemon and uses a configuration
file (config/jobs.rb
- discussed below) to determine when it should
start or stop export jobs.
Each export job remembers when it last ran, and whether it has any
unfinished items from its previous run. It uses this information to
hit the new ArchivesSpace endpoint for the lists of adds
and
removes
for this run.
For each resource in the adds
list it exports it as EAD and then
places it in a pipeline for processing. The pipeline for each job is
configured in config/jobs.rb
. It might include steps such as EAD
validation, XSLT transformation, and ultimately a publication
step. The current release contains a Publish to Github
publisher. In
order to publish to a different kind of web service, it will be
necessary to develop a new publisher.
For each resource in the removes
list it simply removes the exported
file (if it has been exported previously) and unpublishes it.
Jobs are configured in the jobs.rb
file at:
/path/to/archivesspace_export_service/exporter_app/config/jobs.rb
See the jobs.rb
file that ships with the release for job
configuration examples.
The jobs.rb
configuration file is Ruby code. At the top level it is
a Hash with one key jobs
. The value of jobs
is an array of
WeekdayJob
objects. Each of these WeekdayJobs represents a scheduled
task (such as an export or a merge).
The following configuration options identify and describe the job:
-
:identifier
- A unique key for this job.Example:
:identifier => 'repo_1_ead'
-
:description
- A description of the job.Example:
:description => 'Export EAD for all of Repository 1'
The following configuration options specify the scheduling for the job. A job is scheduled to run within a time window on particular days of the week. If the job completes within its window, it will run again, after a configurable delay. This will be repeated until the window ends. Any unfinished exports will be resumed when the job is next scheduled to run.
Available options are:
-
:days_of_week
- The days of the week that the job should run on. Use three character string abbreviations in an array.Example:
:days_of_week => ['Mon', 'Tue', 'Wed', 'Thu']
-
:start_time
- The local time in hours and minutes (24 hour clock) that the job will be scheduled to run on the days specified above.Example:
:start_time => '23:05' # 11:05PM
-
:end_time
- The local time in hours and minutes (24 hour clock) that the job will be scheduled to stop running if it hasn't already completed. This allows jobs to be constrained to run within specified time windows - to avoid impacting system performance during business hours, for instance.If a job ends by hitting its
:end_time
(rather than by completing its work), any unfinished exports will be resumed when the job is next scheduled to run.Example:
:end_time => '08:30' # 8:30AM
-
:minimum_seconds_between_runs
- If a job finishes early within its run window, you might not want it to immediately start up again. By setting:minimum_seconds_between_runs
, you can control how long a job must wait before it can be rescheduled. For example, setting this to86400
would force jobs to wait at least 24 hours before running again.Example:
:minimum_seconds_between_runs => 3600 # an hour
Each job has a task. There are currently two task types -
ExportEADTask
and RepositoryMergeTask
. Their configuration options
are described below.
The ExportEADTask exports records from the ArchivesSpace API in EAD format, runs them through a configurable set of validations and transformations, and writes the resulting records to a git repository. Each batch of exported records is committed to the git repository with a timestamp indicating when it was added.
Within the jobs.rb
file, this task can be configured with a number
of :task_parameters
. These are as follows:
-
:commit_every_n_records
- Forces the job to create intermediary git commits as it runs, rather than a single commit when the export finishes. This saves on lost work if the job is interrupted mid-run for some reason. -
:search_options
- Provides scoping to control which records are candidates for export. The available search options are:-
:repo_id - The integer ID of the repository of interest (the default is to export from all repositories)
-
:identifier - The 4-part identifier of a single collection to be exported (the default is all collections)
-
:start_id, :end_id - Partial identifiers that form a range of resource IDs to export. For example, a start of
AAA.01
and end ofZZZ.99
would match records AAA.01, AAA.02, ..., XYZ.50, ..., ZZZ.01, ... ZZZ.98, ZZZ.99.
-
-
:export_options
- Provides additional options to be passed to the ArchivesSpace export process. Current available options are:-
:include_unpublished
- Whether unpublished components should be exported along with the resource (default: false) -
:include_daos
- Whether to include Digital Objects in dao tags (default: false) -
:numbered_cs
- Use numbered c tags in ead (default: false)
-
-
:generate_handles
- If set totrue
then a Handle will be created immediately before export for any resources that have a value inead_id
but do not have a value inead_location
. The created Handle will be written back to the resource in theead_location
field.
The sample jobs.rb
file shows a fully configured ExportEADTask which
makes use of :after_hooks
(described below) to additionally produce
PDF versions of finding aids and a table of contents. Note that the
PDF task will make use of any fonts placed within config/fonts
.
The RepositoryMergeTask
allows the output of many other tasks to be
merged into one git repository.
Within the jobs.rb
file, this task can be configured with a number
of :task_parameters
. These are as follows:
-
:jobs_to_merge
- A list of job identifiers, the outputs of which should be merged. -
:job_descriptions
- A hash keyed on the included job identifiers. The values are descriptions of the jobs that, in the default configuration, will be written to the top-level README for the merged repository. -
:include_additional_files_from
- A directory, the contents of which will be added to the root of the git repository. Any.erb
files included will be processed. -
:git_remote
- The URL of the remote git repository. See below.
The EAD export task performs several validation and transform steps, described in this section.
You can validate the EAD exported by the system using
Schematron validations. To do this, place
your Schematron .sch
file into the exporter_app/config
directory,
and then reference it in your ExportEADTask as follows:
# See the sample jobs.rb file for a complete example
:schematron_checks => ['config/my-schematron-file.sch'],
If an exported record fails its Schematron validation, the validation error will be logged and the record skipped.
Note that you can provide multiple Schematron files if you want to carry out more than one check. Just add them to the list like this:
:schematron_checks => ['config/my-schematron-file.sch', 'config/another-schematron-file.sch'],
In addition to Schematron files, you can also use regular XSD files to carry out validations. The process is identical as for Schematron but for the property name you use. Provide your XSD files to the ExportEADTask like this:
:validation_schema => ['config/ead.xsd', 'config/another.xsd'],
By providing one or more XSLT transforms, you can apply modifications to the EAD files exported from ArchivesSpace. Any XSLT transforms you provide will be run immediately after the EAD record is exported from ArchivesSpace, but before any Schematron or XSD validations. This allows your XSLT to clean up any common data issues prior to validation.
As with the validations, you configure your XSLT transforms by adding
your .xslt
file to the exporter-app/config
directory, then adding
a reference to your jobs.rb
file such as:
:xslt_transforms => ['config/transform.xslt', 'config/another-transform.xslt'],
Transforms are run left to right, with the output of each transform feeding into the next one as input.
Each WeekdayJob
has a list of :before_hooks
and :after_hooks
that are run at different points in the export process. The exporter
app makes use of these for its own purposes (preparing git
directories, producing PDF files, committing to git, etc.), but you
can add your own hooks to inject your own custom behavior. For
example, by adding a new ShellRunner
instance to the list, you could
run a shell script that emailed you whenever an export job completed.
As the names would suggest, before hooks run prior to the export process (at the point the job is started), while after hooks run once the export has completed, and once the XSLT transforms and validations have finished.
When running a script via ShellRunner
, the shell script you provide
should return 0 on success, or anything else for failure. If the
shell script returns a failure status, the export job will abort.
Publishing to GitHub is the responsibility of the
RepositoryMergeTask
, which takes the output from one or more EAD
export tasks, combines them into a single git repository and pushes
the result to GitHub.
Each job defined in jobs.rb
produces a directory (a git repository)
containing the records that have been exported by that job. For the
purposes of publishing records to GitHub, it is often desirable to
combine the output of one or more export jobs into a single repository
on GitHub. For example, you might have one export job that exports
any manuscripts collections, and another job that exports music
collections, but you want these collections to appear merged into a
single GitHub repository.
To do this, we define a new job in jobs.rb
that runs a
RepositoryMergeTask
. You can see an example of this in the sample
jobs.rb
provided with the exporter-app: this combines the records of
the two ExportEADTask jobs into a single repository and pushes it up
to GitHub. Generally you would want to configure the start and end
times for the RepositoryMergeTask
so that it runs after the relevant
export tasks have finished. If the merge task starts running while
the exports are still going, that's fine: it will just merge and
publish whatever is available and catch up when it next runs.
Note that, as the RepositoryMergeTask pushes to GitHub, you will always have at least one of these tasks defined--even if you only have a single export job. Additionally, you are free to have as many number of export tasks and merge tasks as you like, and you can combine them in arbitrary ways to produce the outcome you need.
When the merge task has finished merging the exported records, its final step is to push the result to a GitHub repository. To do this, it needs to be configured with access to the GitHub repository that will receive the records. There are two ways to go about this:
-
You can put a GitHub username and password directory in the configuration file; or
-
You can set up SSH deploy keys in GitHub to give the exporter application the access it needs
For the username/password option, you can just supply your account
credentials when you specify the :git_remote
URL:
:git_remote => 'https://yourusername:[email protected]/yourusername/yourrepo.git',
If you want to use SSH keys, you will need to specify your
:git_remote
URL like this:
:git_remote => '[email protected]:yourusername/yourrepo.git',
Next, create a new pair of SSH keys that you will need to add to GitHub. From a terminal:
$ cd /path/to/exporter-app
$ bin/generate_deploy_key.sh git_repository
Note that git_repository
above should match the identifier of the
merge task in your jobs.rb
(the example file uses git_repository
).
You'll see that the generate_deploy_key.sh
script prints some
instructions on how to import the newly created SSH key into GitHub.
Once you have done that, the application should be able to push to
your GitHub repository.