Configuring Bulkrax
Bulkrax has a range of configuration options. Once the Bulkrax installer has been run (rails g bulkrax:install), a configuration file is added at the following location:
# config/initializers/bulkrax.rb
Defaults are in place for the various available configurations. To view defaults via the rails console:
rails c
# Bulkrax.{name of config} for example:
> Bulkrax.parsers
[
{ name: "OAI - Dublin Core", class_name: "Bulkrax::OaiDcParser", partial: "oai_fields" },
{ name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" },
{ name: "CSV - Comma Separated Values", class_name: "Bulkrax::CsvParser", partial: "csv_fields" },
{ name: "Bagit", class_name: "Bulkrax::BagitParser", partial: "bagit_fields" }
]
Add a local configuration for parsers to:
- Change the name displayed in the importer create and edit pages.
- Disable a parser, by removing it from the configuration.
- Use a custom form partial to override the display in the importer create and edit pages
- Add a custom parser
# config/initializers/bulkrax.rb
Bulkrax.setup do | config |
# Remove the QualifiedDC parser
config.parsers -= [{ name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" }]
# Having removed it, add it back with a new name and partial (app/views/bulkrax/importers/_oai_terms_fields.html.erb MUST exist)
config.parsers += [{ name: "OAI - DC Terms", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_terms_fields" }]
# Add a new parser (Bulkrax::OaiCustomParser must exist at app/parsers/bulkrax/oai_custom_parser.rb)
config.parsers += [{ name: "OAI - Custom", class_name: "Bulkrax::OaiCustomParser", partial: "oai_fields" }]
end
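The console output above shows the parser list as an Array of Hashes, so `-=` only removes an entry whose hash matches exactly (name, class_name and partial). The sketch below uses a stand-in array rather than Bulkrax itself; it shows that behaviour and an alternative that matches on class_name alone.

```ruby
# Stand-in for Bulkrax.parsers; the real value comes from the engine's defaults.
parsers = [
  { name: "OAI - Dublin Core", class_name: "Bulkrax::OaiDcParser", partial: "oai_fields" },
  { name: "CSV - Comma Separated Values", class_name: "Bulkrax::CsvParser", partial: "csv_fields" }
]

# A hash that differs in any value (here, the name) is NOT removed by Array#-
unchanged = parsers - [{ name: "CSV", class_name: "Bulkrax::CsvParser", partial: "csv_fields" }]

# Matching on class_name alone avoids depending on the exact display name
without_csv = parsers.reject { |parser| parser[:class_name] == "Bulkrax::CsvParser" }

puts unchanged.size
puts without_csv.size
```

If a removal in the initializer appears to have no effect, comparing the hash against the output of Bulkrax.parsers in the console is a reasonable first check.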
In Bulkrax, the representative work_identifier attribute must exist on the Work, FileSet and Collection models, and its value must be unique across all identifiers in the Hyrax/Hyku app. It is used to store the source identifier on those models. When the import runs, it checks whether a model already exists with the same work_identifier: source_identifier property; if so, it updates that existing model. If not, a new model is created.
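The update-or-create behaviour described above can be pictured with a minimal sketch. It is not Bulkrax's implementation: REPOSITORY and import_record are hypothetical names, and an in-memory Hash stands in for the repository.

```ruby
# Hypothetical in-memory stand-in for the repository, keyed by the work identifier value.
REPOSITORY = {}

# Update the record if one with this identifier already exists; otherwise create it.
def import_record(identifier, attributes)
  if REPOSITORY.key?(identifier)
    REPOSITORY[identifier].merge!(attributes)
    :updated
  else
    REPOSITORY[identifier] = attributes.dup
    :created
  end
end

first  = import_record("imported_work_1", { title: "Work One" })
second = import_record("imported_work_1", { title: "Work One (revised)" })

puts first
puts second
puts REPOSITORY.size
```

This is why the identifier has to be unique: two different source records sharing a value would be treated as the same model, and the second would overwrite the first.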
There are three ways to set the attribute:
- Allow Bulkrax to use the default work_identifier of source
- Use an existing Hyrax attribute. This can be changed in the local application by setting source_identifier: true in the mappings:
Bulkrax.setup do | config |
# Use the identifier field (note: identifier must be available on all works, file sets and collections).
config.field_mappings['Bulkrax::OaiDcParser'] = {
"identifier" => { from: ["source_identifier"], source_identifier: true }
}
end
- Create a new attribute
  - If there isn't an attribute that's available and unique across all Works, FileSets and Collections, you can create a custom field. An example of how this can be done in the local application is as follows:
    - Add the field to all models (which should also update your Solr documents)
    - Use the field
config.field_mappings['Bulkrax::CsvParser'] = { 'bulkrax_identifier' => { from: ['bulkrax_identifier'], source_identifier: true } }
In Bulkrax, a field (header) representing the source_identifier must exist on the imported document (csv, xml, etc.). There are two ways to set the field:
- Use the default source_identifier field; no configuration necessary.
- Configure Bulkrax to use another field
  - This field does not have to exist on the model, but it does have to exist on the imported document. Either way, it must be mapped as the "from" term for your source_identifier mapping.
Bulkrax.setup do | config |
# Use original_identifier
config.field_mappings['Bulkrax::CsvParser'] = { 'identifier' => { from: ['original_identifier'], source_identifier: true } }
end
The value corresponding to the field above must be unique throughout the entire database. You cannot use your separator (e.g. "|" or ";") in this value on the imported document, and it is recommended to follow file naming conventions (no spaces, etc.). The value in the field will be stored as the value of the work identifier on the Work, FileSet or Collection. When the import runs, it checks whether a model already exists with the same work_identifier: source_identifier property; if so, it updates that existing model. If not, a new model is created.
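To make the two names concrete: in a mapping flagged with source_identifier: true, the key is the attribute on the model (the work identifier) and the "from" entry is the header on the imported document (the source identifier). The following plain-Ruby sketch resolves both from a mapping hash; the sample row and the second mapping entry are made up for illustration.

```ruby
# Destination attribute => options, using the original_identifier mapping from the example above
mapping = {
  "identifier" => { from: ["original_identifier"], source_identifier: true },
  "title"      => { from: ["title"] }
}

# The flagged entry names both sides: the model attribute and the import header
work_identifier, options = mapping.find { |_field, opts| opts[:source_identifier] }
source_identifier = options[:from].first

# Hypothetical row from an imported CSV
row = { "original_identifier" => "myuniquevalue", "title" => "Work One" }

puts "#{work_identifier} = #{row[source_identifier]}"
```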
There are two ways to set this value:
- It already exists on the imported document. (e.g., you typed it on your csv or it's already in the oai feed)
field |
---|
'myuniquevalue' |
- Allow Bulkrax to create it
- Update config/initializers/bulkrax.rb
# You can use any available arguments, not just 'obj' and 'index'
config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
- NOTE: Site.instance.account.name is only used for Hyku and should be removed for Hyrax setups.
- Use the field in your code
Bulkrax.fill_in_blank_source_identifiers.call(obj, index)
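For a Hyrax setup, the lambda would drop the Site.instance.account.name prefix. The sketch below shows one such variant and how it is called. Importer and Parser are hypothetical Struct stand-ins for whatever object Bulkrax passes as obj; only the importerexporter.id call chain is taken from the configuration above.

```ruby
# Hypothetical stand-ins for the objects passed to the lambda.
Importer = Struct.new(:id)
Parser   = Struct.new(:importerexporter)

# Hyrax-style variant: no Site.instance.account.name prefix
fill_in_blank_source_identifiers = ->(obj, index) { "#{obj.importerexporter.id}-#{index}" }

parser    = Parser.new(Importer.new(7))
generated = fill_in_blank_source_identifiers.call(parser, 3)

puts generated
```

Combining the importer id with the row index keeps generated values distinct across importers, which matters because the value must be unique throughout the database.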
- Allow Bulkrax to export generated metadata
- On the Bulkrax exporter page, if include generated_metadata is selected, the export includes any field mapping that has the generated: true flag set. Although configurable, the idea of this option is to export fields that are not set by a user but are instead generated and set by the system.
Example:
'date_uploaded' => { from: ['date_uploaded'], generated: true }
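As a rough illustration of the flag's effect (not the exporter's actual code), the sketch below filters a mapping hash according to whether generated metadata should be included. The method name exportable_fields and the sample mapping are assumptions.

```ruby
# Hypothetical helper: which destination fields would be exported?
def exportable_fields(mappings, include_generated:)
  mappings.reject { |_field, opts| opts[:generated] && !include_generated }.keys
end

mappings = {
  "title"         => { from: ["title"] },
  "date_uploaded" => { from: ["date_uploaded"], generated: true }
}

without_generated = exportable_fields(mappings, include_generated: false)
with_generated    = exportable_fields(mappings, include_generated: true)

puts without_generated.inspect
puts with_generated.inspect
```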
default_work_type specifies the work type to use when none is specified in the import data or mappings.
Bulkrax.setup do | config |
# Supply the work type as a string (ie. wrapped in single or double quotes). The Image work type must exist.
config.default_work_type = 'Image'
end
reserved_properties is a list of properties that are considered 'reserved' and will not be overwritten with import data.
Bulkrax.setup do | config |
# Add a local reserved property; use strings
config.reserved_properties += ['person_identifier']
end
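One way to picture the effect of a reserved property is as a filter applied to incoming data before it reaches the model. This is only a sketch under that assumption; the list and the incoming hash are invented for illustration.

```ruby
# Hypothetical local list, mirroring the configuration above
reserved_properties = ["person_identifier"]

# Hypothetical incoming import data for one record
incoming = { "title" => "Work One", "person_identifier" => "p-42" }

# Reserved keys are dropped, so existing values on the model are left alone
importable = incoming.reject { |key, _value| reserved_properties.include?(key) }

puts importable.inspect
```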
Creating Collections using the collection_field_mapping is no longer supported as of Bulkrax version 3.0. Please configure Bulkrax to use related_parents_field_mapping and related_children_field_mapping instead.
The field in the incoming import data to use to identify a collection. This is set per Entry. By default it is configured for the CSV entry (Bulkrax::CsvEntry) only and is set to look for a column called 'collection'. It is NOT used by the OAI entries and so does not need to be set for those.
Bulkrax.setup do | config |
# Change the collection_field_mapping to use a column called 'primary_collection'
config.collection_field_mapping = {
'Bulkrax::CsvEntry' => 'primary_collection'
}
end
Mappings: related_children_field_mapping, related_parents_field_mapping
The fields in the incoming import data used to identify a parent-child relationship. These are set per Entry. Both mappings accept IDs as well as Bulkrax source_identifiers.
Similarly to source_identifier, these mappings are declared on one of the field mappings:
Bulkrax.setup do |config|
config.field_mappings = {
'Bulkrax::CsvParser' => {
'parents' => { from: ['parents'], related_parents_field_mapping: true },
'children' => { from: ['children'], related_children_field_mapping: true },
}
}
end
By default, the related_children_field_mapping is not configured.
Examples (CSV):
In these examples, the related_children_field_mapping is configured to use the children column. Work One will become a child of Work Two.
Using source_identifier
source_identifier | title | children |
---|---|---|
imported_work_1 | Work One | |
imported_work_2 | Work Two | imported_work_1 |
Using id
id | title | children |
---|---|---|
abc123 | Work One | |
def456 | Work Two | abc123 |
By default, the related_parents_field_mapping is not configured.
Examples (CSV):
In these examples, the related_parents_field_mapping is configured to use the parents column. Work One will become a parent of Work Two.
Using source_identifier
source_identifier | title | parents |
---|---|---|
imported_work_1 | Work One | |
imported_work_2 | Work Two | imported_work_1 |
Using id
id | title | parents |
---|---|---|
abc123 | Work One | |
def456 | Work Two | abc123 |
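Both columns describe the same kind of link from opposite ends. The sketch below, which is not Bulkrax code, normalises rows like those in the tables above into parent/child pairs. The third row and the choice of source_identifier values are illustrative assumptions.

```ruby
# Rows as they might look after reading a CSV with both columns configured
rows = [
  { "source_identifier" => "imported_work_1", "title" => "Work One" },
  { "source_identifier" => "imported_work_2", "title" => "Work Two", "parents" => "imported_work_1" },
  { "source_identifier" => "imported_work_3", "title" => "Work Three", "children" => "imported_work_1" }
]

# A value in "parents" makes the current row the child;
# a value in "children" makes the current row the parent.
pairs = rows.flat_map do |row|
  id = row["source_identifier"]
  result = []
  result << { parent: row["parents"], child: id } if row["parents"]
  result << { parent: id, child: row["children"] } if row["children"]
  result
end

pairs.each { |pair| puts "#{pair[:parent]} -> #{pair[:child]}" }
```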
Old Documentation (pre-2.0)
This configuration option has been replaced with related_parents_field_mapping in newer versions of Bulkrax.
The field in the incoming import data to use to identify a parent-child relationship. This is set per Entry. By default it is not configured at all. Configuring this will use the identifier found in the given field (eg. a column in a CSV file) to look for an existing Work or Collection resource and add the current record as a child of that resource.
Bulkrax.setup do | config |
# Use a column called 'children'
config.child_field_mapping = {
'Bulkrax::CsvEntry' => 'children'
}
end
For example, in the data below imported_work_1 will become a child of imported_work_2:
source_identifier | title | children |
---|---|---|
imported_work_1 | Work One | |
imported_work_2 | Work Two | imported_work_1 |
Field mappings are used to set up mappings from the import (the source / from) data to the repository (the destination). Field mappings are set on a per parser basis. The following shows the default mapping for the OaiDcParser:
config.field_mappings['Bulkrax::OaiDcParser'] = {
"contributor" => { from: ["contributor"] },
"creator" => { from: ["creator"], join: true },
"date_created" => { from: ["date"] },
"description" => { from: ["description"] },
"identifier" => { from: ["identifier"] },
"language" => { from: ["language"], parsed: true },
"publisher" => { from: ["publisher"] },
"related_url" => { from: ["relation"] },
"rights_statement" => { from: ["rights"] },
"license" => { from: ["license"], split: '\|' }, # some characters may need to be escaped
"source" => { from: ["source"] },
"subject" => { from: ["subject"], parsed: true },
"title" => { from: ["title"] },
"resource_type" => { from: ["type"], parsed: true },
"remote_files" => { from: ["thumbnail_url"], parsed: true }
}
Each Parser is represented by a data hash containing the destination field as the key, and a data hash as the value. The following keys may be used in the data hash:
- from: - supply an Array of source data field names to map to the given key
- parsed: - if set to true, use the corresponding parse method, found in "app/matchers/bulkrax/application_matcher.rb", on the given data
- split: - if set to true, split on semicolon (;) OR pipe (|). Otherwise, supply a string containing a split character, or a regex for more complex patterns.
  - semicolons are valid characters in a URL. When supplying multiple URLs for this key, use the pipe as the separator in the CSV.
  - for ease of use, the same "split" value should be used for all properties
- if: - advanced use only, supply an Array containing two items - a method name (a string) in the first position and a regexp (as a string) in the second position. A common use case for this key is extracting only URLs from a field with if: ['match?', /http(s{0,1}):\/\//].
- excluded: - if set to true, this field will not be processed; if omitted, the field will be processed if it matches a field in the destination
- join: - on export, multi-valued properties will be separated into numerated column headers unless join: true is in the field mapping
# Let's change the mapping for the contributor field on OaiDcParser
# supply 'Bulkrax::OaiDcParser' as the first key, and ['contributor'] as the second
config.field_mappings['Bulkrax::OaiDcParser']['contributor'] = {
# map data from the publisher and contributor fields in the OAI record to 'contributor' in Hyrax
from: ['publisher', 'contributor'],
# run the parse_contributor method on the data (this method MUST exist in the Bulkrax::ApplicationMatcher or
# Bulkrax::OaiMatcher classes) - HINT: this method doesn't currently exist so would need adding locally.
parsed: true,
# split the data on the '--' separator (so 'personA--personB' would create two contributors - personA and
# personB)
split: '--'
}
# Let's exclude the publisher field now the data is going into contributor
config.field_mappings['Bulkrax::OaiDcParser']['publisher'] = { excluded: true }
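The split: and if: keys are easier to reason about with a small model of what they do to a value. The sketch below is an interpretation of the key descriptions, not the matcher code in Bulkrax: split_value and keep? are hypothetical helpers, and treating a String split value as a regular expression source is an assumption suggested by the escaped '\|' in the default mapping.

```ruby
# Assumed behaviour: true splits on ';' or '|', a String is treated as a
# regular expression source, a Regexp is used as-is, anything else means no split.
def split_value(value, split)
  case split
  when true   then value.split(/\s*[;|]\s*/)
  when String then value.split(Regexp.new(split))
  when Regexp then value.split(split)
  else [value]
  end
end

# Assumed behaviour of `if:`: call the named method on the value with the pattern.
def keep?(value, condition)
  return true if condition.nil?

  method_name, pattern = condition
  value.public_send(method_name, Regexp.new(pattern))
end

puts split_value("a; b|c", true).inspect
puts split_value("personA--personB", "--").inspect
puts split_value("CC-BY|CC0", '\|').inspect
puts keep?("https://example.org/a.jpg", ["match?", "http(s{0,1})://"])
puts keep?("not a url", ["match?", "http(s{0,1})://"])
```

An unescaped "|" would be read as regex alternation under this interpretation, which is why the pipe is written as '\|' in the default license mapping.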
- If the models in your app allow for objects to be stored, the object property must be set in the field mapping
- The from field should not be numerated, even if the header is. The code will handle numerated and non-numerated column headers.
Sample CSV:
creator_first_name_1 | creator_last_name_1 | creator_position_1 | creator_first_name_2 |
---|---|---|---|
Aaliyah | Haughton | Queen | Ruth |
# Example of the field mapping
config.field_mappings['Bulkrax::YOUR-PARSER'] = {
...,
'creator_first_name' => { from: ['creator_first_name'], object: 'creator' },
'creator_last_name' => { from: ['creator_last_name'], object: 'creator' },
'position' => { from: ['creator_position'], object: 'creator', nested_type: 'Array' },
}
# Example of the model after the import
Model = {
id: 1234,
creator: [
{
creator_first_name: 'Aaliyah',
creator_last_name: 'Haughton',
position: ['Queen']
}, {
creator_first_name: 'Ruth'
}
]
}
Data hash:
- from: - supply an Array of source data field names to map from the csv to the given key
- object: - if the "from" value is a property on an object, supply the name of that object. The parser can handle two object situations:
  - The key is prefixed with the name of the object, e.g. "creator_first_name" => { from: ["creator_first_name"], object: "creator" }
  - The key is not prefixed with the name of the object, e.g. "first_name" => { from: ["creator_first_name"], object: "creator" }
- nested_type: - the data type of the "value" in the property that maps to the "from" key, if it's not a string (which is the default)
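To show how numerated headers and the object / nested_type keys could combine into the model shown earlier, here is a self-contained sketch. It is an illustration of the idea rather than the parser's real logic; MAPPING and build_objects are hypothetical names, and headers without a numeric suffix are assumed to belong to the first object.

```ruby
# Mapping from the example above: destination key => options
MAPPING = {
  "creator_first_name" => { from: ["creator_first_name"], object: "creator" },
  "creator_last_name"  => { from: ["creator_last_name"],  object: "creator" },
  "position"           => { from: ["creator_position"],   object: "creator", nested_type: "Array" }
}.freeze

def build_objects(row)
  result = Hash.new { |hash, object_name| hash[object_name] = {} }

  row.each do |header, value|
    next if value.nil? || value.empty?

    # Strip a trailing _N so the header matches the non-numerated "from" term
    base, number = header.match(/\A(.+?)_(\d+)\z/)&.captures || [header, "1"]
    key, options = MAPPING.find { |_key, opts| opts[:from].include?(base) }
    next unless options && options[:object]

    entry = (result[options[:object]][number.to_i] ||= {})
    entry[key.to_sym] = options[:nested_type] == "Array" ? [value] : value
  end

  # Order each object's entries by their header number
  result.transform_values { |by_number| by_number.sort_by { |n, _| n }.map(&:last) }
end

row = {
  "creator_first_name_1" => "Aaliyah",
  "creator_last_name_1"  => "Haughton",
  "creator_position_1"   => "Queen",
  "creator_first_name_2" => "Ruth"
}

model = build_objects(row)
puts model.inspect
```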
The default_field_mapping is used in the absence of a configured field_mapping. This configuration is a lambda that returns the following mapping when given a field. You are unlikely to need to override this configuration.
{
field =>
{
from: [field],
split: false,
parsed: false,
if: nil,
excluded: false
}
}
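Written out as a lambda, a mapping with this shape looks like the sketch below. The local variable name is illustrative; only the returned structure is taken from the description above.

```ruby
# A lambda that returns the default mapping shape for any field name
default_field_mapping = lambda do |field|
  { field => { from: [field], split: false, parsed: false, if: nil, excluded: false } }
end

mapping = default_field_mapping.call("title")
puts mapping.inspect
```

The effect is that an unmapped field is read from a source field of the same name, with no splitting, parsing, or conditions applied.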
The directory to write imports to, prior to import. The default is 'tmp/imports'.
The directory to write exports to, prior to download. The default is 'tmp/exports'.
The server name is sent with OAI requests. By default it is [email protected].