Configuring Bulkrax

Alisha Evans edited this page May 9, 2023 · 16 revisions

Bulkrax has a range of configuration options. Once the Bulkrax installation has been run (rails g bulkrax:install), a configuration file will be added at the following location:

# config/initializers/bulkrax.rb

Defaults are in place for the various available configurations. To view defaults via the rails console:

rails c
# Bulkrax.{name of config} for example:
> Bulkrax.parsers

[
  { name: "OAI - Dublin Core", class_name: "Bulkrax::OaiDcParser", partial: "oai_fields" },
  { name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" },
  { name: "CSV - Comma Separated Values", class_name: "Bulkrax::CsvParser", partial: "csv_fields" },
  { name: "Bagit", class_name: "Bulkrax::BagitParser", partial: "bagit_fields" }
]

Parsers

Add a local configuration for parsers to:

  • Change the name displayed in the importer create and edit pages.
  • Disable a parser, by removing it from the configuration.
  • Use a custom form partial to override the display in the importer create and edit pages
  • Add a custom parser
# config/initializers/bulkrax.rb

Bulkrax.setup do | config |

  # Remove the QualifiedDC parser
  config.parsers -= [{ name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" }]
  
  # Having removed it, add it back with a new name and partial (app/views/bulkrax/importers/_oai_terms_fields.html.erb MUST exist)
  config.parsers += [{ name: "OAI - DC Terms", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_terms_fields" }]
  
  # Add a new parser (Bulkrax::OaiCustomParser must exist at app/parsers/bulkrax/oai_custom_parser.rb)
  config.parsers += [{ name: "OAI - Custom", class_name: "Bulkrax::OaiCustomParser", partial: "oai_fields" }]

end
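
The `-=` and `+=` calls above are ordinary Ruby Array operations, so a removal only takes effect when the hash matches an existing entry key for key. The following is a minimal sketch in plain Ruby; the local `parsers` array is an assumed stand-in for `config.parsers` so that it can run outside Rails, and the `reject` variant is offered as one possible alternative rather than a documented Bulkrax pattern.

```ruby
# Stand-in for the default parser list shown earlier on this page.
parsers = [
  { name: "OAI - Dublin Core", class_name: "Bulkrax::OaiDcParser", partial: "oai_fields" },
  { name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" },
  { name: "CSV - Comma Separated Values", class_name: "Bulkrax::CsvParser", partial: "csv_fields" },
  { name: "Bagit", class_name: "Bulkrax::BagitParser", partial: "bagit_fields" }
]

# Array#- removes entries equal to the given hash, so every key and value
# has to match the existing entry exactly.
parsers -= [{ name: "OAI - Qualified Dublin Core", class_name: "Bulkrax::OaiQualifiedDcParser", partial: "oai_fields" }]

# A hash that differs in any value removes nothing.
unchanged = parsers - [{ name: "Bagit", class_name: "Bulkrax::BagitParser", partial: "other_fields" }]

# Alternative: drop a parser by class name only, without repeating the whole hash.
without_bagit = parsers.reject { |parser| parser[:class_name] == "Bulkrax::BagitParser" }

puts parsers.map { |parser| parser[:name] }.inspect
puts unchanged.size
puts without_bagit.map { |parser| parser[:class_name] }.inspect
```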

Work Identifier

In Bulkrax, the representative work_identifier attribute must exist on the Work, FileSet and Collection models, and its values must be unique across the Hyrax/Hyku app. It is used to store the source identifier on those models. When the import runs, it checks whether a model already exists whose work_identifier value matches the incoming source_identifier; if so, it updates that existing model. If not, a new model is created.
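
The update-or-create behaviour described above can be sketched in plain Ruby. This is an illustration only, not Bulkrax's implementation: the `Record` struct, the `import_record` method and the in-memory `repository` hash are all hypothetical names standing in for the real models and lookup.

```ruby
# Hypothetical in-memory repository keyed by the work identifier (here: "source").
Record = Struct.new(:source, :title)

def import_record(repository, source_identifier, attributes)
  existing = repository[source_identifier]
  if existing
    # A model with the same identifier exists: update it in place.
    existing.title = attributes[:title]
    :updated
  else
    # No match: create a new model and store the identifier on it.
    repository[source_identifier] = Record.new(source_identifier, attributes[:title])
    :created
  end
end

repository = {}
first  = import_record(repository, "imported_work_1", { title: "Work One" })
second = import_record(repository, "imported_work_1", { title: "Work One (revised)" })

puts first
puts second
puts repository.size
```

Running the same identifier through twice leaves a single record, which is why the identifier has to be unique across all models.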

There are three ways to set the attribute:

  1. Allow Bulkrax to use the default work_identifier of source
  2. Use an existing Hyrax attribute. This can be changed in the local application by setting source_identifier: true in the mappings:
Bulkrax.setup do | config |
  # Use the identifier field (note: identifier must be available on all works, file sets and collections).
  config.field_mappings['Bulkrax::OaiDcParser'] = {
    "identifier" => { from: ["source_identifier"], source_identifier: true }
  }
end
  3. Create a new attribute
  • If there isn't an attribute that's available and unique across all Works, FileSets and Collections, you can make a custom field. An example of how this can be changed in the local application is as follows:
    • Add the field to all models (which should also update your solr documents)
    • Use the field
      config.field_mappings['Bulkrax::CsvParser'] = {
        'bulkrax_identifier' => { from: ['bulkrax_identifier'], source_identifier: true }
      }
      

Source Identifier

Field

In Bulkrax, a field (header) representing the source_identifier must exist on the imported document (csv, xml, etc.). There are two ways to set the field:

  1. Use the default source_identifier field; no configuration necessary.
  2. Configure Bulkrax to use another field
  • This field does not have to exist on the model, but it does have to exist on the imported document. Either way, it will need to be mapped accordingly as the "from" term for your source_identifier mapping.
    Bulkrax.setup do | config |
      # Use original_identifier
      config.field_mappings['Bulkrax::CsvParser'] = {
        'identifier' => { from: ['original_identifier'], source_identifier: true }
      }
    end
    

Value

The value corresponding to the field above must be unique throughout the entire database. You cannot use your separator (e.g. "|" or ";") in this value on the imported document, and it is recommended that the value follow file naming conventions (no spaces, etc.). The value in the field will be stored as the value of the work identifier on the Work, FileSet or Collection. When the import runs, it checks whether a model already exists whose work_identifier value matches the incoming source_identifier; if so, it updates that existing model. If not, a new model is created.

There are two ways to set this value:

  1. It already exists on the imported document. (e.g., you typed it on your csv or it's already in the oai feed)
     field
     'myuniquevalue'
  2. Allow Bulkrax to create it
  • Update config/initializers/bulkrax.rb

      # You can use any available arguments, not just 'obj' and 'index'
      config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
    
    • NOTE: Site.instance.account.name is only used for Hyku and should be removed for Hyrax setups.
  • Use the field in your code

    Bulkrax.fill_in_blank_source_identifiers.call(obj, index)
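
As a rough illustration of how such a lambda behaves, the sketch below uses the Hyrax-style variant (without the Site.instance.account.name prefix) and calls it only for rows that have no identifier. The `Importer` and `Entry` structs and the `rows` data are hypothetical stand-ins so the example runs outside Rails.

```ruby
# Hypothetical stand-ins for an entry and its importer.
Importer = Struct.new(:id)
Entry = Struct.new(:importerexporter)

# Hyrax-style variant of the lambda: importer id plus row index.
fill_in_blank_source_identifiers = ->(obj, index) { "#{obj.importerexporter.id}-#{index}" }

entry = Entry.new(Importer.new(7))
rows = [
  { "title" => "Work One" },
  { "title" => "Work Two", "source_identifier" => "my-id" }
]

identifiers = rows.each_with_index.map do |row, index|
  value = row["source_identifier"]
  # Generate a value only when the source identifier is blank.
  if value.nil? || value.strip.empty?
    fill_in_blank_source_identifiers.call(entry, index)
  else
    value
  end
end

puts identifiers.inspect
```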
    

Generated Metadata

  • Allow Bulkrax to export generated metadata
    • On the Bulkrax exporter page, if include generated_metadata is selected, the export includes any field mapping that has the generated: true flag set. Although configurable, the idea of this option is to export fields that are not set by a user but are instead generated and set by the system.

Example:

'date_uploaded' => { from: ['date_uploaded'], generated: true }
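
A small sketch of how the flag could drive field selection, assuming that fields flagged generated: true are left out unless the option is selected. The `exportable_fields` method and the 'depositor' mapping are hypothetical and only serve the illustration.

```ruby
# Illustrative mappings; only 'date_uploaded' comes from the example above.
field_mappings = {
  'title' => { from: ['title'] },
  'date_uploaded' => { from: ['date_uploaded'], generated: true },
  'depositor' => { from: ['depositor'], generated: true }
}

# Return the field names to export, skipping generated fields unless requested.
def exportable_fields(mappings, include_generated:)
  mappings.reject { |_field, options| options[:generated] && !include_generated }.keys
end

puts exportable_fields(field_mappings, include_generated: false).inspect
puts exportable_fields(field_mappings, include_generated: true).inspect
```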

Default work type

Specifies the default work type to use when none is specified in the import data or mappings.

Bulkrax.setup do | config |
  # Supply the work type as a string (ie. wrapped in single or double quotes). The Image work type must exist.
  config.default_work_type = 'Image'
end
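
The fallback can be pictured with a few lines of plain Ruby. This is a sketch under assumptions: the 'model' column name and the `work_type_for` method are hypothetical, chosen only to show a default being applied when the row carries no work type.

```ruby
default_work_type = 'Image'

# 'model' is a hypothetical column name for the work type in the import data.
def work_type_for(row, default_work_type)
  requested = row['model'].to_s.strip
  requested.empty? ? default_work_type : requested
end

puts work_type_for({ 'title' => 'Work One' }, default_work_type)
puts work_type_for({ 'title' => 'Work Two', 'model' => 'GenericWork' }, default_work_type)
```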

Reserved properties

A list of properties that should be considered 'reserved' and will not be overwritten with import data.

Bulkrax.setup do | config |
  # Add a local reserved property; use strings
  config.reserved_properties += ['person_identifier']
end
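
To illustrate the effect, the sketch below filters incoming data against a reserved list. The starting list and the `incoming` hash are hypothetical; the real defaults can be inspected with Bulkrax.reserved_properties in the rails console.

```ruby
# Hypothetical starting list; not the actual Bulkrax defaults.
reserved_properties = ['id', 'create_date', 'modified_date']
reserved_properties += ['person_identifier']

incoming = { 'title' => 'Work One', 'id' => 'abc123', 'person_identifier' => 'p-42' }

# Keep only the values that are allowed to be written to the model.
importable = incoming.reject { |property, _value| reserved_properties.include?(property) }

puts importable.inspect
```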

Collection field mapping

Creating Collections using the collection_field_mapping is no longer supported as of Bulkrax version 3.0. Please configure Bulkrax to use related_parents_field_mapping and related_children_field_mapping instead.

The field in the incoming import data to use to identify a collection. This is set per Entry. By default it is configured for the Bulkrax::CsvEntry only and is set to look for a column called 'collection'. It is NOT used by the OAI entries and so does not need setting for those.

Bulkrax.setup do | config |
  # Change the collection_field_mapping to use a column called 'primary_collection'
  config.collection_field_mapping = {
    'Bulkrax::CsvEntry' => 'primary_collection'
  } 
end

Parent-child relationship field mappings

Version 2.0.0 and later

Mappings: related_children_field_mapping, related_parents_field_mapping

The fields in the incoming import data used to identify a parent-child relationship. These are set per Entry. Both mappings accept IDs as well as Bulkrax source_identifiers.

Similarly to source_identifier, these mappings are declared on one of the field mappings:

Bulkrax.setup do |config|
  config.field_mappings = {
    'Bulkrax::CsvParser' => {
      'parents' => { from: ['parents'], related_parents_field_mapping: true },
      'children' => { from: ['children'], related_children_field_mapping: true },
    }
  }
end

related_children_field_mapping

By default, the related_children_field_mapping is not configured.

Examples (CSV):

In these examples, the related_children_field_mapping is configured to use the children column. Work One will become a child of Work Two

Using source_identifier

source_identifier  title     children
imported_work_1    Work One
imported_work_2    Work Two  imported_work_1

Using id

id      title     children
abc123  Work One
def456  Work Two  abc123
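
The way a children column turns into relationships can be sketched as follows. This is not Bulkrax's code: rows are plain hashes rather than parsed CSV, and splitting multiple children on "|" is an assumption made for the illustration.

```ruby
# Rows mirroring the source_identifier example above.
rows = [
  { "source_identifier" => "imported_work_1", "title" => "Work One", "children" => nil },
  { "source_identifier" => "imported_work_2", "title" => "Work Two", "children" => "imported_work_1" }
]

# Lookup table from identifier to title, used only for the printed summary.
titles = rows.to_h { |row| [row["source_identifier"], row["title"]] }

# Each identifier listed in a row's children column becomes a child of that row.
relationships = rows.flat_map do |row|
  row["children"].to_s.split("|").map(&:strip).reject(&:empty?).map do |child_id|
    { parent: row["source_identifier"], child: child_id }
  end
end

relationships.each do |rel|
  puts "#{titles[rel[:child]]} is a child of #{titles[rel[:parent]]}"
end
```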

related_parents_field_mapping

By default, the related_parents_field_mapping is not configured.

Example (CSV):

In this example, the related_parents_field_mapping is configured to use the parents column. Work One will become a parent of Work Two

Using source_identifier

source_identifier  title     parents
imported_work_1    Work One
imported_work_2    Work Two  imported_work_1

Using id

id      title     parents
abc123  Work One
def456  Work Two  abc123

Old Documentation (pre-2.0)

This configuration option has been replaced with related_parents_field_mapping in newer versions of Bulkrax.

Parent child field mapping

The field in the incoming import data to use to identify a parent-child relationship. This is set per Entry. By default it is not configured at all. Configuring this will use the identifier found in the given field (eg. a column in a CSV file) to look for an existing Work or Collection resource and add the current record as a child of that resource.

Bulkrax.setup do | config |
  # Use a column called 'children' 
  config.child_field_mapping = {
    'Bulkrax::CsvEntry' => 'children'
  } 
end

In the example below, imported_work_1 will become a child of imported_work_2:

source_identifier  title     children
imported_work_1    Work One
imported_work_2    Work Two  imported_work_1

Field mappings

Field mappings are used to set up mappings from the import (the source / from) data to the repository (the destination). Field mappings are set on a per parser basis. The following shows the default mapping for the OaiDcParser:

config.field_mappings['Bulkrax::OaiDcParser'] = {
  "contributor" => { from: ["contributor"] },
  "creator" => { from: ["creator"], join: true },
  "date_created" => { from: ["date"] },
  "description" => { from: ["description"] },
  "identifier" => { from: ["identifier"] },
  "language" => { from: ["language"], parsed: true },
  "publisher" => { from: ["publisher"] },
  "related_url" => { from: ["relation"] },
  "rights_statement" => { from: ["rights"] },
  "license" => { from: ["license"], split: '\|' }, # some characters may need to be escaped
  "source" => { from: ["source"] },
  "subject" => { from: ["subject"], parsed: true },
  "title" => { from: ["title"] },
  "resource_type" => { from: ["type"], parsed: true },
  "remote_files" => { from: ["thumbnail_url"], parsed: true }
}

Each Parser is represented by a data hash containing the destination field as the key, and a data hash as the value. The following keys may be used in the data hash:

  • from: - supply an Array of source data field names to map to the given key
  • parsed: - if set to true, use the corresponding parse method, found in "app/matchers/bulkrax/application_matcher.rb", on the given data
  • split: - if set to true, split on semi-colon (;) OR pipe (|). Otherwise, supply a string containing a split character, or a regex for more complex patterns.
    • semicolons are valid characters in a URL. When supplying multiple URLs for this key, use the pipe as the separator in the csv.
    • for ease of use, the same "split" value should be used for all properties
  • if: - advanced use only, supply an Array containing two items - a method name (a string) in the first position and a regexp (as a string) in the second position. A common use case for this key is extracting only URLs from a field with if: ['match?', /http(s{0,1}):\/\//].
  • excluded: - if set to true, this field will not be processed; if omitted, the field will be processed if it matches a field in the destination
  • join: - on export, multi valued properties will be separated into numerated column headers unless join: true is in the field mapping
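
The split: and if: options described above can be approximated in plain Ruby. The sketch below is an illustration, not Bulkrax's matcher code: `split_value` and `keep?` are hypothetical helper names, and string values for split: are treated as regex patterns, which is an assumption based on the escaped '\|' in the default mapping.

```ruby
# Split a raw value according to the split: option.
def split_value(value, split)
  case split
  when true
    value.split(/\s*[;|]\s*/)        # default: semicolon or pipe
  when String, Regexp
    value.split(Regexp.new(split))   # custom separator or pattern
  else
    [value]                          # no splitting requested
  end
end

# Apply the if: option, an Array of [method name, regexp].
def keep?(value, condition)
  return true if condition.nil?

  method_name, pattern = condition
  value.public_send(method_name, Regexp.new(pattern))
end

url_only = ['match?', /http(s{0,1}):\/\//]

puts split_value("a; b|c", true).inspect
puts split_value("personA--personB", "--").inspect
puts split_value("x|y", '\|').inspect
puts keep?("https://example.com/a.jpg", url_only)
puts keep?("not a url", url_only)
```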

Worked Example:

  # Let's change the mapping for the contributor field on OaiDcParser
  #   supply 'Bulkrax::OaiDcParser' as the first key, and ['contributor'] as the second
  config.field_mappings['Bulkrax::OaiDcParser']['contributor'] = {
    # map data from the publisher and contributor fields in the OAI record to 'contributor' in Hyrax
    from: ['publisher', 'contributor'],
    # run the parse_contributor method on the data (this method MUST exist in the Bulkrax::ApplicationMatcher or      
    #   Bulkrax::OaiMatcher classes) - HINT: this method doesn't currently exist so would need adding locally.
    parsed: true,
    # split the data on the '--' separator (so 'personA--personB' would create two contributors - personA and 
    #   personB)
    split: '--'
  }

  # Let's exclude the publisher field now the data is going into contributor
  config.field_mappings['Bulkrax::OaiDcParser']['publisher'] = { excluded: true }

Mapping from objects

  • If the models in your app allow for objects to be stored, the object property must be set in the field mapping
  • The from field should not be numerated, even if the header is. The code will handle numerated and non numerated column headers.

Sample CSV:

creator_first_name_1  creator_last_name_1  creator_position_1  creator_first_name_2
Aaliyah               Haughton             Queen               Ruth

# Example of the field mapping
config.field_mappings['Bulkrax::YOUR-PARSER'] = {
  ...,
  'creator_first_name' => { from: ['creator_first_name'], object: 'creator' },
  'creator_last_name' => { from: ['creator_last_name'], object: 'creator' },
  'position' => { from: ['creator_position'], object: 'creator', nested_type: 'Array' },
}

# Example of the model after the import
Model = {
  id: 1234,
  creator: [
    {
      creator_first_name: 'Aaliyah',
      creator_last_name: 'Haughton',
      position: ['Queen']
    }, {
      creator_first_name: 'Ruth'
    }
  ]
}

Data hash:

  • from: - supply an Array of source data field names to map from the csv to the given key
  • object: - if the "from" value is a property on an object, supply the name of that object. The parser can handle two object situations:
    • The key is prefixed with the name of the object, e.g. "creator_first_name" => { from: ["creator_first_name"], object: "creator" }
    • The key is not prefixed with the name of the object, e.g. "first_name" => { from: ["creator_first_name"], object: "creator" }
  • nested_type: - the data type of the "value" in the property that maps to the "from" key, if it's not a string (which is the default)
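
How numerated headers could be grouped into objects is sketched below in plain Ruby, using the sample CSV row and mappings from this section. The `build_objects` method is hypothetical and simplified; it is not Bulkrax's parser, and it assumes the numeric suffix on a header selects the position of the object in the array.

```ruby
row = {
  'creator_first_name_1' => 'Aaliyah',
  'creator_last_name_1' => 'Haughton',
  'creator_position_1' => 'Queen',
  'creator_first_name_2' => 'Ruth'
}

mappings = {
  'creator_first_name' => { from: ['creator_first_name'], object: 'creator' },
  'creator_last_name' => { from: ['creator_last_name'], object: 'creator' },
  'position' => { from: ['creator_position'], object: 'creator', nested_type: 'Array' }
}

def build_objects(row, mappings)
  result = Hash.new { |hash, key| hash[key] = [] }
  row.each do |header, value|
    # Separate an optional numeric suffix ("_1") from the base header name.
    base, number = header.match(/\A(.+?)(?:_(\d+))?\z/).captures
    key, options = mappings.find { |_key, opts| opts[:from].include?(base) }
    next unless key && options[:object]

    # Headers without a number go into the first object.
    index = number ? number.to_i - 1 : 0
    object = (result[options[:object]][index] ||= {})
    object[key.to_sym] = options[:nested_type] == 'Array' ? [value] : value
  end
  result
end

puts build_objects(row, mappings).inspect
```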

Default field mapping

The default_field_mapping is used in the absence of a configured field_mapping. This configuration is a lambda that returns the following mapping when given a field. You are unlikely to need to override this configuration.

  {
    field =>
    {
      from: [field],
      split: false,
      parsed: false,
      if: nil,
      excluded: false
    }
  }
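
As a sketch of how a lambda of this shape might be used as a fallback, the example below builds the mapping shown above and consults it only when no configured mapping exists. The `configured` hash and the `mapping_for` method are hypothetical names for the illustration.

```ruby
# A lambda returning the default mapping for a single field.
default_field_mapping = lambda do |field|
  {
    field.to_s => {
      from: [field.to_s],
      split: false,
      parsed: false,
      if: nil,
      excluded: false
    }
  }
end

# Hypothetical configured mappings; 'title' has one, 'subject' does not.
configured = { 'title' => { from: ['dc_title'] } }

# Use the configured mapping when present, otherwise fall back to the default.
def mapping_for(field, configured, fallback)
  configured.key?(field) ? { field => configured[field] } : fallback.call(field)
end

puts mapping_for('title', configured, default_field_mapping).inspect
puts mapping_for('subject', configured, default_field_mapping).inspect
```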

Import path

The directory to write imports to, prior to import. The default is 'tmp/imports'.

Export path

The directory to write exports to, prior to download. The default is 'tmp/exports'.

Server name

The server name is sent with OAI requests. By default it is [email protected].