Investigate using OpenRefine as part of the migration process #898

Open
dannylamb opened this issue Aug 20, 2018 · 22 comments
Assignees
Labels
Subject: Metadata related to metadata issues. Consider also using the search tag. Subject: Migration Concerning migration from Islandora 7 to Islandora 2.x.x

Comments

@dannylamb
Contributor

Many interested parties have mentioned using http://openrefine.org/ to clean up metadata before migrating to CLAW. We need users to investigate how it can be utilized to find external URIs for existing authorities and how it can clean up our MODS for us.

We also have to find out how to interact with it and where in the migration process it belongs. If we can call out to it over HTTP via an API, we may be able to integrate it using Drupal's migration framework. If not, it will have to be a step done before migration, while the data is still in 7.x or while working on an export, etc.

Download and install OpenRefine, then try it out and let us know what you think! More than one person can tackle this. No need for one person to hog all the metadata glory.
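
As a rough illustration of the "call out to it over HTTP" question, here is a minimal Python sketch that asks a locally running OpenRefine for its project list. It assumes OpenRefine is listening on its default address (`127.0.0.1:3333`) and that the `get-all-project-metadata` command is available in the installed version; the helper names are made up for this example and are not part of any Islandora or Drupal code.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: a local OpenRefine instance on its default port.
OPENREFINE_URL = "http://127.0.0.1:3333/"


def metadata_url(base_url=OPENREFINE_URL):
    """Build the URL of the command that lists all projects."""
    return urljoin(base_url, "command/core/get-all-project-metadata")


def project_names(payload):
    """Extract {project_id: name} from the JSON body returned by that command."""
    projects = json.loads(payload).get("projects", {})
    return {pid: meta.get("name") for pid, meta in projects.items()}


def list_projects(base_url=OPENREFINE_URL):
    """Fetch and parse the project list (needs a running OpenRefine)."""
    with urlopen(metadata_url(base_url)) as response:
        return project_names(response.read().decode("utf-8"))
```

If `list_projects()` returns a dictionary, the instance is reachable over HTTP. Whether that is enough to drive it from a migration plugin is a separate question.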

@ajs6f

ajs6f commented Aug 20, 2018

@dhlamb I don't see how it could be fully automated into a migration workflow. If nothing else, any actual mapping from "strings to things" is going to have a decent number of mistakes, and those will require human intervention. Or are you thinking of breaking up the automation to insert OpenRefine? In that case maybe a PHP-side client would be the bridge.

@dannylamb
Contributor Author

I'm like 99% sure this'll have to be something done before kicking off Drupal migrate, and something that you'd manually take to the extent you feel comfortable with, because you can do a lot with it. I don't think it's applicable to every migration, but enough folks have mentioned it that we should find a way to squeeze it into the workflow for those who want to use it. Even if that means "Just run OpenRefine first" and we can provide some guidance. If we're lucky, maybe we make a plugin that uses https://github.com/keboola/openrefine-php-client that people can turn on in their yml if they want it. But full automation is doubtful because everybody's repository is different.

@ajs6f

ajs6f commented Aug 20, 2018

Ok, cool, that's a much less ambitious / more practical approach than I thought was intended.

@exsilica

Assigning myself @exsilica - lacking permissions for this repo

@carakey

carakey commented Aug 20, 2018

Assigning myself as well - @carakey

@DigitLib

Maybe this tool? https://github.com/LibreCat/Catmandu I had a problem exporting MODS to RDF.

@carakey

carakey commented Aug 20, 2018

Under development at LSU for converting from XML to CSV: https://github.com/lsulibraries/xml2csv

@mbolam

mbolam commented Aug 20, 2018

I'm happy to jump in on this one, too -- but don't seem to have the permissions -- @mbolam

@rtilla1

rtilla1 commented Aug 20, 2018

@rtilla1

@amcshane

Happy to jump on this @amcshane --- also happy to help compile the notes once folks have a chance to explore.

@dannylamb
Contributor Author

@carakey @exsilica I've sent you invites to the Islandora-CLAW organization and can assign you this ticket once you accept.

Thanks for signing on everyone!

@mbolam

mbolam commented Aug 20, 2018

Regarding reconciliation using "Conciliator" -- https://github.com/codeforkjeff/conciliator. My troubles turned out to be related to Java versions: my Mac was pointing at an outdated version. Tested with the latest version of Java and it is working on my desktop. No need for developer support, assuming people can get Java 1.8 working on their device.
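
For anyone curious what OpenRefine sends to a reconciliation service such as Conciliator, the sketch below builds a batch of queries in the general shape used by the Reconciliation Service API (a JSON object keyed by query id, sent as a `queries` form field). The function names are invented for this example, and the endpoint mentioned in the comment is an assumption about a default local Conciliator setup, not something confirmed in this thread.

```python
import json
from urllib.parse import urlencode

# Assumption: a local Conciliator instance exposing a VIAF endpoint at
# http://localhost:8080/reconcile/viaf. Adjust to match your own setup.


def build_queries(names, limit=3):
    """Map each name string to a reconciliation query keyed q0, q1, ..."""
    return {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }


def encode_form(names, limit=3):
    """Encode the batch as the form body a reconciliation client would POST."""
    return urlencode({"queries": json.dumps(build_queries(names, limit))})
```

The response comes back keyed by the same ids, each with a list of candidate matches, which is where the human review of "strings to things" still has to happen.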

@amcshane

Are there any particular authority files folks want to make sure work? I've got a bunch of MARC that I can offer if anybody needs a bit of a mess to play with. It will not migrate prettily, I promise.

@rtilla1

rtilla1 commented Aug 22, 2018

@carakey and I were able to get xml2csv running on 15 of the sample MODS Islandora 7.x users have provided to MIG. Here's the branch with those resulting files: https://github.com/rtilla1/xml2csv . Point of interest: 15 well-formed MODS files have together 285 unique "fields" (every xpath that points to contents and which has a different combination of elements or attributes). There are holes in how these are being counted, but it's an interesting starting point.
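
To make the "unique fields" count concrete, here is a minimal Python sketch of one way to enumerate them. This is not how xml2csv does it; it is an illustration that assumes a "field" is any element path that leads to text content, with each element's attributes (names and values) counted as part of the path.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def step(element):
    """One path step: element name plus its attributes, sorted for stability."""
    attrs = "".join(
        f'[@{local_name(key)}="{value}"]'
        for key, value in sorted(element.attrib.items())
    )
    return local_name(element.tag) + attrs


def content_paths(xml_text):
    """Return the set of paths that lead to non-empty text content."""
    found = set()

    def walk(element, prefix):
        path = f"{prefix}/{step(element)}"
        if element.text and element.text.strip():
            found.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return found
```

Taking the union of `content_paths()` over a set of MODS files gives a count comparable in spirit to the one above, though the exact number depends on how attributes and empty elements are treated.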

@amcshane

@rtilla1 -- once you import the xml as a csv and complete the clean-up work, do you export as a csv and use another tool to recreate the MODS or are you using the Templating export feature in OpenRefine?

i.e. https://gist.github.com/sallain/7604ffb0c155294fcfaf

@carakey

carakey commented Aug 29, 2018

The xml2csv project at https://github.com/lsulibraries/xml2csv has been updated to use the current mapping spreadsheet from the MIG - i.e., only the mapped xpaths are included in the csv output. The latest incorporates most of the action items from the 8/28 call.

Example of output: 15 MODS files as CSV in Google sheets

If anyone takes it for a spin, I'd love feedback.

@amcshane

@carakey I'll give it a shot tomorrow morning!

@rtilla1

rtilla1 commented Aug 30, 2018

Using OpenRefine as part of the migration process requires a number of steps. Some are complete; others still have work left to do.

  1. Export data from 7.x with CRUD as MODS
  2. Transform MODS with xml2csv tool into very specific columns with well-documented delimiters between compound or complex contents ##()
  3. Transform resulting tsv into reconcilable data with OpenRefine
  4. Reconcile each column as appropriate (subjects against LCSH, MeSH, AAT, names against LCSH, VIAF, and WikiData, etc.), sometimes grabbing additional data about the term (such as personal or corporate name)
  5. Transform reconciled data into columns so that a string is available for each taxonomy term, along with optional namespace:code or namespace:term data, and other information about each term
  6. Export each type of vocabulary into its own CSV
  7. Re-transform the data (possibly removing the reconciliation data) so that the bibliographic records can be exported, making sure to carry over the PID from the CRUD export in Step 1.
  8. Import the vocabulary/taxonomy terms.
  9. Import the bibliographic records, re-matching them with the appropriate object.

Steps 2 (#913), 3 (#914), 4, 5, and 7 need work. Steps 1, 6, 8, and 9 are theoretically ready to go and have been tested with other applications or data.
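
As a small illustration of steps 5 and 6, the Python sketch below splits compound cells into individual terms and writes each vocabulary's distinct terms as its own CSV. The column names (`PID`, `subject`) and the `|` delimiter are placeholders chosen for this example; the real delimiters are whatever the xml2csv output documents.

```python
import csv
import io


def collect_terms(rows, column, delimiter="|"):
    """Gather the distinct, trimmed terms found in one compound column."""
    terms = set()
    for row in rows:
        for value in (row.get(column) or "").split(delimiter):
            value = value.strip()
            if value:
                terms.add(value)
    return sorted(terms)


def vocabulary_csv(terms, header="term"):
    """Serialize one vocabulary's terms as a single-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    writer.writerows([term] for term in terms)
    return buffer.getvalue()
```

Running `collect_terms()` once per vocabulary column (subjects, names, and so on) yields one term list per CSV, while the original rows, still keyed by PID, stay available for the bibliographic export in step 7.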

@seth-shaw-unlv
Contributor

So, this might be crazy talk, but could we get OpenRefine to export these records back out as MODS records in a single modsCollection, and Agents as MADS XML? Then we won't have to deal with the nested name delimiters we've been talking about in Zoom meetings. I can migrate XML documents just as easily as CSV (and sometimes more easily).

@mbolam

mbolam commented Aug 30, 2018

@seth-shaw-unlv -- One could potentially do templating in OpenRefine to export as MODS and/or MADS.

http://digitalscholarship.utsc.utoronto.ca/content/blogs/converting-spreadsheets-modsxml-using-open-refine

I've played around with templating, but not used it extensively. It probably wouldn't be too tough to come up with a "basic version" that at least handles the core elements we've been considering for the sprint.
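
To show the shape of the output such a "basic version" would aim for, here is a Python stand-in (not an OpenRefine template) that wraps cleaned rows in a single `modsCollection`. The `PID` and `title` column names are assumptions, as is the choice to carry the PID in a local `identifier` element.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def rows_to_mods_collection(rows):
    """Wrap each row in a <mods> record inside one <modsCollection>."""
    ET.register_namespace("mods", MODS_NS)
    collection = ET.Element(f"{{{MODS_NS}}}modsCollection")
    for row in rows:
        record = ET.SubElement(collection, f"{{{MODS_NS}}}mods")
        # Assumption: the PID from the original export is kept as a local identifier.
        identifier = ET.SubElement(
            record, f"{{{MODS_NS}}}identifier", {"type": "local"}
        )
        identifier.text = row["PID"]
        title_info = ET.SubElement(record, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = row["title"]
    return ET.tostring(collection, encoding="unicode")
```

An OpenRefine template would produce the same structure with a prefix, a per-row template, and a suffix; the sketch only covers two core elements and would need extending for names, subjects, and the rest.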

@carakey

carakey commented Aug 30, 2018 via email

@amcshane

Were folks actually able to perform reconciliation with the provided test MODS? The version I'm seeing retains data about MARC subfields in the text, making reconciliation against LOC (for example) impossible.

My assumption was that each subfield needed its own column, as well -- not unlike creating MARC records from delimited text files.

@kstapelfeldt kstapelfeldt added the Subject: Metadata related to metadata issues. Consider also using the search tag. label Sep 25, 2021
10 participants