
Use D8's Migrate API to batch ingest content? #452

Closed
mjordan opened this issue Dec 11, 2016 · 30 comments
@mjordan (Contributor) commented Dec 11, 2016

Not sure if this has come up yet, but would it make sense to use D8's Migrate API as the basis for batch ingest tools? Implementations are already starting to appear.

@mjordan mjordan changed the title Use the D8's Migrate API to batch ingest content? Use D8's Migrate API to batch ingest content? Dec 11, 2016
@mjordan (Contributor, Author) commented Dec 11, 2016

With some additional discussion about the Migrate Source CSV module here.

@ruebot (Member) commented Dec 11, 2016

Good question, but I don't think this is in scope for MVP. Back burner @dannylamb?

@mjordan (Contributor, Author) commented Dec 12, 2016

That's fine; I added this as a placeholder for discussion more than anything else.

@dannylamb (Contributor)

Oh, I've been purposefully silent on batch loading. Since we will have a REST API, it really opens up the options you can use, so I'm not keen to tie anyone to anything. But if you don't mind doing the work on the web server, the D8 Migrate API is a great option. ETL via a plugin-based architecture? That's a solid improvement over D7.

@mjordan (Contributor, Author) commented Dec 12, 2016

@dannylamb thanks. I take it that the alternative is to ingest content directly into F4 via its REST API. Let me mull over the pros/cons of each approach and in the new year I'll post something here.

@MarcusBarnes commented Dec 12, 2016

@mjordan I'd be interested in mulling over the pros and cons of the various approaches with you in the new year, since this is of interest to me.

@mjordan (Contributor, Author) commented Dec 12, 2016

@MarcusBarnes yeah, when I looked up "ETL" it struck me that's exactly what MIK does. Check out https://api.drupal.org/api/drupal/core!modules!migrate!migrate.api.php/group/migration/8.2.x.

@dannylamb (Contributor)

@mjordan I would load content directly through Islandora's API, which we will eventually have. In 7.x-1.x the REST API is read-only, but you will be able to POST content in 8.x. At that point you can use anything to migrate content.

Personally, I'd load up an ActiveMQ queue with my content and churn on that. You'd get some nice durability and scalability that way. But if someone wants to just do up a bunch of Perl scripts... who am I to judge?
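For readers following along, here is a minimal sketch of what creating objects through the 7.x islandora_rest module might look like. The endpoint paths follow that module's general conventions but should be verified against its README; the base URL, PIDs, and payload field names are illustrative assumptions, and the sketch only builds the requests rather than sending them:

```python
# Hypothetical sketch of batch loading via the 7.x islandora_rest module.
# Assumed endpoints: POST /islandora/rest/v1/object to create an object,
# POST /islandora/rest/v1/object/{pid}/datastream to attach a datastream.
# Base URL and payload keys are placeholders; verify against the module.

BASE = "http://localhost/islandora/rest/v1"

def create_object_request(pid, label):
    """Build the URL and form payload for creating an object."""
    return BASE + "/object", {"pid": pid, "label": label}

def add_datastream_request(pid, dsid, file_path):
    """Build the URL and form payload for attaching a datastream."""
    return BASE + "/object/" + pid + "/datastream", {"dsid": dsid, "file": file_path}

url, payload = create_object_request("doitest:42", "Sample object")
print(url, payload)
```

A batch loader would loop over a manifest (CSV, queue messages, etc.), issuing one object request and one datastream request per item with any HTTP client.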

@mjordan (Contributor, Author) commented Dec 12, 2016

Sweet, thanks for the nudge in the right direction. Is that the Islandora REST API spec'ed out roughly at the very end of https://github.com/Islandora-CLAW/CLAW/blob/master/docs/mvp/mvp_doc.md?

Unimportant, but if you're referring to https://github.com/discoverygarden/islandora_rest as the 7.x-1.x REST API, you can POST and PUT with it as well as GET.

@dannylamb (Contributor)

Yes, also kept purposefully quite vague. :P

Huh... I was always under the impression it was read-only. I wonder why it hasn't gotten the uptake it deserves, then.

@mjordan (Contributor, Author) commented Dec 12, 2016

Dunno... I have some plans for using it (and have used it already in some internal housekeeping scripts), but it kinda dropped off my horizon in the last little while. Will return to it soon. REST APIs FTW!

@seth-shaw-unlv (Contributor)

BTW, after the CLAW call last week I cleaned up my proof-of-concept that uses the Migrate API to load master TIFFs, their descriptive metadata, and link some authority records. I mention it here as a point of reference: CLAW Migrate Files Proof-of-Concept.

@dannylamb (Contributor)

@seth-shaw-unlv That's more full featured than what I was working on 👍

@mjordan (Contributor, Author) commented Apr 10, 2018

@seth-shaw-unlv Sweet. I'm going to be on a plane for a few hours tomorrow so might hack out a Move to Islandora Kit toolchain to fetch images from an existing Islandora (in my case, a vagrant running on my laptop) and dump them out in an arrangement like the one in your repo. That would be one example of a 7.x -> CLAW migration path.

@seth-shaw-unlv (Contributor) commented Apr 10, 2018

@mjordan I could probably just update mine to match your sample CSV; it looks simple enough, and the update should only take a few minutes tomorrow morning.

Edit: done. It took longer than I expected because I bumped into an issue with migrate_plus' entity_lookup plugin. While my previous example had lookups for each content type based on the CSV column, having a single column look across multiple content types wasn't possible in a single pass without a patch. (Multiple passes work, but subjects not already in Drupal would get dropped.)

@mjordan (Contributor, Author) commented Apr 12, 2018

@seth-shaw-unlv that's awesome. I've been working (it also took me longer than I expected) on an end-to-end MIK toolchain that will generate a CSV like the one you have from an Islandora's OAI endpoint. Here's what the output looks like, harvested from a collection on my 7.x vagrant:

/tmp/oai_to_csv_output/
├── metadata.csv
├── mik.log
├── oai_drupal-site.org_doitest_12.png
├── oai_drupal-site.org_doitest_16.png
├── oai_drupal-site.org_doitest_4.jpeg
├── oai_drupal-site.org_doitest_5.jpeg
├── oai_drupal-site.org_doitest_6.png
└── problem_records.log

with the CSV file looking like this (so far, still needs some work):

"autogen 6 - blurg",StillImage,"nonprojected graphic",doitest:16,"This record was harvested on a Thursday."
"Church Holy Rosary, Vancouver B.C.",Churches,"Holy Rosary Church in Vancouver, B.C.",,1911,image,doitest:4,eng,"Vancouver, BC"
"Second test object.","Vanity Press","Jordan, M. (author)",(editor),2015-01-01,Text,doitest:3,,"This record was harvested on a Thursday."
"Has DOI?","Vanity Press",PhysicalObject,globe,doitest:6,"This record was harvested on a Thursday."
"autogen 6",StillImage,"nonprojected graphic",doitest:12,"This record was harvested on a Thursday."

My goal with this is to allow someone to run MIK against their 7.x repository and get the input for a Migrate Plus ingest like the one you've created.
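The harvest step described above can be sketched roughly as follows. This is an illustration of the OAI-PMH mechanics, not MIK itself (MIK is PHP); the endpoint URL is hypothetical and the inline XML is a trimmed sample response:

```python
# Sketch of an OAI-PMH harvest: build a ListRecords request URL and pull
# record identifiers out of a response. Endpoint and sample are illustrative.
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(endpoint, metadata_prefix="oai_dc", set_spec=None):
    """Assemble a ListRecords request URL, optionally limited to a set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return endpoint + "?" + urlencode(params)

def record_identifiers(response_xml):
    """Extract the OAI identifier from each record header."""
    root = ET.fromstring(response_xml)
    return [h.findtext(OAI_NS + "identifier")
            for h in root.iter(OAI_NS + "header")]

sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:drupal-site.org:doitest_16</identifier></header></record>
    <record><header><identifier>oai:drupal-site.org:doitest_4</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(record_identifiers(sample))
```

A real harvester would also follow OAI-PMH resumption tokens for large result sets and fetch each record's datastreams separately.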

@ajs6f commented Apr 12, 2018

Hey, all-- just a naive question from a naive onlooker, but is the basic idea of this ticket to create some tooling so that people can do migrations by acting against Drupal, instead of acting against the backend?

@mjordan (Contributor, Author) commented Apr 12, 2018

@ajs6f yes, specifically, Drupal 8's Migrate API. A related issue is #819.

@ajs6f commented Apr 12, 2018

Then I have to stick my 2¢ in; that would be AWESOME. Thanks to all of you for working on this!

@mjordan (Contributor, Author) commented Apr 12, 2018

OK, now the CSV file looks like this:

ID,title,identifier,description,format,File
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic",oai_drupal-site.org_doitest_16.png
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C.",oai_drupal-site.org_doitest_4.jpeg
oai%3Adrupal-site.org%3Adoitest_5,"Second test object.",doitest:3,"This record was harvested on a Thursday.",oai_drupal-site.org_doitest_5.jpeg
oai%3Adrupal-site.org%3Adoitest_6,"Has DOI?",doitest:6,"This record was harvested on a Thursday.",globe,oai_drupal-site.org_doitest_6.png
oai%3Adrupal-site.org%3Adoitest_12,"autogen 6",doitest:12,"This record was harvested on a Thursday.","nonprojected graphic",oai_drupal-site.org_doitest_12.png

The specific DC elements that end up in the file are configurable in the MIK .ini file. To parse MODS instead of DC, all we'd need to do is replace one PHP class file. If anyone is interested in seeing an example .ini file, I pasted one into the MIK Github issue linked a couple of comments above.
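A CSV like the one above maps naturally onto a Migrate Plus configuration entity. The fragment below is a hedged sketch, not a tested config: the migration id, file path, field names, and destination bundle are all placeholders, and the source keys (`header_row_count`, `keys`) follow the migrate_source_csv releases of the time (later releases renamed them to `header_offset` and `ids`):

```yaml
# Hypothetical migration config; ids, paths, and field names are placeholders.
id: islandora_oai_csv
label: 'Islandora 7.x OAI harvest'
source:
  plugin: csv
  path: /tmp/oai_to_csv_output/metadata.csv
  header_row_count: 1
  keys:
    - ID
process:
  title: title
  field_identifier: identifier
  field_description: description
destination:
  plugin: entity:node
  default_bundle: islandora_object
```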

@mjordan (Contributor, Author) commented Apr 17, 2018

@seth-shaw-unlv I can demo the MIK harvest part of this in tomorrow's CLAW call. You willing to go over the Migrate import stuff a little?

@seth-shaw-unlv (Contributor)

@mjordan Sure.

@mjordan (Contributor, Author) commented Apr 19, 2018

OK, based on conversation yesterday, we now have a way of outputting one larger XML file to use as Migrate Plus' input, rather than a CSV. A sample file is attached. The file is generated by concatenating all of the harvested MODS or DC XML files into one, wrapping them all in an outer element. The script is here. I've attached a sample output file (renamed to .txt so I can attach it).

metadata.xml.txt
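The concatenation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual script: it uses Python's ElementTree and an inline sample record, and wraps everything in the modsCollection root discussed below:

```python
# Sketch: wrap a set of harvested MODS records in one modsCollection root.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def build_collection(mods_strings):
    """Concatenate serialized MODS records under a single modsCollection."""
    ET.register_namespace("", MODS_NS)  # keep MODS as the default namespace
    collection = ET.Element("{%s}modsCollection" % MODS_NS)
    for s in mods_strings:
        collection.append(ET.fromstring(s))
    return ET.tostring(collection, encoding="unicode")

records = [
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    '<titleInfo><title>Test</title></titleInfo></mods>',
]
print(build_collection(records))
```

In practice the input strings would come from reading each harvested MODS file off disk before appending it to the collection.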

@seth-shaw-unlv (Contributor)

This might be a little nit-picky, but the root element should probably be modsCollection.

@mjordan (Contributor, Author) commented Apr 19, 2018

No problem, it's configurable in the script but I'll change that to the default later today.

@DiegoPino (Contributor)

Folks, what is the rationale for breaking Drupal 8 (memory 😬) by passing it a huge XML file instead of a key/value set like CSV? I have seen Java apps like Oxygen use 16 GB of RAM on 2xxxxx MODS records when they're all together in the same document. Do you think Drupal 8/PHP can handle this, and survive a badly formed XML file, or are you planning on splitting it before reading it into a PHP object? It's just a question before you all go this route, because it could be amazing but also expensive and prone to fail. Maybe we can build a test case, like 100,000 MODS documents? (That would be the median number of objects people have in their Islandora 7.x deployments.)

@seth-shaw-unlv (Contributor) commented Apr 19, 2018

If you use the migrate_plus XML processor, it uses SAX, which is memory-efficient.

Now, if you want to go nuts with the XPath requirements, you would have to use the SimpleXML processor, which will try to load the whole thing into memory and which, I agree, is likely to die on you.

That said, if someone has a giant MODS file I can play with (all my collection records are in CSV), I'm happy to test it. The only XML testing I've done was with a relatively small set of Agent authorities (9,046 entries), which didn't give me any trouble in a default DrupalVM image.

Edit: I didn't answer the initial question: the rationale is simplicity of setup. Skipping the intermediary CSV file saves a step and the single file allows us to provide a single data source in the migrate configuration entity (rather than a list of 100k records). If it doesn't work, then sure, bring on the CSV intermediary.
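The memory argument can be illustrated with any streaming parser. Here is a minimal sketch using Python's `iterparse` (illustrative only; migrate_plus is PHP and uses its own parser): each record is processed and then discarded, so memory stays flat no matter how large the modsCollection is.

```python
# Sketch: stream over a large modsCollection record-by-record instead of
# loading the whole tree into memory.
import io
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"

def count_titles(stream):
    """Count MODS records that carry a title, freeing each one as we go."""
    n = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == MODS + "mods":
            if elem.find(MODS + "titleInfo/" + MODS + "title") is not None:
                n += 1
            elem.clear()  # drop the processed record's subtree from memory
    return n

doc = b"""<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>A</title></titleInfo></mods>
  <mods><titleInfo><title>B</title></titleInfo></mods>
  <mods/>
</modsCollection>"""

print(count_titles(io.BytesIO(doc)))
```

The same pattern scales to files far larger than available RAM, since only the current record's subtree is ever held in memory.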

@DiegoPino (Contributor)

@seth-shaw-unlv, makes sense. I'll ask our folks here to generate a test set and share it as soon as possible. SAX should be able to deal with it if the queries stay simple, but I agree that some XPath use cases can become heavy. Thanks!

@mjordan (Contributor, Author) commented Apr 19, 2018

It might make sense to keep (and document) the new CSV writer component of MIK and provide the script to assemble the one-XML-file-to-rule-them-all. So if people want one or the other they could choose. Or not use MIK at all!

@mjordan (Contributor, Author) commented May 1, 2018

I've written a small script that harvests a collection from 7.x. MIK may be overkill. The script is at https://github.com/mjordan/get_islandora_content. It only supports XML input, not CSV.
