
Jupiter: Back end rake task for batch ingest #762

Closed
weiweishi opened this issue Jun 30, 2018 · 5 comments

weiweishi (Contributor) commented Jun 30, 2018

User Story

As the administrator of ERA, I would like to have a batch ingest function that allows me to ingest a large number of objects into ERA so that I don't need to deposit them manually one by one.

Describe the solution you'd like
I would like to:

  • prepare my metadata in a manifest in the format of a spreadsheet;
  • organize the files for the objects in one directory;
  • run a process to read the manifest and ingest the files into ERA. In the past, this has been run as a rake task on the back end; preferably, it could be kicked off from the front-end UI (a sketch follows this list);
  • get a report at the end of the batch ingest process, with the ID and DOI of each object added to the manifest.
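For illustration, here is a minimal sketch of what such a rake task could look like, assuming a hypothetical `jupiter:batch_ingest` task name, a CSV export of the manifest named `manifest.csv`, and an `Item` model with `save!`-style persistence and an `add_file` helper (all of these names are assumptions, not Jupiter's actual API):

```ruby
# lib/tasks/batch_ingest.rake -- hypothetical sketch, not Jupiter's actual code
require 'csv'

namespace :jupiter do
  desc 'Batch ingest items from a CSV manifest and a directory of files'
  task :batch_ingest, [:batch_directory] => :environment do |_task, args|
    batch_directory = args.batch_directory
    manifest = File.join(batch_directory, 'manifest.csv')

    CSV.foreach(manifest, headers: true) do |row|
      # Column names here are illustrative; the real ones come from the template.
      item = Item.new(title: row['title'], creator: row['creator'])
      item.add_file(File.open(File.join(batch_directory, row['file_name'])))
      item.save!
      puts "Ingested #{item.id}: #{item.title}"
    end
  end
end
```

It could then be run as, for example, `rails jupiter:batch_ingest[/path/to/batch]`, with the manifest and files together in one batch directory.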

Additional context
In the past, a manifest was created based on this template:
https://docs.google.com/spreadsheets/d/12ur24RsvgX8ptaqg-_-poACUUkBmVSAlJA69D14WBNo/edit#gid=0

Current workflow (in old HN):

Requirements

  • For this user story, we will focus on replicating our workflow from old HN. Any improvements to this process will be made at a later stage.
@weiweishi changed the title from "Batch Ingest" to "Jupiter: Batch Ingest" on Jul 9, 2018
weiweishi (Contributor, Author) commented:

Will have a meeting with @mbarnett to flesh out the details.

@weiweishi changed the title from "Jupiter: Batch Ingest" to "Jupiter: Back end rake task for batch ingest" on Jul 24, 2018
@murny self-assigned this on Jul 27, 2018
murny (Contributor) commented Aug 9, 2018

Some assumptions and questions about this from digging into the batch process from HN. We can talk about this in person if you need more info.

Questions:

  1. The CSV above must have been updated? It is missing a bunch of columns: there is no such thing as "noids" anymore (see #3 below), there are references to "File Location" (but there is none in the current CSV; there is a "file_name", which I am going to assume is the same thing), and it looks like the community/collection IDs were changed; before, you would have six columns (collection_noid, collection_noid_2, collection_noid_3, and community_noid, community_noid_2, community_noid_3), etc.
    Yes, the template was since updated so that the students can prepare new batches based on Jupiter's data structure/requirements. The multiple collection_noid/community_noid columns were added based on need, as most batches only need one community/collection pair. This is a batch done in HN with the HN script:
    https://docs.google.com/spreadsheets/d/1caWLskwbaHZNg8ToSkgw80Hjhy3MWIOp8eS7GlAtxU4/edit#gid=0.
  2. Is it okay if we rename columns to something more human readable? I.e., map them to the deposit item form labels, e.g. "is version of" is actually "citations", and expand abbreviations that only save a couple of characters, like "vis_after_embargo" to "visibility_after_embargo"?
    Definitely! We are not in any way attached to the current template, and human-readable columns would probably be more welcome to the ERA support team.
  3. There seem to be two different types of ingest: an "update" and a regular new "ingest". A batch "update" updates existing records by grabbing the items' noids from the CSV file (it does not update files, only metadata). However, there is no "noids" column in the spreadsheet. Is this in scope and does it need to be done here? Or can I assume this isn't something we will support, and that this is only for new records, not updates to existing records?
    I don't think the update function has ever been used, so at this point we can assume it is not something we will support, until the need arises.
  4. We are logging the IO out to a batch log file? I'm assuming this isn't needed? We can just output to standard out and pipe to a log file if you need this functionality (e.g. rails jupiter:batch_ingest |& tee -a batch.log or rails jupiter:batch_ingest &>> batch.log).
    Yes, that would be fine. The only requirement is to generate a report with title/creator name/UUID/DOI; I will specify that in the initial requirements (see the report sketch after this list). DOIs can be tricky, as they depend on an external API call, but they are also deemed the most important information in the batch ingest report.
  5. How do you typically run this rake task? Are the manifest and batchfiles_location usually kept in separate locations? Can we trim down the arguments you need to pass into the rake task? Pass in a "batch directory" or something that contains the manifest file and holds all the files? Similarly, is "investigation_id" needed (for more info about what this is, see #6 below)?
    Anything to simplify/refactor the existing process is great. I'm not attached to how this should be run. Usually the students create a Google spreadsheet as the manifest, and they create a Google drive with PDFs for a given batch. I download them separately and scp them to the app server. I could either ask the students to put the manifest file into the batch Google drive, or do so when I scp. So yes, go for it.
  6. The HN batch ingest creates an "ingest report", which is just a CSV file named using an "investigation_id" (which is what the user passes in when running the job?) containing the ingestion ID, the time the report was generated, and a list of the items created with the following data: URL, citation, title, DOI and noid. Sounds like this can be simplified? From the above, you just want a list of successful IDs and DOIs? Do we need investigation_id, etc.? (We could just derive the file name from a timestamp, which is unique and kind of what this is already doing.)
    Investigation_ID was something Leah initially wanted us to have: an investigation ID indexed in Solr so they can easily identify items ingested in the same batch. I will check in with her to see if this is still a requirement, because I don't think investigation ID is in the data dictionary, and at this point it doesn't serve any apparent purpose.
  7. Is there an example of what a filled-in CSV might look like? I am mostly curious whether I need to map to "linked data" controlled vocabularies. For example, for a column like "resource type" (aka "item type"), will someone fill in "article" here, so that I have to map it to http://purl.org/ontology/bibo/Article in the batch ingest script before we save it in Fedora? Or would people be putting URIs into the CSV, in this case the full http://purl.org/ontology/bibo/Article, right in the "resource type" column?
    A new batch prepared for Jupiter is at: https://docs.google.com/spreadsheets/d/1AlflT-SSNPIoLN-GAjlm37-dCzigHd5Q-8fWGwq0SJ8/edit#gid=0. For controlled vocabs, either way would work. Right now students put in only the string, not the URI, but we could ask them to use URIs instead. Whatever you think makes sense (see the mapping sketch after this list).
  8. To follow up on #7, do I need to do any error handling/checking, such as checking whether the data is actually valid (if we're doing URIs, is this a correct URI in Jupiter for item_types, etc.), or do we just accept whatever is in the CSV? It looks like the old HN batch process didn't do anything along these lines.
    The old script relied on the validation in the data model to throw error messages, and we could potentially do the same for Jupiter, as it has much better validation. If the data is invalid and can't be mapped properly, I'd expect Jupiter's data model to throw an exception; if not, then some validation would be nice.
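Following up on #7 and #8, a minimal sketch of how the string-to-URI mapping could work, assuming a hypothetical RESOURCE_TYPE_URIS lookup table (the real entries would come from Jupiter's controlled vocabularies; resource_type_uri is an illustrative helper, not an existing API):

```ruby
# Hypothetical lookup table; the real entries would come from Jupiter's
# controlled vocabularies, not this hard-coded sample.
RESOURCE_TYPE_URIS = {
  'article' => 'http://purl.org/ontology/bibo/Article',
  'thesis'  => 'http://purl.org/ontology/bibo/Thesis'
}.freeze

# Accept either a bare string ("article") or a full URI from the CSV.
def resource_type_uri(value)
  return value if value.to_s.start_with?('http')

  RESOURCE_TYPE_URIS.fetch(value.to_s.strip.downcase) do
    raise ArgumentError, "unknown resource type: #{value.inspect}"
  end
end

resource_type_uri('article')
# => "http://purl.org/ontology/bibo/Article"
resource_type_uri('http://purl.org/ontology/bibo/Article')
# => passed through unchanged
```

On the validation side, the ingest loop could wrap each save in a rescue for whatever Jupiter's model layer raises on invalid records, so that a bad row is reported and skipped rather than aborting the whole batch.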
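And for the ingest report from #4 and #6, a sketch of writing a timestamped title/creator/UUID/DOI CSV; the ingested array and its field names are stand-ins for whatever the ingest loop actually collects:

```ruby
require 'csv'

# Stand-in data; in practice this would be collected during the ingest loop.
ingested = [
  { title: 'Example item', creator: 'Doe, Jane',
    uuid: '00000000-0000-0000-0000-000000000000', doi: '10.0000/example' }
]

# A timestamped file name is unique per run, which sidesteps investigation_id.
report_path = "batch_ingest_report_#{Time.now.strftime('%Y-%m-%d_%H%M%S')}.csv"

CSV.open(report_path, 'w') do |csv|
  csv << %w[title creator uuid doi]
  ingested.each { |item| csv << item.values_at(:title, :creator, :uuid, :doi) }
end

puts "Report written to #{report_path}"
```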

Assumptions:

  1. We will not be uploading multiple files per item via batch ingest; each item has one and only one file associated with it.
    For now, we can follow this assumption. The need for multiple files will arise in the future, but we can create a separate ticket for that enhancement.
  2. This will only be for Items (not Theses, as Theses have a bunch of different fields and very different required fields, which I am assuming is out of scope; it would probably need another rake task doing something similar).
    Yes, and I will create a separate ticket for that. We already have a script in very rough shape that handled our last thesis migration, but more changes are required in the manifest and in how the files are downloaded, so they are out of scope for this issue.

weiweishi (Contributor, Author) commented:

@murny please see the inline comments on your questions.

murny (Contributor) commented Aug 10, 2018

@weiweishi thanks for the answers/feedback! That helps a lot 👍

murny (Contributor) commented Aug 23, 2018

The updated batch ingest manifest template can be found here if anyone is curious:
https://docs.google.com/spreadsheets/d/178o_-ZEV3Ii-IzJ0AcJQTQqFsH3Ew2WCgyj2aZw99iY
