
Jupiter: Back end rake task for batch ingest #762

Closed
weiweishi opened this issue Jun 30, 2018 · 5 comments

weiweishi (Contributor) commented Jun 30, 2018

User Story

As the administrator of ERA, I would like to have a batch ingest function that allows me to ingest a large number of objects into ERA so that I don't need to deposit them manually one by one.

Describe the solution you'd like
I would like to:

  • prepare my metadata in a manifest in the format of a spreadsheet;
  • organize the files for the objects in one directory;
  • run a process to read the manifest and ingest the files into ERA. In the past, this has been run as a rake task on the back end; preferably, it could be kicked off from the front-end UI (a sketch follows this list);
  • get a report at the end of the batch ingest process, with the ID and DOI of each object added to the manifest.
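For illustration, here is a minimal sketch of what such a rake task could look like, assuming a hypothetical `jupiter:batch_ingest` task name, a CSV export of the manifest named `manifest.csv`, and an `Item` model with `save!`-style persistence and an `add_file` helper (all of these names are assumptions, not Jupiter's actual API):

```ruby
# lib/tasks/batch_ingest.rake -- hypothetical sketch, not Jupiter's actual code
require 'csv'

namespace :jupiter do
  desc 'Batch ingest items from a CSV manifest and a directory of files'
  task :batch_ingest, [:batch_directory] => :environment do |_task, args|
    batch_directory = args.batch_directory
    manifest = File.join(batch_directory, 'manifest.csv')

    CSV.foreach(manifest, headers: true) do |row|
      # Column names here are illustrative; the real ones come from the template.
      item = Item.new(title: row['title'], creator: row['creator'])
      item.add_file(File.open(File.join(batch_directory, row['file_name'])))
      item.save!
      puts "Ingested #{item.id}: #{item.title}"
    end
  end
end
```

It could then be run as, for example, `rails jupiter:batch_ingest[/path/to/batch]`, with the manifest and files together in one batch directory.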

Additional context
In the past, a manifest was created based on this template:
https://docs.google.com/spreadsheets/d/12ur24RsvgX8ptaqg-_-poACUUkBmVSAlJA69D14WBNo/edit#gid=0

Current workflow (in old HN):

Requirements

  • For this user story, we will focus on replicating our workflow from old HN. Any improvements to this process will be made at a later stage.
@weiweishi changed the title from "Batch Ingest" to "Jupiter: Batch Ingest" on Jul 9, 2018
weiweishi (Contributor, Author) commented:

Will have a meeting with @mbarnett to flesh out the details.

@weiweishi changed the title from "Jupiter: Batch Ingest" to "Jupiter: Back end rake task for batch ingest" on Jul 24, 2018
@murny self-assigned this on Jul 27, 2018
murny (Contributor) commented Aug 9, 2018

Some assumptions and questions about this from digging into the batch process from HN. We can talk about this in person if you need more info.

Questions:

  1. The CSV above must have been updated? It is missing a bunch of columns: there is no such thing as "noids" anymore (see #3 below), there are references to "File Location" (but there is none in the current CSV; there is a "file_name", which I am going to assume is the same thing), and it looks like the community/collection IDs were changed; before, you would have six columns (collection_noid, collection_noid_2, collection_noid_3, and community_noid, community_noid_2, community_noid_3), etc.
    Yes, the template was since updated so that the students can prepare new batches based on Jupiter's data structure/requirements. The multiple collection_noid/community_noid columns were added based on need, as most batches only need one community/collection pair. This is a batch done in HN with the HN script:
    https://docs.google.com/spreadsheets/d/1caWLskwbaHZNg8ToSkgw80Hjhy3MWIOp8eS7GlAtxU4/edit#gid=0.
  2. Is it okay if we rename columns to something more human readable? I.e., map them to the deposit item form labels, e.g. "is version of" is actually "citations", and expand abbreviations that only save a couple of characters, like "vis_after_embargo" to "visibility_after_embargo"?
    Definitely! We are not in any way attached to the current template, and human-readable columns would probably be more welcome to the ERA support team.
  3. There seem to be two different types of ingest: an "update" and a regular new "ingest". A batch "update" updates existing records by grabbing the items' noids from the CSV file (it does not update files, only metadata). However, there is no "noids" column in the spreadsheet. Is this in scope and does it need to be done here? Or can I assume this isn't something we will support, and that this is only for new records, not updates to existing records?
    I don't think the update function has ever been used, so at this point we can assume it is not something we will support, until the need arises.
  4. We are logging the IO out to a batch log file? I'm assuming this isn't needed? We can just output to standard out and pipe to a log file if you need this functionality (e.g. rails jupiter:batch_ingest |& tee -a batch.log or rails jupiter:batch_ingest &>> batch.log).
    Yes, that would be fine. The only requirement is to generate a report with title/creator name/UUID/DOI; I will specify that in the initial requirements (see the report sketch after this list). DOIs can be tricky, as they depend on an external API call, but they are also deemed the most important information in the batch ingest report.
  5. How do you typically run this rake task? Are the manifest and batchfiles_location usually kept in separate locations? Can we trim down the arguments you need to pass into the rake task? Pass in a "batch directory" or something that contains the manifest file and holds all the files? Similarly, is "investigation_id" needed (for more info about what this is, see #6 below)?
    Anything to simplify/refactor the existing process is great. I'm not attached to how this should be run. Usually the students create a Google spreadsheet as the manifest, and they create a Google drive with PDFs for a given batch. I download them separately and scp them to the app server. I could either ask the students to put the manifest file into the batch Google drive, or do so when I scp. So yes, go for it.
  6. The HN batch ingest creates an "ingest report", which is just a CSV file named using an "investigation_id" (which is what the user passes in when running the job?) containing the ingestion ID, the time the report was generated, and a list of the items created with the following data: URL, citation, title, DOI and noid. Sounds like this can be simplified? From the above, you just want a list of successful IDs and DOIs? Do we need investigation_id, etc.? (We could just derive the file name from a timestamp, which is unique and kind of what this is already doing.)
    Investigation_ID was something Leah initially wanted us to have: an investigation ID indexed in Solr so they can easily identify items ingested in the same batch. I will check in with her to see if this is still a requirement, because I don't think investigation ID is in the data dictionary, and at this point it doesn't serve any apparent purpose.
  7. Is there an example of what a filled-in CSV might look like? I am mostly curious whether I need to map to "linked data" controlled vocabularies. For example, for a column like "resource type" (aka "item type"), will someone fill in "article" here, so that I have to map it to http://purl.org/ontology/bibo/Article in the batch ingest script before we save it in Fedora? Or would people be putting URIs into the CSV, in this case the full http://purl.org/ontology/bibo/Article, right in the "resource type" column?
    A new batch prepared for Jupiter is at: https://docs.google.com/spreadsheets/d/1AlflT-SSNPIoLN-GAjlm37-dCzigHd5Q-8fWGwq0SJ8/edit#gid=0. For controlled vocabs, either way would work. Right now students put in only the string, not the URI, but we could ask them to use URIs instead. Whatever you think makes sense (see the mapping sketch after this list).
  8. To follow up on #7, do I need to do any error handling/checking, such as checking whether the data is actually valid (if we're doing URIs, is this a correct URI in Jupiter for item_types, etc.), or do we just accept whatever is in the CSV? It looks like the old HN batch process didn't do anything along these lines.
    The old script relied on the validation in the data model to throw error messages, and we could potentially do the same for Jupiter, as it has much better validation. If the data is invalid and can't be mapped properly, I'd expect Jupiter's data model to throw an exception; if not, then some validation would be nice.
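Following up on #7 and #8, a minimal sketch of how the string-to-URI mapping could work, assuming a hypothetical RESOURCE_TYPE_URIS lookup table (the real entries would come from Jupiter's controlled vocabularies; resource_type_uri is an illustrative helper, not an existing API):

```ruby
# Hypothetical lookup table; the real entries would come from Jupiter's
# controlled vocabularies, not this hard-coded sample.
RESOURCE_TYPE_URIS = {
  'article' => 'http://purl.org/ontology/bibo/Article',
  'thesis'  => 'http://purl.org/ontology/bibo/Thesis'
}.freeze

# Accept either a bare string ("article") or a full URI from the CSV.
def resource_type_uri(value)
  return value if value.to_s.start_with?('http')

  RESOURCE_TYPE_URIS.fetch(value.to_s.strip.downcase) do
    raise ArgumentError, "unknown resource type: #{value.inspect}"
  end
end

resource_type_uri('article')
# => "http://purl.org/ontology/bibo/Article"
resource_type_uri('http://purl.org/ontology/bibo/Article')
# => passed through unchanged
```

On the validation side, the ingest loop could wrap each save in a rescue for whatever Jupiter's model layer raises on invalid records, so that a bad row is reported and skipped rather than aborting the whole batch.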
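And for the ingest report from #4 and #6, a sketch of writing a timestamped title/creator/UUID/DOI CSV; the ingested array and its field names are stand-ins for whatever the ingest loop actually collects:

```ruby
require 'csv'

# Stand-in data; in practice this would be collected during the ingest loop.
ingested = [
  { title: 'Example item', creator: 'Doe, Jane',
    uuid: '00000000-0000-0000-0000-000000000000', doi: '10.0000/example' }
]

# A timestamped file name is unique per run, which sidesteps investigation_id.
report_path = "batch_ingest_report_#{Time.now.strftime('%Y-%m-%d_%H%M%S')}.csv"

CSV.open(report_path, 'w') do |csv|
  csv << %w[title creator uuid doi]
  ingested.each { |item| csv << item.values_at(:title, :creator, :uuid, :doi) }
end

puts "Report written to #{report_path}"
```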

Assumptions:

  1. We will not be uploading multiple files per item via batch ingest; each item has one and only one file associated with it.
    For now, we can follow this assumption. The need for multiple files will arise in the future, but we can create a separate ticket for that enhancement.
  2. This will only be for Items (not Theses, as Theses have a bunch of different fields and very different required fields, which I am assuming is out of scope; it would probably need another rake task doing something similar).
    Yes, and I will create a separate ticket for that. We already have a script in very rough shape that handled our last thesis migration, but more changes are required in the manifest and in how the files are downloaded, so they are out of scope for this issue.

weiweishi (Contributor, Author) commented:

@murny please see the inline comments on your questions.

murny (Contributor) commented Aug 10, 2018

@weiweishi thanks for the answers/feedback! That helps a lot 👍

murny (Contributor) commented Aug 23, 2018

The updated batch ingest manifest template can be found here if anyone is curious:
https://docs.google.com/spreadsheets/d/178o_-ZEV3Ii-IzJ0AcJQTQqFsH3Ew2WCgyj2aZw99iY
