Batch Ingest Improvements #1986
I think it would be worth doing some soft investigation of the Google Drive API – what gems are out there, what limits we might run into, what the authentication workflow looks like, etc.? My concern right now with the ZIP or even Avalon approach is that with all of this SET stuff we might very rapidly end up in more of a normal cloud environment, where local storage to give SFTP access to, or even to unpack a ZIP in, might not be feasible. If we can offload the whole temporary-storage-for-the-ingest-queue side of things to something external to the server, it might make our lives easier?
Yeah, makes sense. Thanks for the feedback. Will spend some time looking into the Google Drive API and see what's possible in that world 👍
So as mentioned in chat, the Google Drive API seems to give us a path to accomplish what we want to do. It looks like we can have users log in to their Google Drive accounts, then choose the files they want to upload from their drive. Jupiter will receive this as a list of URLs with their tokens. As an example, the URL will look like:
Possible Workflow:
Work required:
Looks good in general. Let's call email & Theses out of scope for an MVP implementation, just to try to whittle this down a bit (next Thesis ingest isn't for another 6 months). We could focus on either Items or the Digitization pamphlets we're starting with (Folkfest programs) if we want an even simpler model (although the model isn't done yet, so Items may be a faster path). I'll ask Sean to take a look at this and let us know if he sees any major issues or concerns.
Key point will be that whatever we do here, we'll want to make it really easy to extend to new kinds of models so that we can just use this to ingest the digitization material. The other thing that will likely be important is after-ingest reporting, especially around validation failures, that the service people can review – but again, I think that's something we should design with in mind and leave the implementation until after the MVP, as we want this to be leverageable ASAP for digitization needs.
The proposed workflow looks very sensible to me, I think it will work well. I'll just flag some things already mentioned in emails that are worth preserving for future consideration:
File size: I dug into this a bit last week, and we should be good? Organizationally, as far as I know the University has no maximum on how much it can stick into Google Drive in total. Individually, the limit is 750GB uploaded per day, and the maximum individual file size is 5TB, which is enormous. Really, this should give us a nice way around any kind of connection timeout issues, which right now severely limit total upload sizes. Edit: info on Google Drive limits: https://support.google.com/a/answer/172541?hl=en

Running a batch twice should be pretty easy to avoid in the most obvious case of submitting the exact same CSV or similar (I'm thinking we could just hash the CSV or something, @murny, to easily rule out a dupe?). "Accidentally similar" batches that submit the same item again might be a harder thing to catch in general, although if we can identify common identifiers that let us recognize already-ingested material, that would be something we could explore after the initial version?
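As a minimal sketch of the CSV-hashing idea, assuming a hypothetical `BatchIngest` record with a `manifest_sha256` column (neither exists in Jupiter yet), duplicate manifests could be rejected before any work is enqueued:

```ruby
require 'digest'

# Fingerprint the uploaded manifest so an identical CSV can be rejected early.
# `BatchIngest` and `manifest_sha256` are hypothetical names for illustration.
def duplicate_manifest?(csv_path)
  digest = Digest::SHA256.file(csv_path).hexdigest
  BatchIngest.exists?(manifest_sha256: digest)
end
```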
Re: file size, this is positive news, as it sounds like it might solve some other issues! Re: accidental batches, I envision @anayram and other members of the Metadata Team being gatekeepers of batch ingest, so those are unlikely to occur, but it's also good to know there may be a way to prevent identical batches from being run twice (there isn't in Avalon, ask me how I know!)
This looks really promising for digitization ingest as well! A few other considerations for porting these concepts to the digitization workflow:
- I think the Google Drive option would work well for those items that we have had digitized elsewhere, as we usually get those delivered on a hard drive.
#1986: Add new batch ingest models that will hold batch ingestion information
#1986: Add new google drive client service to be able to retrieve files/spreadsheets from Google Drive
Current Process:
Jupiter comes with a handy rake task for batch ingesting items and theses which can be found here: https://github.com/ualbertalib/jupiter/blob/master/lib/tasks/batch_ingest.rake
How it works:
First, you need to provide a CSV file which outlines all the item metadata you wish to ingest. An example template for what an expected CSV file might look like can be found here: https://docs.google.com/spreadsheets/d/178o_-ZEV3Ii-IzJ0AcJQTQqFsH3Ew2WCgyj2aZw99iY/. You can make a copy of this template, fill it out with your items' metadata, and then download the first sheet as a CSV file (File > Download as > Comma-separated values (.csv, current sheet)).
All associated files for the items outlined in the CSV file above must live in the same directory. So as an example, if we have downloaded the CSV file to `~/Downloads/example-batch-ingest-file.csv`, then any file referenced in the CSV file should also live in the `~/Downloads` directory.

Now we can run the batch ingest on the command line using the rake task. The argument `csv_path` is the location of the CSV file, so in the above example this might look like:
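As a purely illustrative example (the actual task name is defined in lib/tasks/batch_ingest.rake and may differ), the invocation might look something like `bundle exec rake jupiter:batch_ingest_items['~/Downloads/example-batch-ingest-file.csv']` – where `jupiter:batch_ingest_items` is a hypothetical task name and `csv_path` is the real argument described above.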
In production, this means we need to copy these files and the CSV over to the production server, and then someone needs to SSH into the server to run the task.
After running the above command, you will get output from the task as it progresses. Once completed, you should have successfully batch ingested your items into Jupiter! There should also be an ingest report created inside the tmp directory of the project: a CSV file containing the IDs, URLs, and titles of the ingested items for record keeping.
Pain points of current process:
- The current rake tasks were not meant for mass amounts of files/items. However, we are currently using them for exactly this (for example, ingesting 30GB worth of collections into Jupiter).
- The rake tasks have essentially no error handling and no way to recover if something goes wrong.
  - For example, before even kicking off the batch ingest, the rake task should verify that the CSV and the files it references are in good shape.
  - If it fails halfway through the batch ingest, can we continue from where we left off? Or do we have to just rerun it, which has a good chance of creating duplicate items/theses?
- The current process requires someone to SSH into the production server, copy files and folders over to this server, and then manually run this rake task. This could open us up to mistakes and a whole bunch of other problems that could put production at risk.
- Someone has to remain SSH'd into the production server to watch the batch ingest process to know when it has completed. If you want to see the ingest report, you need to SSH in to see it and/or copy it over to your computer.
Solutions and Ideas from Avalon/Hydra:
Avalon:
It appears they have Avalon watch a directory (which might live in a dropbox or on a server, etc.). Once an Ingest Package is found in this directory, Avalon kicks into work mode. It will first analyze and verify the manifest file within the Ingest Package to see if it's valid. If it is, the user listed on the manifest file will get an email stating that it is valid and that Avalon will now work on ingesting the batch. Once the batch is completed, the user will get an email stating the batch was successfully ingested, with a list of the items deposited. If any error occurs, the user will get an email with the error message regarding the failure and with steps on how to troubleshoot the problem. Apparently they let you replay a batch ingest to recover from any errors that happened, which is pretty interesting (https://wiki.dlib.indiana.edu/display/VarVideo/Troubleshooting+a+Batch+Ingest).
More info: https://wiki.dlib.indiana.edu/display/VarVideo/Uploading+Content+using+Batch+Ingest
Hydra:
Hydra has an addon gem for this here: https://github.com/samvera-labs/hyrax-batch_ingest
Essentially it's a backend CRUD (think: under /admin you might have a Batch Ingest tab) that allows a user to create a new batch ingest job by uploading different types of files and choosing an ingest type. The app will then consume the uploaded file and, depending on the ingest type, do different things (e.g. the upload appears to be either a ZIP file with a manifest and files, or a CSV file – the CSV probably just for updating item metadata, or for referencing files within a watched directory). It then kicks off Sidekiq jobs for each file/item to be ingested. There is a batch ingest index showing all the batch ingest jobs and their status, and you can click on each batch ingest job to get more details about it. It sends emails like Avalon does, depending on the batch ingest status. Here are some screenshots of this in the wild (found in a Samvera Virtual Connect 2019 presentation; the full PDF can be found here: https://wiki.lyrasis.org/download/attachments/108759543/Samvera_VirtualConnect_2019_batchingest.pdf):
Improvements:
I think the improvements we can make center around a handful of topics.
Protecting Production
We should have no one in the production server if we can help it, so we need to provide an alternative method for kicking off the batch ingest work. Some suggestions: either build functionality into Jupiter that will facilitate this (a Batch Ingest CRUD in the admin backend), or leverage Google Drive or something similar instead – or maybe both?
My thinking is we have a Batch Ingest CRUD in the backend, where you can upload files/folders and other information about the batch ingest. Jupiter then takes this and kicks off the work via Sidekiq jobs (similar to the Hyrax example above). Bonus points if we can somehow pull the information from Google Drive.
Then no one needs to copy files over from their local machine to production, or to SSH into production to kick off this job. This will all be done from Jupiter's web interface, which helps us prevent accidental mistakes, allows this process to be testable/automated, etc. A rough sketch of what this could look like follows.
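Under the assumption that we add a `BatchIngest` model and an ActiveJob that Sidekiq processes (all names here are hypothetical, not existing Jupiter code), the backend could look roughly like:

```ruby
# Hypothetical model holding batch ingest information (status, manifest, owner).
class BatchIngest < ApplicationRecord
  enum status: { created: 0, processing: 1, completed: 2, failed: 3 }
  belongs_to :user
end

# Hypothetical ActiveJob processed by Sidekiq; the admin controller would call
# BatchIngestionJob.perform_later(batch_ingest.id) after the form is submitted.
class BatchIngestionJob < ApplicationJob
  queue_as :default

  def perform(batch_ingest_id)
    batch_ingest = BatchIngest.find(batch_ingest_id)
    batch_ingest.processing!
    # ... validate the manifest, then ingest each row/file ...
    batch_ingest.completed!
  rescue StandardError
    batch_ingest&.failed!
    raise
  end
end
```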
Better notifications
As of now, the user has to keep the rake task running in order to see the status of the batch ingest (we simply log to standard out, so I suppose you could run the task as a background process and pipe it to a text file, etc. as an alternative). The items that have been successfully ingested are listed in a report that is generated and stored in the tmp directory of the application. Both of these are less than ideal ways of notifying users and keeping them up to date on the status of the batch ingest (especially if a user is not technically savvy).
Both Avalon and Hyrax, as outlined above, make heavy use of email to notify users of the progress of the batch ingest, of any errors that have occurred, and of when the batch ingest has been completed.
I feel like we could do something similar, especially on the email front. Having email for the batch ingest process would be an easy win. We currently have no email infrastructure in place in Jupiter. Rails makes this easy on the development side; we just need to coordinate with the system admins to give us access to an email server (both Avalon and Discovery send out emails, so I can't see this being difficult – it looks like Discovery is talking to a Sendmail server and Avalon uses an SMTP server?).
Hyrax also shows the batch ingest's progress in the web app. If we also initiate batch ingests from the web interface (/admin), then we should display the status of the batch ingest there as well.
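If we go the email route, a bare-bones ActionMailer sketch might look like the following; the mailer name, its view templates, and how the batch ingest record is passed in are all assumptions, and the actual SMTP/Sendmail settings would come from the sysadmins:

```ruby
# app/mailers/batch_ingest_mailer.rb (hypothetical; corresponding view templates assumed)
class BatchIngestMailer < ApplicationMailer
  def completed_email
    @batch_ingest = params[:batch_ingest]
    mail(to: @batch_ingest.user.email,
         subject: "Batch ingest \"#{@batch_ingest.title}\" has completed")
  end

  def failed_email
    @batch_ingest = params[:batch_ingest]
    mail(to: @batch_ingest.user.email,
         subject: "Batch ingest \"#{@batch_ingest.title}\" failed")
  end
end

# Sent from the ingest job once processing finishes, e.g.:
# BatchIngestMailer.with(batch_ingest: batch_ingest).completed_email.deliver_later
```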
Better Error handling
There are two ways we can improve error handling in the current process. The first is preventing user error from occurring in the initial batch ingest, and the second is allowing users to correct errors that happen during the batch ingest.
Preventing user error during the initial batch ingest will mostly come down to the CSV manifest and files. We should be auditing these before kicking off the work: we should have some code that checks and validates the format of the CSV file, that the files exist and are accessible by Jupiter, that the file checksums match, etc. Once this is all good, then we can proceed with the work.
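A sketch of that kind of up-front audit is below; the column names (`title`, `file_name`, `checksum`) are assumptions about what the manifest contains, not the actual template:

```ruby
require 'csv'
require 'digest'

# Hypothetical pre-flight check run before any ingest work is enqueued.
# Returns a list of human-readable problems; an empty list means we can proceed.
def validate_manifest(csv_path, files_dir, required_headers: %w[title file_name])
  errors = []
  rows = CSV.read(csv_path, headers: true)

  missing = required_headers - rows.headers.compact
  errors << "Manifest is missing columns: #{missing.join(', ')}" if missing.any?

  rows.each.with_index(2) do |row, line| # line 1 is the header row
    file_path = File.join(files_dir, row['file_name'].to_s)
    unless File.readable?(file_path)
      errors << "Line #{line}: file not found or unreadable: #{row['file_name']}"
      next
    end
    if row['checksum'] && Digest::MD5.file(file_path).hexdigest != row['checksum']
      errors << "Line #{line}: checksum mismatch for #{row['file_name']}"
    end
  end
  errors
end
```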
Allowing users to correct errors that happen during the batch ingest process is mostly about errors that happen outside of the user's control. If Jupiter times out when attempting to deposit a file, for example, can Jupiter or a user recover from this? Can Sidekiq retry the job? Can a user re-upload the batch and replay the batch ingest again without creating duplicate entries? I feel like Avalon has a few ideas here that we can maybe borrow. But ultimately we need a way for users to recover from problems without having to bug developers/staff or needing someone to go into the Rails console in production to fix the problems.
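One option for the "recover without creating duplicates" part is to make the per-item job idempotent and let ActiveJob/Sidekiq retry it. A rough sketch, where `BatchIngestFile` and its `item_id` column are assumptions:

```ruby
# Hypothetical per-file job: safe to retry or to re-run for a replayed batch,
# because it skips rows that have already produced an item.
class BatchIngestItemJob < ApplicationJob
  queue_as :default
  retry_on Timeout::Error, wait: :exponentially_longer, attempts: 5

  def perform(batch_ingest_file_id)
    record = BatchIngestFile.find(batch_ingest_file_id)
    return if record.item_id.present? # already ingested, nothing to do

    item = ingest_item(record)
    record.update!(item_id: item.id) # remember it so a replay won't duplicate it
  end

  private

  # Placeholder for the actual deposit logic (create the Item, attach its file).
  def ingest_item(record)
    raise NotImplementedError, 'deposit logic goes here'
  end
end
```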
How to deal with large files/manifests?
This is something I am not quite sure how to accomplish. If we allow admin users to upload files from a form in Jupiter, then we will have a technical limit on how much data we can upload – either from the file upload limit via Rails/Apache, or from file storage on the actual server, etc.
We probably need to find out more about the requirements and how often this will happen. And maybe the easy solution is that we don't allow large files/manifests at all? If you want to upload a collection that is 30GB+, then you may have to break the work into small, sizable chunks and upload multiple batch ingest jobs. E.g. instead of uploading 10GB at once, maybe upload 5 jobs of 2GB each, etc.
One thing that will be worth investigating, and that could help us here, is whether we can pull files from Google Drive. If we can, this might solve some of these pains, as we wouldn't need to send this data through a form submission or temporarily store the files on the server. Sidekiq could pull down one file at a time as it works through the collection, which would help with quite a few of the limits that we have.
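As a sketch of what pulling a single file from Google Drive could look like with the official google-api-client gem: the credential setup is elided, and the wrapper class name is only an assumption about how the Google Drive client service mentioned above might be shaped.

```ruby
require 'google/apis/drive_v3'

# Hypothetical wrapper around the Drive v3 API: a Sidekiq job could call this to
# stream one file at a time to a local temp path instead of relying on a form upload.
class GoogleDriveClient
  def initialize(authorization)
    @drive = Google::Apis::DriveV3::DriveService.new
    @drive.authorization = authorization # e.g. an OAuth2 credential for the signed-in user
  end

  # Downloads the file with the given Drive file ID to dest_path and returns the path.
  def download(file_id, dest_path)
    @drive.get_file(file_id, download_dest: dest_path)
    dest_path
  end
end
```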