Batch Ingest Improvements #1986

Open

murny opened this issue Nov 13, 2020 · 9 comments

@murny commented Nov 13, 2020

Current Process:

Jupiter comes with a handy rake task for batch ingesting items and theses, which can be found here: https://github.com/ualbertalib/jupiter/blob/master/lib/tasks/batch_ingest.rake

How it works:

  • First you need to provide a CSV file which outlines all the metadata for the items you wish to ingest. An example template for what an expected CSV file might look like can be found here: https://docs.google.com/spreadsheets/d/178o_-ZEV3Ii-IzJ0AcJQTQqFsH3Ew2WCgyj2aZw99iY/. You can make a copy of this template, fill it out with your items' metadata, and then download the first sheet as a CSV file (File > Download as > Comma-separated values (.csv, current sheet)).

  • All associated files for the items outlined in the CSV file above must live in the same directory. So, as an example, if we have downloaded the CSV file to ~/Downloads/example-batch-ingest-file.csv, then any file referenced in the CSV file should also live in the ~/Downloads directory.

  • Now we can run the batch ingest on the command line using the following rake task:

    rails jupiter:batch_ingest_items[csv_path]
    

    The argument csv_path is the location of the CSV file. So in the above example, this might look like:

    rails jupiter:batch_ingest_items["~/Downloads/example-batch-ingest-file.csv"]
    

    In production this means we need to copy these files and the CSV over to the production server, and then someone
    needs to SSH into the server to run the task.

  • After running the above command, you will get output from the task as it progresses. Once it completes, you should have successfully batch ingested your items into Jupiter! There should also be an ingest report created inside the tmp directory of the project: a CSV file containing the IDs, URLs, and titles of the ingested items, for record keeping.

Pain points of current process:

  • The current rake tasks were not meant for mass amounts of files/items. However, we are currently using them for exactly this (for example, ingesting 30 GB worth of collections into Jupiter).

  • The rake tasks have essentially no error handling and no way to recover if something goes wrong.

    • For example, before even kicking off the batch ingest, the rake task should verify that the CSV and the files it references are in good shape

    • If it fails halfway through the batch ingest, can we continue from where we left off? Or do we have to rerun the whole thing, which has a good chance of creating duplicate items/theses?

  • The current process requires someone to copy files and folders over to the production server, SSH in, and then manually run this rake task. This opens us up to mistakes and a whole bunch of other problems that could put production at risk.

  • Someone has to remain SSH'd into the production server to watch the batch ingest process and know when it has completed. If you want to see the ingest report, you need to SSH in to view it and/or copy it over to your computer.

Solutions and Ideas from Avalon/Hydra:

Avalon:

It appears they have Avalon watch a directory (which might live in a dropbox or on a server, etc.). Once an Ingest Package is found in this directory, Avalon kicks into work mode. It first analyzes the manifest file within the Ingest Package to see if it's valid. If it is, the user listed on the manifest file gets an email stating that it is valid and that Avalon will now work on ingesting the batch. Once the batch is completed, the user gets an email stating the batch was successfully ingested, with a list of the items deposited. If any error occurs, the user gets an email with the error message regarding the failure and steps on how to troubleshoot the problem. Apparently they also let you replay a batch ingest to recover from any errors that happened, which is pretty interesting (https://wiki.dlib.indiana.edu/display/VarVideo/Troubleshooting+a+Batch+Ingest)

More info: https://wiki.dlib.indiana.edu/display/VarVideo/Uploading+Content+using+Batch+Ingest

Hydra:

Hydra has an addon gem for this here: https://github.com/samvera-labs/hyrax-batch_ingest

Essentially it's a backend CRUD (think: under /admin you might have a batch ingest tab) that allows a user to create a new batch ingest job by uploading different types of files and choosing an ingest type. The app then consumes the uploaded file and, depending on the ingest type, does different things (e.g., the upload appears to be either a zip file with a manifest and files, or a CSV file; the CSV is probably just for updating item metadata or referencing files within a watched directory). It then kicks off Sidekiq jobs for each file/item to be ingested. There is a batch ingest index showing all the batch ingest jobs and their statuses, and you can click on each batch ingest job to get more details about it. It sends emails like Avalon does, depending on the batch ingest status. Here are some screenshots of this in the wild (found in a Samvera Virtual Connect 2019 presentation; the full PDF can be found here: https://wiki.lyrasis.org/download/attachments/108759543/Samvera_VirtualConnect_2019_batchingest.pdf):

[Screenshots of the Hyrax batch ingest screens from the presentation linked above]

Improvements:

I think the improvements we can make center around a handful of topics.

Protecting Production

We should have no one on the production server if we can help it, so we need to provide an alternative method for kicking off the batch ingest work. One suggestion is to build functionality into Jupiter that facilitates this (a Batch Ingest CRUD in the admin backend). Or can we leverage Google Drive or something similar instead? Or maybe both?

My thinking is we have a Batch Ingest CRUD in the backend, where you can upload files/folders and other information about the batch ingest. Jupiter then takes this and kicks off the work via Sidekiq jobs (similar to the Hyrax example above). Bonus points if we can somehow pull the information from Google Drive.

Then no one needs to copy files over from their local machine to production or SSH into production to kick off this job; it will all be done from Jupiter's web interface. This helps us prevent accidental mistakes, allows the process to be testable/automated, etc.

Better notifications

As of now, the user has to keep the rake task running in order to see the status of the batch ingest (we simply log to standard out, so I suppose you could run the task as a background process and pipe the output to a text file, etc., as an alternative). The items that have been successfully ingested are listed in a report that is generated and stored in the tmp directory of the application. Both of these are less-than-ideal ways of notifying users and keeping them up to date on the status of the batch ingest (especially if a user is not technically savvy).

Both Avalon and Hyrax, as outlined above, make heavy use of email to notify users of the progress of the batch ingest, of any errors that have occurred, and of when the batch ingest has completed.

I feel like we could do something similar, especially on the email front: emailing for the batch ingest process would be an easy win. We currently have no email infrastructure in place in Jupiter. Rails makes this easy on the development side; we just need to coordinate with system admins to get access to an email server (both Avalon and Discovery send out emails, so I can't see this being difficult; it looks like Discovery talks to a Sendmail server and Avalon uses an SMTP server?).
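
On the Rails side, this is mostly configuration plus a mailer class. A minimal sketch, assuming a standard ApplicationMailer base class and placeholder SMTP details we'd get from the system admins:

    # config/environments/production.rb -- hypothetical SMTP settings
    config.action_mailer.delivery_method = :smtp
    config.action_mailer.smtp_settings = {
      address: "smtp.example.ualberta.ca", # placeholder host
      port: 587,
      authentication: :plain,
      user_name: Rails.application.credentials.dig(:smtp, :user_name),
      password: Rails.application.credentials.dig(:smtp, :password)
    }

    # app/mailers/batch_ingest_mailer.rb -- hypothetical mailer
    class BatchIngestMailer < ApplicationMailer
      def batch_ingest_complete(batch_ingest)
        @batch_ingest = batch_ingest
        mail to: batch_ingest.user.email,
             subject: "Batch ingest ##{batch_ingest.id} has completed"
      end
    end

    # From a background job:
    # BatchIngestMailer.batch_ingest_complete(batch_ingest).deliver_later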

Hyrax also shows batch ingest progress in the web app. If we likewise initiate batch ingests from the web interface (/admin), then we should display the status of the batch ingest there as well.

Better Error handling

There are two ways we can improve error handling in the current process. The first is preventing user error from occurring in the initial batch ingest, and the second is allowing users to correct errors that happen during the batch ingest.

Preventing user error during the initial batch ingest mostly comes down to the CSV manifest and files. We should audit these before kicking off the work: code that checks and validates the format of the CSV file, that the files exist and are accessible by Jupiter, that the provided checksums match the files' actual checksums, etc. Once this all passes, we can proceed with the work.
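
As a rough illustration of what that pre-flight audit could look like (column names like file_name and title are placeholders; the real ones come from the manifest template):

    require "csv"

    # Hypothetical pre-flight audit: collect every problem before any ingest work starts.
    REQUIRED_HEADERS = %w[file_name title].freeze # placeholder columns

    def audit_manifest(csv_path)
      errors = []
      rows = CSV.read(csv_path, headers: true)
      missing = REQUIRED_HEADERS - rows.headers.to_a
      errors << "Missing columns: #{missing.join(', ')}" if missing.any?

      rows.each.with_index(2) do |row, line| # line 1 is the header row
        file = File.expand_path(row["file_name"].to_s, File.dirname(csv_path))
        errors << "Line #{line}: file not found: #{file}" unless File.file?(file)
        errors << "Line #{line}: title is blank" if row["title"].to_s.strip.empty?
      end
      errors
    end

    # problems = audit_manifest("example-batch-ingest-file.csv")
    # abort("Refusing to ingest:\n#{problems.join("\n")}") if problems.any?

A checksum comparison (a manifest checksum column versus a Digest of each file on disk) would slot into the same loop.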

Allowing users to correct errors that happen during the batch ingest process is mostly about errors outside of the user's control. If Jupiter times out when attempting to deposit a file, for example, can Jupiter or the user recover from this? Can Sidekiq retry the job? Can a user re-upload the batch and replay the batch ingest without creating duplicate entries? I feel like Avalon has a few ideas here that we can borrow. But ultimately we need a way for users to recover from problems without having to bug developers/staff or needing someone to go into the Rails console in production to fix things.
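
On the retry side, if the per-item work runs as an ActiveJob-backed Sidekiq job, declarative retries come almost for free; the harder part is making the job idempotent so a replay doesn't create duplicates. A hedged sketch (BatchIngestItemJob and its ingest helper are hypothetical):

    class BatchIngestItemJob < ApplicationJob
      queue_as :default

      # Transient failures (e.g. a deposit timing out) get retried automatically.
      retry_on Timeout::Error, wait: :exponentially_longer, attempts: 5

      def perform(batch_ingest_item)
        # Idempotency guard: skip rows that already produced an Item,
        # so replaying a batch doesn't create duplicates.
        return if batch_ingest_item.item.present?

        ingest(batch_ingest_item) # hypothetical helper that does the actual deposit
      end
    end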

How to deal with large files/manifests?

This is something I am not quite sure how to accomplish. If we allow admin users to upload files from a form in Jupiter, then we will have a technical limit on how much data we can upload, whether from the file upload limit in Rails/Apache, file storage on the actual server, etc.

We probably need to gather more requirements on how often this will happen. Maybe the easy solution is that we don't allow large files/manifests at all? If you want to upload a collection that is 30 GB+, you may have to break the work into smaller chunks and upload multiple batch ingest jobs (e.g., instead of uploading 10 GB at once, upload 5 jobs of 2 GB each).

One thing worth investigating that could help us here is whether we can pull files from Google Drive. If we can, this might solve some of these pains, as we wouldn't need to send this data through a form submission or temporarily store the files on the server. Sidekiq can pull down one file at a time as it works through the collection, which would get around quite a few of the limits we have.

@mbarnett commented:

I think it would be worth doing some soft investigation of the Google Drive API: what gems are out there, what limits we might run into, what the authentication workflow looks like, etc.

My concern right now with the ZIP or even the Avalon approach is that, with all of this SET stuff, we might very rapidly end up in more of a normal cloud environment, where local storage (to give SFTP access to, or even to unpack a ZIP in) might not be feasible. If we can offload the whole temporary-storage-for-the-ingest-queue side of things to something external to the server, it might make our lives easier?

@murny commented Nov 16, 2020

Yeah, makes sense. Thanks for the feedback. Will spend some time looking into the Google Drive API and see what's possible in that world 👍

@murny commented Nov 19, 2020

So, as mentioned in chat, the Google Drive API seems to give us a path to accomplish what we want to do.

Looks like we can have users log in to their Google Drive accounts and then choose the files they want to upload from their drive.

Jupiter will receive this as a list of URLs with their tokens. As an example, a URL will look like https://www.googleapis.com/drive/v3/files/24392442942992442?alt=media, and we will have other metadata available that we should be able to use to download these files at a later time, when it suits us (from background jobs), instead of downloading everything at once. This works similarly to Samvera's Browse Everything file upload plugin.
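
For the deferred download itself, the google-api-client gem can stream a Drive file by ID straight to disk. A sketch, assuming we've kept the user's OAuth credentials alongside the batch (the authorization plumbing is elided):

    require "google/apis/drive_v3"

    # Download one Drive file to a local path from a background job.
    # `credentials` is assumed to be a previously obtained OAuth2 authorization.
    def download_drive_file(file_id, dest_path, credentials)
      drive = Google::Apis::DriveV3::DriveService.new
      drive.authorization = credentials
      # Streams the file contents (the ?alt=media endpoint above) to dest_path.
      drive.get_file(file_id, download_dest: dest_path)
    end

    # download_drive_file("24392442942992442", "/tmp/ingest/file.pdf", credentials)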

Possible Workflow:

  1. User goes to the Batch Ingests screen (/admin/batch_ingests)
  2. Clicks on the create new Batch Ingest button (/admin/batch_ingests/new)
  3. Fills out form details (Title, Type of ingest: Thesis or Item, etc.)
  4. User clicks on the upload button (the user will be required to authenticate with their Google account, then will be shown the files in their Google Drive account and can choose the files they want to upload)
  5. User clicks the submit button
  6. Jupiter grabs the info from the form submission and does the following:
    • creates a new batch ingest object in the DB with the form info
    • queues up a background job for the batch ingest object
    • redirects the user back to the batch ingest index with a message letting them know the ingest has been accepted
  7. User can then view the batch ingest object and its progress on either the batch ingest index or by clicking on the batch ingest that was just created (/admin/batch_ingests/1)
  8. The background job that was enqueued for the batch ingest object runs and does the following (a rough sketch of this job follows the list):
    • it first pulls down the manifest file and validates its structure
    • for each row in the manifest (each row being an item or thesis), it creates a batch ingest item object in the database with the row's metadata and the URL reference to its file
    • it then queues up a job for each batch ingest item
    • we then email the user saying the batch ingest has started and also update the batch ingest object's status
  9. The batch ingest item jobs then run. Each job downloads the file from the URL, then creates a new Thesis/Item by ingesting the metadata and file provided by the batch ingest item. Finally, it updates its batch ingest item's status and references the new Item/Thesis.
  10. Once the last job completes, the batch ingest object gets its status updated and an email gets sent to the user letting them know the batch ingest has completed. The batch ingest show page will then show all the ingested batch ingest items, with a link to each one's URL in Jupiter.
  11. If any error happens along the way, the user will be notified by email and the error will be displayed on the batch ingest index/show page, hopefully with information on how the user can fix the error or rerun the batch ingest.
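
A rough sketch of the orchestrating job from step 8 (all model, mailer, and helper names are hypothetical placeholders):

    require "csv"

    class BatchIngestionJob < ApplicationJob
      def perform(batch_ingest)
        manifest_path = download_manifest(batch_ingest) # hypothetical Drive fetch
        validate_manifest!(manifest_path)               # hypothetical structure check

        CSV.foreach(manifest_path, headers: true) do |row|
          item = batch_ingest.batch_ingest_items.create!(
            metadata: row.to_h,
            file_url: row["file_url"] # placeholder column name
          )
          BatchIngestItemJob.perform_later(item) # one job per row (step 9)
        end

        batch_ingest.update!(status: :processing)
        BatchIngestMailer.batch_ingest_started(batch_ingest).deliver_later
      end
    end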

Work required:

  • New screens for Batch Ingests (Index, New, Show)
  • Two new models, one for the batch ingest and one for batch ingest items (a BatchIngest has many BatchIngestItems); these models store information about the batch ingest (sketched after this list)
  • Email infrastructure needs to be setup for notifications
  • Google Integration (Google Auth and Google API client libraries need to be installed, configured and built out)
  • Check if current manifest template files need updating (These template files should be referenced or downloadable from Jupiter)
  • New background jobs for facilitating the batch ingest work.
  • Tons of testing (manual/automated)
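
A minimal sketch of the two models mentioned above (all names, columns, and statuses are placeholders):

    # app/models/batch_ingest.rb
    class BatchIngest < ApplicationRecord
      belongs_to :user
      has_many :batch_ingest_items, dependent: :destroy

      # assumed columns: title, status, error_message, ...
      enum status: { created: 0, processing: 1, completed: 2, failed: 3 }
    end

    # app/models/batch_ingest_item.rb
    class BatchIngestItem < ApplicationRecord
      belongs_to :batch_ingest
      belongs_to :item, optional: true # set once the Item/Thesis has been ingested

      enum status: { pending: 0, ingested: 1, failed: 2 }
    end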

@murny murny changed the title WIP: Batch Ingest Improvements Batch Ingest Improvements Nov 19, 2020
@mbarnett commented:

Looks good in general. Let's call email and theses out of scope for an MVP implementation, just to try to whittle this down a bit (the next thesis ingest isn't for another 6 months). We could focus on either items or the digitization pamphlets we're starting with (Folkfest programs) if we want an even simpler model (although that model isn't done yet, so items may be a faster path).

I'll ask Sean to take a look at this and let us know if he sees any major issues or concerns.

@mbarnett commented Nov 25, 2020

The key point is that whatever we do here, we'll want to make it really easy to extend to new kinds of models, so that we can use this to ingest the digitization material too. The other thing that will likely be important is after-ingest reporting, especially around validation failures, that the service people can review. Again, I think that's something we should design with in mind but leave the implementation until after the MVP, as we want this to be leverageable ASAP for digitization needs.

@seanluyk commented Dec 1, 2020

The proposed workflow looks very sensible to me; I think it will work well. I'll just flag some things already mentioned in emails that are worth preserving for future consideration:

  • File size: will large files be an issue through this method? I'm not sure what the limits of the Google API are, but there's already the issue with depositing files larger than 2 GB (?) through the UI, so this might be another related problem to consider

  • Reporting: I don't think we need anything fancy, but it's been really helpful in Avalon to have a list of objects that failed during a batch ingest

  • Completed batches: will there be ways to prevent accidentally running a batch twice?

@mbarnett commented Dec 1, 2020

File size: I dug into this a bit last week, and we should be good? Organizationally, as far as I know, the University has no maximum on how much it can put into Google Drive in total. Individually, the limit is 750 GB uploaded per day, and the maximum individual file size is 5 TB, which is enormous. Really, this should give us a nice way around the connection timeout issues which right now severely limit total upload sizes.

Edit: Info on GDrive limits: https://support.google.com/a/answer/172541?hl=en

Running a batch twice should be pretty easy to avoid in the most obvious case of submitting the exact same CSV or similar (I'm thinking we could just hash the CSV or something, @murny, to easily rule out a dupe?). "Accidentally similar" batches that submit the same item again might be harder to catch in general, although if we can identify common identifiers for items that would let us spot already-ingested material, that would be something we could explore after the initial version?
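
The hashing idea is cheap to sketch; assuming a hypothetical manifest_sha256 column on the batch ingest model (ideally with a unique index):

    require "digest"

    # Fingerprint the uploaded manifest to reject exact duplicates.
    digest = Digest::SHA256.file(csv_path).hexdigest

    if BatchIngest.exists?(manifest_sha256: digest) # hypothetical column
      # Refuse the batch (or at least warn the user) instead of ingesting it twice.
      raise "This manifest has already been ingested"
    end

    BatchIngest.create!(manifest_sha256: digest) # plus the rest of the form attributes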

@seanluyk commented Dec 2, 2020

Re: file size, this is positive news, as it sounds like it might solve some other issues!

Re: accidental batches, I envision @anayram and other members of the Metadata Team being the gatekeepers of batch ingest, so those are unlikely to occur, but it's also good to know there may be a way to prevent identical batches from being run twice (there isn't in Avalon, ask me how I know!)

@sfbetz commented Dec 3, 2020

This looks really promising for digitization ingest as well! A few other considerations for porting these concepts to the digitization workflow:

  • We deal with very large numbers of items on a regular basis, so thinking about the scale of ingest in terms of hundreds to thousands of objects is probably good, although we can easily split things up into smaller batches if that proves to be a limitation.

  • Most of our future digital items will come from our IA digitization activities, and these get uploaded straight into IA as part of their digitization workflow (so we don't receive file transfers or hard drives from IA). IA has an API, with documentation here: https://archive.org/services/docs/api/. Pulling materials directly from IA into Jupiter would be super helpful (see the sketch after these bullets). Alternatively, we can pull materials down from IA and into a Google Drive.

  • I think the Google Drive option would work well for those items that we have had digitized elsewhere, as we usually get those delivered on a hard drive.
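
For reference, IA's metadata endpoint is public JSON, so listing an item's files before pulling them into Jupiter could look roughly like this (the identifier is a placeholder):

    require "json"
    require "net/http"

    # List the files attached to an Internet Archive item via its public
    # metadata API (documented at https://archive.org/services/docs/api/).
    def ia_file_names(identifier)
      uri = URI("https://archive.org/metadata/#{identifier}")
      metadata = JSON.parse(Net::HTTP.get(uri))
      metadata.fetch("files", []).map { |f| f["name"] }
    end

    # Each file is then downloadable from:
    #   https://archive.org/download/<identifier>/<file name>
    # ia_file_names("some-ia-identifier") # placeholder identifier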
