Digitization Folk Fest Ingest #2011

pgwillia · 2020-12-03T22:08:58Z

Is your feature request related to a problem? Please describe.
The first set of digitization content will be the Folk Fest Programs. This set was chosen because it is small (there are 36) and well described.

~~Add has_one_attached :historical_archive to Book Model which takes the full archive for all directories relating to that object and archive it in case it's needed. Will not appear in the UI~~ Remove has_one_attached :historical_archive from Book Model and replace with metadata attribute relators:rps (http://id.loc.gov/vocabulary/relators/rps) to link to the complete AIP stored elsewhere.
fulltext will be extracted from the ALTO xml. Should be easy to access in a join table but not the main object table so that results stay efficient.
Attach one high quality PDF
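
A minimal sketch of the fulltext extraction, assuming the ALTO files use the usual `TextLine`/`String` layout with each word in a `CONTENT` attribute. The method name `alto_fulltext` and the sample document are made up for illustration, and hyphenation (`HYP`) elements are ignored. The result would be what gets stored in the join table.

```ruby
require 'rexml/document'

# Returns the plain text of an ALTO document: one output line per TextLine,
# words joined by single spaces. Elements are matched by local name so the
# ALTO namespace version does not matter.
def alto_fulltext(xml)
  doc = REXML::Document.new(xml)
  lines = []
  doc.root.each_recursive do |element|
    next unless element.name == 'TextLine'

    words = element.elements.to_a
                   .select { |child| child.name == 'String' }
                   .map { |string| string.attributes['CONTENT'] }
    lines << words.compact.join(' ')
  end
  lines.join("\n")
end

# Hypothetical sample, not taken from the Folk Fest packages.
SAMPLE_ALTO = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
    <Layout>
      <Page ID="P1">
        <PrintSpace>
          <TextBlock ID="B1">
            <TextLine>
              <String CONTENT="Edmonton"/>
              <SP/>
              <String CONTENT="Folk"/>
            </TextLine>
            <TextLine>
              <String CONTENT="Music"/>
              <SP/>
              <String CONTENT="Festival"/>
            </TextLine>
          </TextBlock>
        </PrintSpace>
      </Page>
    </Layout>
  </alto>
XML

puts alto_fulltext(SAMPLE_ALTO)
```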

Initial ingest assumes that we'll push the content (triples csv, pdf, alto) to the server for ingest from a rake script.

Follows #2010


mbarnett · 2021-03-10T23:05:45Z

– There is a spreadsheet somewhere that maps the folkfest stuff to Swift IDs. I'll try to chase that down, or Sarah knows the details.

– We talked a bit about how we'll want to deal with Preservation/AIPs, and previous archival efforts, and the cheap interim solution I landed on was that we'll just take the tarred (or zipped or whatever the package is) copy of whatever we pull out of Swift, extend the book (and future) models w/ has_one :historical_archive, and attach the full thing to the model. The historical archive will never be loaded or made available for UI purposes and we'll have no reason to ever want to load it, so the attachment's only purpose is that, in the future, somebody can find the old copy of the data.

– We're just looking to attach one big high quality PDF of the work for things like this and newspapers, not per-page images, and I'm unsure if that's what we'll get out of Swift or if this will require some preprocessing.

– The fulltext is in a metadata file whose name I forget right now. We'll have to find a parser for it if one exists, and figure out how to store the data – I'm thinking as an attachment if the parsing isn't incredibly complex, or in a text field on a not-normally-loaded join table if we need to heavily pre-process the data upfront and keep the indexable copy somewhere we can get at it easily? The reason not to store the full text on the table itself is that it would make reading the row back from the DB for search results and general metadata viewing way too costly – we'll only ever need the fulltext data for (re-)indexing. (I guess this may argue for the pre-process-into-join-table approach because we'll need to reindex that data on save? Thinking ahead, we don't want to have to pull the text file from the Cloud just to re-save a file, so likely we want that in the DB somewhere, in a cheap-to-reindex form.)

sarahseverson · 2021-03-10T23:44:46Z

Here is the spreadsheet with the Peel numbers and matching Swift NOIDS. I've asked Tianyu to grab the file packages so we can see what is inside when we meet next Thursday.

mbarnett · 2021-03-10T23:45:37Z

Thanks!

sarahseverson · 2021-03-16T15:36:24Z

FYI: Here is a Google Drive folder with all the Folk Fest items from SWIFT - it's big at 11 GB but we pulled everything so you could see what's there. Have fun with the files!

…fest_ingest

## Context

The first set of digitization content will be the Folk Fest Programs. This set was chosen because it is small (there are 36) and well described.

This adds attributes to store the artifacts that we'll need

- in how we'll want to deal with Preservation/AIPs, and previous archival efforts. The historical archive will never be loaded or made available for UI purposes and we'll have no reason to ever want to load it, so the attachment's only purpose is, in the future, somebody can find the old copy of the data.
- fulltext in an easy to process format. We want this in a join table so that it doesn't penalize us when we frequently request the object but close enough that we can re-index as required.

Related to #2011

## What's New

- Add `has_one_attached :historical_archive` to Book Model which takes the full archive for all directories relating to that object and archive it in case it's needed. Will not appear in the UI
- `fulltext` will be extracted from the ALTO xml. Should be easy to access in a join table but not the main object table so that results stay efficient.
- Attach one high quality PDF

pgwillia · 2021-07-14T22:02:30Z

> FYI: Here is a Google Drive folder with all the Folk Fest items from SWIFT - it's big at 11 GB but we pulled everything so you could see what's there. Have fun with the files!

@sarahseverson I made an assumption that this was exactly how the material was in Swift: in directories with the noid label. But there's some indication in Matt's note that the content might have come compressed:

> take the tarred (or zipped or whatever the package is) copy of whatever we pull out of Swift

What format was it in? I've assumed that we should store it as *.tar.gz, does that make sense?
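
For reference, a rough sketch of what packaging one noid directory as *.tar.gz could look like, assuming we do end up building the archive ourselves. The method name `tar_gz_directory` is hypothetical. It uses the tar writer that ships with RubyGems and builds the archive in memory, so it is only meant for small packages; anything near the 11 GB total would need to stream to a file instead.

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'
require 'pathname'

# Packs every regular file under +dir+ into a gzipped tar and returns the
# bytes. Entry names are relative to the parent of +dir+, so the archive
# unpacks into a single top-level directory named after the noid.
def tar_gz_directory(dir)
  root = Pathname(dir)
  buffer = StringIO.new(String.new(encoding: Encoding::BINARY))
  gz = Zlib::GzipWriter.new(buffer)

  Gem::Package::TarWriter.new(gz) do |tar|
    root.glob('**/*').select(&:file?).sort.each do |path|
      name = path.relative_path_from(root.parent).to_s
      tar.add_file_simple(name, path.stat.mode & 0o777, path.size) do |io|
        io.write(path.binread)
      end
    end
  end

  gz.finish # flush the gzip trailer without closing the buffer
  buffer.string
end

# Usage (hypothetical path):
#   File.binwrite('examplenoid.tar.gz', tar_gz_directory('/staging/examplenoid'))
```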

pgwillia · 2021-07-21T15:24:00Z

Peel preservation examples was prepared by @kgood to show some of the differences in structures in Swift.

pgwillia · 2021-07-29T20:53:37Z

@kgood @sarahseverson @henryzhang87 and I met today. We discussed some questions I had:

Will a spreadsheet like https://docs.google.com/spreadsheets/d/1U8GckSd-tTaGlBk-TllPl1cd80h8PSQqEdKXUuoCNlI/edit?usp=sharing be created for each batch mapping "Code" to noid?

Yes, though there might be some differences; newspapers might be the biggest exception.
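
A small sketch of how the ingest could read such a spreadsheet once it is exported as CSV. The column names `Code` and `Noid`, the method name `code_to_noid`, and the sample values are all assumptions; the real headers may differ per batch.

```ruby
require 'csv'

# Builds a { code => noid } lookup from CSV text with a header row.
# Blank rows are skipped; a repeated code raises so that a bad export is
# caught before anything is ingested.
def code_to_noid(csv_text, code_column: 'Code', noid_column: 'Noid')
  CSV.parse(csv_text, headers: true).each_with_object({}) do |row, map|
    code = row[code_column]&.strip
    noid = row[noid_column]&.strip
    next if code.nil? || code.empty? || noid.nil? || noid.empty?
    raise ArgumentError, "duplicate code #{code}" if map.key?(code)

    map[code] = noid
  end
end

# Made-up values for illustration only.
sample = "Code,Noid\nP0001,abc123\nP0002,def456\n"
puts code_to_noid(sample).inspect
```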

Is it possible to access Swift from an application server where the ingest happens? How will we do authorization/authentication? Or who/when/how will we stage this material? For the FolkFest materials Sarah and Tianyu moved 11 GB to https://drive.google.com/drive/folders/17ckV-is6Mh_ZA_tPQIHjZPwayxz_T6BN

We came up with three ideas

1. Using geoffry as a staging environment
2. pmpy
3. Open Stack Swift read only access from the application server

ACTION: Henry will set Tricia up on geoffy with access to the OpenStack Swift API

My hope is that we will be able to do # 3 when this gets rolling

What format does a download from swift come in? I'm assuming that it's compressed?

Swift downloads directories without compression exactly as Sarah/Tianyu provided above
i.e. `swift download peel --prefix=<noid prefix>`

`noid/type [tiff/jpeg/etc]/1.tar`

Some Swift directories won't have PDFs; the Images, for example, will be TIFF or JPEG.

What archival format is appropriate for our attached historical archive?

ACTION: Kenton will follow up with Tricia about this.

sarahseverson · 2021-08-03T15:40:43Z

re: 1 - here is an example of a spreadsheet for a Newspaper upload to IA where we have a NOID, code (three letters) and some individual item-level metadata (year, month, day, pages)

pgwillia · 2021-08-09T15:48:52Z

We had further discussion about the historical_archive/AIP this morning.

- we're worried about storage costs in the cloud
- don't know if we'll be able to get to the point where we will delete the duplicate packages
- looking at olark?
- will look at connecting/linking both instances with metadata so that it's possible to do updates/evergreening if the content moves

Next steps:
@sfarnel will look at the appropriate metadata fields and update this issue
@pgwillia will modify the model to replace the attached historical_archive with a metadata property

sfarnel · 2021-08-09T15:50:19Z

Based on discussion with @kgood @pgwillia @sarahseverson

Recording the metadata aspect of the discussion here so that it is captured and can then be incorporated into our AIP specifications for PMPY. The new AIP generated on ingest will include in the metadata a link to the original (i.e., legacy) AIP in the form of a PREMIS relationship type property. The metadata should be: relSubType:sup [PID of legacy AIP], where the PID is likely to be UUID (as this is how we are naming AIPs) and, for reference, the full URI for the property is https://id.loc.gov/vocabulary/preservation/relationshipSubType/sup (which indicates that the new AIP supersedes the original or legacy AIP).

We also discussed the desire for recording the container in which the original or legacy AIP is located, which can also be captured through a metadata property: relators:rps [name of storage container], where the name of the storage container could be something like OpenStack/Swift, and the full URI for the property is http://id.loc.gov/vocabulary/relators/rps (which indicates that it is a repository for the object)
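
To make the two properties concrete, here is a sketch that writes them out as N-Triples. The predicate URIs are the ones given above. The subject URI, the example UUID, the method names, and the choice to record the legacy PID as a plain literal are assumptions for illustration, not part of the AIP specification.

```ruby
SUPERSEDES = 'https://id.loc.gov/vocabulary/preservation/relationshipSubType/sup'
REPOSITORY = 'http://id.loc.gov/vocabulary/relators/rps'

# Quotes a string as an N-Triples literal, escaping backslashes and quotes.
def nt_literal(value)
  escaped = value.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }
  "\"#{escaped}\""
end

# Returns the two statements linking a new AIP to its legacy AIP and to the
# storage container that holds the legacy copy.
def legacy_aip_triples(new_aip_uri:, legacy_aip_pid:, container:)
  [
    "<#{new_aip_uri}> <#{SUPERSEDES}> #{nt_literal(legacy_aip_pid)} .",
    "<#{new_aip_uri}> <#{REPOSITORY}> #{nt_literal(container)} ."
  ]
end

# Hypothetical identifiers.
puts legacy_aip_triples(
  new_aip_uri: 'https://example.org/aip/new',
  legacy_aip_pid: '123e4567-e89b-12d3-a456-426614174000',
  container: 'OpenStack/Swift'
)
```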

pgwillia mentioned this issue Jan 11, 2021

Digitization Folk Fest Programs Show Views #2080

Open

1 task

pgwillia mentioned this issue Jun 3, 2021

Add Digitization::Book ingest artifacts to model #2373

Merged

pgwillia closed this as completed Aug 26, 2021

pgwillia reopened this Aug 26, 2021

pgwillia mentioned this issue Sep 22, 2021

Review Changes with Stakeholders before 2.1 deploy #2524

Closed

23 tasks

pgwillia self-assigned this Oct 12, 2021

pgwillia mentioned this issue Nov 1, 2021

Task to run reports for digitization batch ingest #2612

Merged

pgwillia mentioned this issue Nov 25, 2021

Add task that will kick off jobs for artifacts #2638

Closed

pgwillia mentioned this issue Dec 6, 2021

Digitization ACN Ingest #2650

Open
