Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Descriptive spreadsheet import should not accept title1.structuredValue1.value without title*.structuredValue*.type #4279

Open
andrewjbtw opened this issue Nov 17, 2023 · 25 comments
Assignees
Labels

Comments

@andrewjbtw
Copy link

Describe the bug
Descriptive metadata spreadsheet import is allowing imports of spreadsheets where there is a value in title1.structuredValue1.value but no type in title1.structuredValue1.type. This is causing indexing and publishing failures.

Example spreadsheet to reproduce the issue on stage:
descriptive-bx947ty9936.csv
https://argo-stage.stanford.edu/view/druid:bx947ty9936

User Impact
Users should be notified of metadata validation errors at the time of import. Without that, hidden errors cause problems later in accessioning including preventing items from being made accessible on the purl.

Expected behavior
Validation should prevent invalid titles from being created.

@andrewjbtw andrewjbtw added the bug label Nov 17, 2023
@ndushay
Copy link
Contributor

ndushay commented Dec 15, 2023

this may be because validation is turned off right now.

@justinlittman
Copy link
Contributor

@andrewjbtw Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?

@andrewjbtw
Copy link
Author

We turned on type validation and it's still possible to upload titles without a type. I just verified it by following the steps in the ticket.

@andrewjbtw
Copy link
Author

I'm guessing that the reason type validation doesn't address this situation is because there's no type at all rather than an invalid type. I tried an invalid type and the single-item upload was able to catch that.

re:

Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?

I'm hesitant to make assumptions about the type because cases where someone is providing a title type are likely to be more complex structured titles. So it could really be "main title" or "subtitle" or another option. It also doesn't look like plain title is a valid type.

@lwrubel
Copy link
Contributor

lwrubel commented Jan 30, 2024

This is addressed with sul-dlss/cocina-models#675. I'll be bumping cocina-models and releasing, and then uploading a descriptive metadata spreadsheet to confirm validation happens as expected. I'll keep the ticket open until then.

@lwrubel
Copy link
Contributor

lwrubel commented Feb 1, 2024

I don't think the addition to cocina validation is catching this as expected in a spreadsheet upload. If I upload a spreadsheet with an empty title1.structuredValue1.type, the row seems to be making it past the CocinaValidator in

.bind { |description| CocinaValidator.validate_and_save(@cocina, description:) }

It appears to moves on to display_success and then errors in immediate reindexing: https://app.honeybadger.io/projects/49898/faults/103891642#notice-context

@lwrubel
Copy link
Contributor

lwrubel commented Feb 5, 2024

There's some background info in this Slack thread about the DescriptionValuesValidator not validating because of the hash keys type. When updating the validator and running against prod dor-services-app, there were 244 druids that were invalid:
validate-cocina.csv

@arcadiafalcone, would you take a look at these druids?

@arcadiafalcone
Copy link
Collaborator

arcadiafalcone commented Feb 7, 2024

@lwrubel Could you rerun this validation report and include the FOLIO HRIDs? The first few I looked at were converted from MARC, which means this is at least partly a MARC/XSLT problem. Including the collection druid would also be helpful in grouping similar items together for remediation. Thanks!

@lwrubel
Copy link
Contributor

lwrubel commented Feb 7, 2024

@arcadiafalcone here's an updated CSV with the FOLIO HRIDs. Hope this helps!
validate-cocina-with-hrids.csv

@lwrubel
Copy link
Contributor

lwrubel commented Feb 7, 2024

I see you also requested the collection druid, and I hadn't grabbed that. I'll still do that and send an update.

@lwrubel
Copy link
Contributor

lwrubel commented Feb 7, 2024

Thanks for your patience. Here's the report, with collection druids as well:

validate-cocina-hrids-colls-argo-4279.csv

@arcadiafalcone
Copy link
Collaborator

Thanks @lwrubel. Initial analysis is that 20 of these are Cocina-native, the rest were converted from MARC.

Six of the SDR ones are from H2 and have the same issue because the app is generating the same date range twice - once as a structuredValue and once as a single value using EDTF date range notation. We should have either one or the other of these. (cc @amyehodge)
wr164nz3618
gx970nw8882
hj910dt9988
vs407gj3803
yn038px7249
yy114sp3047

I'll work on fixing the remaining 14 SDR items, and then go spelunking in MARC.

@arcadiafalcone
Copy link
Collaborator

For the SDR items, I was able to fix 7, and am contacting the content owners about fixing the other 7.

@amyehodge
Copy link

Thanks for the heads up @arcadiafalcone. Let me know if you need anything specific from me. Do we need to file a ticket to make any changes in H2?

@lwrubel
Copy link
Contributor

lwrubel commented Feb 8, 2024

@amyehodge Might be worth creating a ticket to make 100% sure that we are not still generating invalid dates like this. FWIW, I'm not sure that we are still creating values like these invalid ones when generating the cocina descriptive in H2. I couldn't find that in the DateGenerator code and couldn't generate a similar date value in QA: https://argo-qa.stanford.edu/view/yz466fs4308. But it's possible I missed something.

@amyehodge
Copy link

@arcadiafalcone All of these are items that were migrated over from hydrus. One of them is in the Hopkins Marine Station collection and is “owned” by Don Kohrs, who works in the library — this one I last modified to remediate the date. The other 5 are in the Mapping the Republic of Letters collection that is managed by Nicole Coleman. All 5 of these were also migrated from hydrus and also all have the last update in H2 that includes a change to the dates.

Ideally, we'd want the changes to happen in H2 and not just via Argo, so that H2 has the correct metadata.

@arcadiafalcone
Copy link
Collaborator

@lwrubel @amyehodge From my memory of the process it makes sense that this would be an artifact of migration and not an issue for new records. Amy, is there a way to redeposit these items from H2 to SDR? I suspect the issue is in generating the Cocina rather than in what is stored internally in H2, so that may be enough to fix the items.

@amyehodge
Copy link

@arcadiafalcone All of these can be redeposited in H2, but they all were redeposited with changes to their dates via H2 between Sep 15 and Sep 28, 2022, so I'm not clear that it will help? But I could try with one of them to see perhaps.

@amyehodge
Copy link

@arcadiafalcone I've redeposited wr164nz3618 without making any changes. Please check to see if that fixed it. Thanks.

@arcadiafalcone
Copy link
Collaborator

@amyehodge It looks like that worked!

@amyehodge
Copy link

Oh, sweet. Okay. I'll do the others as well.

@amyehodge
Copy link

@arcadiafalcone These have all been redeposited. Please let me know if these all went through okay. When I'm sure we're done messing with them, I'll email the content owners to let them know.

@arcadiafalcone
Copy link
Collaborator

@amyehodge Looks like they all updated successfully!

@amyehodge
Copy link

Users have been notified of the changes for these 6 items.

@arcadiafalcone
Copy link
Collaborator

@lwrubel @andrewjbtw 138 of these items are Japanese military maps with records coming from MARC. I registered one object in stage with the HRID and the cocina looks OK for that item. What I suspect is that there was a bug in the MODS-cocina transformation at the time these records were created that has since been fixed.

Unfortunately these are digital serials records so I can't just refresh them all from source without redoing all the digital serials labeling. Is there a way to take the existing MODS (which is correct and has the labels) and use it to regenerate the cocina? It would also be helpful to convert the existing MODS created with an old XSLT to cocina for a sample record? This would allow me to confirm that it's not the old MARC2MODS XSLT that is the issue (which I highly doubt because this is error is not possible to do in MODS, but any weirdness is possible with XSLT).

Sample item: wr904cm7589 (and hj735vc4271 on argo-test)

Note to myself because this is complicated:
Production: MARC converted to MODS 3.4 with old XSLT, labels added, MODS converted to cocina probably at migration and not updated since
Test: MARC converted to MODS 3.7 with current XSLT, MODS converted to cocina with current code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Under Review (Unordered)
Development

No branches or pull requests

6 participants