Subject strings concatenated #1449

Closed
sfarnel opened this issue Jan 31, 2020 · 5 comments
@sfarnel
Member

sfarnel commented Jan 31, 2020

Describe the bug
Some items in ERA have all of the subject terms concatenated into a single string.

To Reproduce
For example, https://era.library.ualberta.ca/items/0d9000f4-3c7e-4221-9f26-f5faf0a1d1bb

Expected behavior
Each subject word or phrase should be a single item in the list

Additional context
A list of impacted items (based on a triplestore query) is attached. @weiweishi This happened previously for another collection and was cleaned up via script, so I'm hoping the same can be done again.
ERA_subject_issue.txt
Thanks!

@pgwillia
Member

Something like

namespace :jupiter do
  desc 'Find subjects that have been concatenated together and split them on the comma separator'
  task unconcatenate_subjects: :environment do
    puts 'Unconcatenating subjects...'
    (Item.all + Thesis.all).each do |item|
      print '.'
      # Skip records whose subjects contain no commas
      next unless item.subject.any? { |subject| subject.include?(',') }

      print '-'
      # Split every subject on commas (not only the first) so no terms are dropped
      item.subject = item.subject.flat_map { |subject| subject.split(/,\s*/) }
      item.save!
    end
    puts 'Subjects unconcatenated!'
  end
end

will look at all the Items and Theses and un-concatenate their subjects. This assumes that there aren't any commas purposefully embedded in subjects. @sfarnel, is there a way to know if this assumption is correct? Alternatively, I could just use the provided ERA_subject_issue.txt to get the UUIDs that you've mentioned.

I could prepare this as a one-off rake task (in master or integration_postmigration) or run it in the Rails console. I'm not certain what we've done in the past, or what we want to do in the future. I would appreciate some guidance or history from @mbarnett and @weiweishi.

@sfarnel
Member Author

sfarnel commented Apr 27, 2020

Thanks @pgwillia!
Many items and theses in ERA have Library of Congress Subject Headings, which make legitimate use of commas in many instances, so I think it is safer to use the list provided to do a targeted fix. We can then run another query to find any additional items that may have sneaked in.
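
To illustrate the concern, here is a minimal sketch in plain Ruby (no Rails) comparing a naive comma split with a targeted one. The helper names, record hashes, and ids are hypothetical and only stand in for real Items and Theses; "Education, Higher" is used as an example of an inverted heading where the comma is intentional.

```ruby
require 'set'

# Hypothetical helper: split each subject string on commas.
def split_on_commas(subjects)
  subjects.flat_map { |subject| subject.split(/,\s*/) }
end

# Hypothetical helper: only touch records whose id is in the reviewed list.
def targeted_split(records, affected_ids)
  records.map do |record|
    next record unless affected_ids.include?(record[:id])

    record.merge(subject: split_on_commas(record[:subject]))
  end
end

records = [
  { id: 'a1', subject: ['Biology, Chemistry, Physics'] }, # concatenated by mistake
  { id: 'b2', subject: ['Education, Higher'] }            # comma is part of the heading
]

# The naive split breaks the second record's heading into two terms.
naive = records.map { |record| split_on_commas(record[:subject]) }

# The targeted split leaves records outside the reviewed list untouched.
targeted = targeted_split(records, Set['a1'])

p naive
p targeted.map { |record| record[:subject] }
```

The targeted version depends entirely on the accuracy of the id list, which is why a follow-up query for stragglers is still useful.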

@pgwillia
Member

pgwillia commented Apr 28, 2020

@weiweishi said

Sorry I didn't have any formalized script I could share. It has always been a direct Rails Console fix in the past
Most of the batches we need to work on are smaller than 1,000 items and I've never run into any issues.
I have always output the IDs, title, and a link and some basic information as a base report for metadata team to review.

I gave @henryzhang87 the heads-up that I was working on this. I'll continue with Weiwei's report in mind.

My plan is to practice, practice and then execute.

@pgwillia
Member

pgwillia commented Apr 28, 2020

After chatting with @mbarnett, the solution I'm going to pursue is to create a migration that will run after the Fedora migration. The steps I'll follow are:

1. Add a new column.
2. For each item, copy the old subject to the new column, unless it's one of the 626 items, in which case generate new subjects from the original subject column and store them in the new column.
3. Generate a text file/log covering each of the 626 items with id, old value, and new values (plus anything else they need).
4. Pull the log file off the server after deploy and send it to metadata so they can verify everything looks good.
5. On the next deploy, drop the old column and rename the new one to the old name.
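
The copy-or-split step and the review log can be sketched as follows. This is only an illustration in plain Ruby under stated assumptions: `Row` and `deconcatenate` are hypothetical stand-ins for the real tables and migration code, and the tab-separated log format is an assumption rather than what was actually sent to the metadata team.

```ruby
require 'set'

# Hypothetical stand-in for one database row with the old and new columns.
Row = Struct.new(:id, :subject, :deconcatenated_subject)

# Fill the new column for every row. Rows in the reviewed id list are split
# on commas and logged; all other rows are copied unchanged.
# Returns the log lines (id, old value, new value) for the changed rows.
def deconcatenate(rows, affected_ids)
  rows.each_with_object([]) do |row, log|
    if affected_ids.include?(row.id)
      row.deconcatenated_subject = row.subject.flat_map { |s| s.split(/,\s*/) }
      log << [row.id, row.subject.inspect, row.deconcatenated_subject.inspect].join("\t")
    else
      row.deconcatenated_subject = row.subject.dup
    end
  end
end

rows = [
  Row.new('a1', ['Biology, Chemistry']),
  Row.new('b2', ['Education, Higher'])
]

# Print one line per changed row for review.
puts deconcatenate(rows, Set['a1'])
```

Keeping the old column untouched until the log has been reviewed means the change can still be corrected before the column is dropped and renamed.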

@mbarnett
Contributor

Just sent you the log of the items for review, Tricia

pgwillia added a commit that referenced this issue Jan 19, 2021
## Context
In the first part of this data cleanup we created a new column, `deconcatenated_subject`, and copied all subjects into it, cleaning up the ones we had identified.

Now we want to replace the subjects column with these values.

#1449 #1627 

## What's New

We remove the `subject` column and rename `deconcatenated_subject` to `subject`. We also clean up the ERA_subjects_issue.txt file and the reference to it in the previous migration.