Subject strings concatenated #1449

sfarnel · 2020-01-31T19:25:00Z

Describe the bug
Some items in ERA have all of the subject terms concatenated into a single string.

To Reproduce
For example, https://era.library.ualberta.ca/items/0d9000f4-3c7e-4221-9f26-f5faf0a1d1bb

Expected behavior
Each subject word or phrase should be a single item in the list

Additional context
A list of impacted items (based on triplestore query) is attached. @weiweishi This had happened for another collection previously and was cleaned up via script so hoping the same can be done again.
ERA_subject_issue.txt
Thanks!

pgwillia · 2020-04-24T23:03:37Z

Something like

namespace :jupiter do
  desc 'Find subjects that have been concatenated together and split them on the comma separator'
  task unconcatenate_subjects: :environment do
    puts 'Unconcatenating subjects...'
    (Item.all + Thesis.all).each do |item|
      print '.'
      # Skip records where no subject contains a comma
      next unless item.subject.any? { |subject| subject.include?(',') }

      print '-'
      # Split every subject (not only the first) and drop empty fragments
      item.subject = item.subject.flat_map { |subject| subject.split(/,\s*/) }.reject(&:blank?)
      item.save!
    end
    puts 'Subjects unconcatenated!'
  end
end

will look at all the Items and Theses and un-concatenate Subjects. This assumes that there aren't any commas purposefully embedded in subjects. @sfarnel, is there a way to know if this assumption is correct? Alternatively, I could just use the provided ERA_subject_issue.txt to get the UUIDs that you've mentioned.

I could prepare this as a one-off rake task (in master or integration_postmigration) or run it in the Rails console. I'm not certain what we've done in the past, or what we want to do in the future. I would appreciate some guidance or history from @mbarnett @weiweishi

sfarnel · 2020-04-27T14:25:48Z

Thanks @pgwillia!
Many items and theses in ERA have Library of Congress Subject Headings, which make legitimate use of commas in many instances, so I think it is safer to use the list provided to do a targeted fix. We can then run another query to find any additional ones that may have sneaked in.
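A minimal plain-Ruby sketch of the targeted fix described above, for illustration only. It assumes the attached list holds one item UUID per line (the actual layout of ERA_subject_issue.txt is not shown in this thread), and the helper names `load_target_ids`, `split_subjects` and `fix_subjects` as well as the sample ids and subjects are hypothetical.

```ruby
require 'set'

# Assumption: the list contains one item UUID per line; blank lines are ignored.
def load_target_ids(text)
  text.each_line.map(&:strip).reject(&:empty?).to_set
end

# Split each subject string on commas, trim whitespace, drop empty fragments.
def split_subjects(subjects)
  subjects.flat_map { |subject| subject.split(/\s*,\s*/) }
          .map(&:strip)
          .reject(&:empty?)
end

# Only records whose id is on the list are split; everything else is
# returned unchanged, so headings with legitimate commas survive.
def fix_subjects(id, subjects, target_ids)
  return subjects unless target_ids.include?(id)

  split_subjects(subjects)
end

# Hypothetical ids and subjects, not taken from ERA.
target_ids = load_target_ids("item-a\nitem-c\n")
p fix_subjects('item-a', ['Bees, Honey, Pollination'], target_ids)
p fix_subjects('item-b', ['Education, Higher'], target_ids)
```

The second call leaves the subject untouched because `item-b` is not on the list, which is the reason for preferring the targeted fix over a blanket split.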

pgwillia · 2020-04-28T20:41:44Z

@weiweishi said

Sorry I didn't have any formalized script I could share. It has always been a direct Rails Console fix in the past
Most of the batches we need to work on are smaller than 1,000 items and I've never run into any issues.
I have always output the IDs, title, and a link and some basic information as a base report for metadata team to review.

I gave @henryzhang87 the heads-up that I am working on this. I'll continue with Weiwei's report in mind.
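As a rough sketch of the kind of base report Weiwei describes (ID, title, link and some basic information), something like the following could work. The column names, the `' | '` separator between subjects and the `subject_report` helper are assumptions for illustration; the link pattern is modelled on the item URL in the original report.

```ruby
require 'csv'

# Link pattern modelled on the item URL quoted in the bug report.
BASE_URL = 'https://era.library.ualberta.ca/items'

# Build a CSV report with one row per changed record.
# Each row hash is expected to have :id, :title, :old and :new keys
# (hypothetical shape, chosen for this sketch).
def subject_report(rows)
  CSV.generate do |csv|
    csv << %w[id title link old_subjects new_subjects]
    rows.each do |row|
      csv << [
        row[:id],
        row[:title],
        "#{BASE_URL}/#{row[:id]}",
        row[:old].join(' | '),
        row[:new].join(' | ')
      ]
    end
  end
end

# Hypothetical sample row, not taken from ERA.
puts subject_report([
  { id: 'item-a', title: 'Sample title', old: ['Bees, Honey'], new: %w[Bees Honey] }
])
```

Using the CSV library rather than joining strings by hand keeps titles that contain commas or quotes intact when the metadata team opens the file.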

My plan is to practice, practice and then execute.

pgwillia · 2020-04-28T22:59:01Z

After chatting with @mbarnett, the solution I'm going to pursue is to create a migration that will run after the Fedora migration. Basically, the steps I'll follow are:

1. Add a new column.
2. For each item, copy the old subject to the new column, unless it is one of the 626 items; in that case, generate new subjects from the original subject column and store them in the new column.
3. Generate a text file/log with an entry for each of the 626 items: id, old value, new values (+ anything else they need).
4. Pull the log file off the server after deploy and send it to metadata so they can verify everything looks good.
5. On the next deploy, drop the old column and rename the new one to the old name.
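The copy-and-log part of this plan can be sketched in plain Ruby, without a database. This is an in-memory illustration under stated assumptions, not the actual migration: `Record` is a hypothetical stand-in for a database row, `deconcatenate!` is a made-up helper, and the tab-separated log format is only one possible choice.

```ruby
# Hypothetical stand-in for a row with the old and the new column.
Record = Struct.new(:id, :subject, :deconcatenated_subject, keyword_init: true)

# Fill the new column for every record and return log lines for the
# affected ones. The old column is left untouched so it can be compared
# against the new values before it is dropped in a later deploy.
def deconcatenate!(records, affected_ids)
  log = []
  records.each do |record|
    if affected_ids.include?(record.id)
      record.deconcatenated_subject =
        record.subject.flat_map { |s| s.split(/\s*,\s*/) }.map(&:strip).reject(&:empty?)
      # One log line per affected record: id, old value, new values.
      log << "#{record.id}\t#{record.subject.inspect}\t#{record.deconcatenated_subject.inspect}"
    else
      # Unaffected records are copied over as-is.
      record.deconcatenated_subject = record.subject.dup
    end
  end
  log
end

# Hypothetical sample data, not taken from ERA.
records = [
  Record.new(id: 'item-a', subject: ['Bees, Honey']),
  Record.new(id: 'item-b', subject: ['Education, Higher'])
]
puts deconcatenate!(records, ['item-a'])
```

Keeping the old column until the log has been reviewed means a wrong split can still be corrected from the original values.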

mbarnett · 2020-12-17T22:48:23Z

Just sent you the log of the items for review, Tricia

## Context

In the first part of this data cleanup we created a new column `deconcatenated_subject` and copied all subjects into it, cleaning up the ones we had identified. Now we want to replace the subject column with these values.

#1449 #1627

## What's New

We remove the `subject` column and rename `deconcatenated_subject` to `subject`. We also clean up the ERA_subjects_issue.txt file and the reference to it in the previous migration.

sfarnel added bug data labels Jan 31, 2020

mbarnett added needs-estimate 5 and removed needs-estimate labels Mar 12, 2020

pgwillia self-assigned this Apr 23, 2020

This was referenced May 1, 2020

add migration to fix concatenated subjects (part 1) #1627

Merged

add migration to fix concatenated subjects (part 2) DONT MERGE YET #1630

Closed

pgwillia mentioned this issue Oct 8, 2020

add migration to fix concatenated subjects (part 2) #1931

Merged

pgwillia closed this as completed Jun 29, 2021

pgwillia mentioned this issue Sep 22, 2021

Review Changes with Stakeholders before 2.1 deploy #2524

Closed

23 tasks

pgwillia mentioned this issue May 18, 2023

ERA Batch Edit: BoardEx licensing terms #3136

Closed
