Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field method refactor for Ead transform #192

Merged
merged 1 commit into from
Jun 10, 2024

Conversation

jonavellecuerdo
Copy link
Contributor

Purpose and background context

Field method refactor for transform class Ead (Part 3).

  • Add field methods and corresponding unit tests for: rights, related_items, and subjects

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all unit tests are passing.

  2. Run CLI command

    pipenv run transform -i tests/fixtures/ead/ead_record_all_fields.xml -o output/ead-transformed-records.json -s aspace
    

    Output:

    2024-06-06 15:43:31,737 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
    2024-06-06 15:43:31,737 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
    2024-06-06 15:43:31,737 INFO transmogrifier.cli.main(): Running transform for source aspace
    2024-06-06 15:43:31,756 WARNING transmogrifier.sources.transformer.get_valid_title(): Record repositories/2/resources/1 has multiple titles. Using the first title from the following titles found: ['Charles J. Connick Stained Glass Foundation Collection', 'Title 2', 'Title 3']
    2024-06-06 15:43:31,779 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
    2024-06-06 15:43:31,779 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.042270
    

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jun 6, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review June 6, 2024 19:44
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking from me, so setting as approved.

Did leave a couple comments around "related materials", but mostly just observations and discussion.

Comment on lines +499 to +501
collection_description = cls._get_collection_description(source_record)

for related_item_element in collection_description.find_all(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't recall what the convention was in this file, but could optionally just iterate over the method call directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I've generally been iterating over a variable when a .find_all(...) or other call is needed for iteration , for related_item_element in collection_description.find_all() rather than for related_item_element in cls._get_collection_description(source_record).find_all()

Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes! I've been doing the same as @ehanson8. If it is followed by a .find/.findall, assign to a variable! A bit easier to read for me, personally!

transmogrifier/sources/xml/ead.py Outdated Show resolved Hide resolved
@@ -557,6 +493,96 @@ def get_publishers(cls, source_record: Tag) -> list[timdex.Publisher] | None:
return [timdex.Publisher(name=publication_value)]
return None

@classmethod
def get_related_items(cls, source_record: Tag) -> list[timdex.RelatedItem] | None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading top down, you'll likely encounter this comment before the following which came first. But I must admit, this method broke my brain.

I had thought perhaps some of the if/else logic could be removed by targeting <relatedmaterial> in a list comprehension, and then the others the same way, but looking at the XML, I see now that all values in the test test_get_related_items_transforms_correctly_with_related_materials are coming from two <relatedmaterial> elements.

I have no doubt it's a faithful copying of pre-existing logic, and probably not worth a rework here, but this was hard to parse for me. That said, maybe that's just a "me" problem, and kudos to the test for thwarting what I thought were some easy tweaks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill @ehanson

Figured I’d respond in more detail in this comment!

It also took me a second to wrap my head around the derivation for timdex.RelatedItems, but here is my understanding of the logic:

  • timdex.RelatedItem can be derived from of three (3), optional EAD elements:
    • altformavail
    • separatedmaterial
    • relatedmaterial
  • From the logic, it appears that:
    • There can be multiple relatedmaterial elements in a single EAD-formatted XML record.
    • The relatedmaterial element can comprise of a list of defined items (i.e., <list><defitem><label><item>) OR paragraphs (i.e., <p>).
  • Unclear if <altformavail> and <separatedmaterial> can appear multiple times in a single EAD-formatted XML record, but these elements seem to comprise of paragraphs (<p>)
  • The values you’re retrieving from these elements are:
    • altformavail and separatedmaterial = a single timdex.RelatedItem with the joined string of all the text from all the subelements that are not enclosed within a <head>
    • relatedmaterial =
      • If list of defined items: a timdex.RelatedItem with the joined string of all text enclosed within <defitem> descendants for every <defitem>
      • If paragraph(s): a timdex.RelatedItem with the joined string of all text enclosed within

        subelements for every <p>

Thoughts behind the updates introduced:

  • I did retain most of the logic. I struggled to think of alternatives to some of the cleverly written functions in this transform (e.g., create_string_from_mixed_value).
  • The main change was rather than create 2 for-loops, I included “related material” to the list of names passed to the collection.description.find_all() statement in a single for-loop.
  • I was considering creating a second private method for retrieving the values from the altformavail and separatedmaterial elements, but I didn’t because:
    • I couldn’t think of a good name to group these two elements (i.e., _get_related_items_from_altformavail_and_separatedmaterial)
    • The logic for these two elements is similar to that in non-private field methods:
    • I thought it would be confusing to create a private method for this condition (but to not store the logic in a private method for the other fields —content_type, contributor, languages, etc.) But maybe that’s just me.
  • I do think creating a separate test (with_related_items) might’ve contributed to the confusion, so I am thinking of making the following updates to test.
    • Include the relatedmaterial element derivations as part of the get_related_items test.
    • Create separate test for relatedmaterial elements with multiple <defitem>
    • Create separate test for related material elements with multiple <p>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the detailed breakdown! All tracks and makes sense.

At the end of the day, this block has appeared to do what it needs to do, so I'm definitely good with it as-is even. As dicussed here, and elsewhere, it's tricky given the variable nature of EAD.

If you did want to break out into more private, sub-methods, I might propose going pretty hard in that direction, like:

  • _get_related_from_relatedmaterial()
  • _get_related_from_altformavail()
  • _get_related_from_separatedmaterial()

Perhaps this is a throwback to a comment I made on one of @ehanson8's PRs about a variable name that smushed up two words. I'm appreciating now that it was probably because the element was named as such, so it feels accurate; and I agree. Apologies @ehanson8 if I misssed that nuance at the time! I feel as though, now, I would read these methods -- particularly when they were right next to one another when being called -- and understand their naming is reflective of the XML elements, e.g.

related_items = []
related_items.extend(_get_related_from_relatedmaterial())
related_items.extend(_get_related_from_altformavail())
related_items.extend(_get_related_from_separatedmaterial())

But... to your points @jonavellecuerdo about other similar methods not breaking it out into sub-methods like this, I think that's valid.

I've been appreciating during these PR reviews of the thoughtful consideration to the goal of the refactoring to fields methods vs general refactoring of logic that has not been revisited in awhile. I would vote for continuing to err on the side of not making too many sweeping changes during this first pass. Which is not to say changes that have been made were a bad idea, quite the contrary they all seemed good, just that if it gets thorny without a clear, swift improvement, maybe better to just leave for now. Just my two cents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see updated tests! Also decided to rename private method to _get_related_from_relatedmaterial. Although long, I concur that it is a great naming convention that clearly describes the XML element source!

return related_items or None

@classmethod
def _get_related_material(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment on these related methods: I still don't know how, or if, there is a more straight-forward way to parse these from those wild semantic + display XML elements, but noting these feel like areas that could be cleaned up a bit.

I think, for me, this refactor work is nice because it exposes some tricky corners like this, but are likely tricky for good reason because the logic is intricate. I don't see any need to address now, and maybe not ever, but I'm appreciating having some insight into corners of these transformations I was not familiar with previously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this should have an issue for later review, it's rough (coming from the likely author 🙃 who knows better now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking another look though, could it skip the related_materials = [] and just return:

return [
                timdex.RelatedItem(description=related_item)
                for subelement in subelements
                if (
                    related_item := cls.create_string_from_mixed_value(
                        subelement, separator=" ", skipped_elements=["head"]
                    )
                )
            ]

Apologies if I'm going down the rabbit hole you already went down @ghukill !

Copy link
Contributor

@ghukill ghukill Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tis a rabbit hole indeed! When I saw the XML getting parsed, I immediately appreciated some of the logic here. It's not easy to extract.

Your proposal above makes sense to me! Does this mean that elements ["altformavail", "separatedmaterial"] aren't needed? and if so, is this true across all EAD records?

As an aside, this is precisely where I think many-record analysis could be extremely helpful. If we could:

  • run against 100 EADs, save output
  • make change, run against same EADs, save output
  • have a convenient way to see the difference for this particular field

Lots of approaches come to mind, but the many-record comparison pattern itself is what I'm interested in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on the utility of record analysis in a situation like this! I was just referring to updating the _get_related_material method, I believe "altformavail and separatedmaterial would still be caught in the updated elif block you recommended for get_related_items. But this is also a sign that this should be flagged in an issue for later consideration since it's complicated!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 Made the change to remove related_materials = [] from _get_related_material() (and directly return the result from list comprehension instead.

@ghukill Many-record comparison sounds ideal! I think the attributes_and_subfield_variations fixture tries to address this, but it takes looking at the XML file to see the different conditions it tries to trigger. I definitely think a convenient way to see the outputs across many records and comparing different runs would be helpful for complex transforms!

Also, I tried to provide more context regarding the updates to get_related_material here #192 (comment)! A long read but hoped to provide more clarification!

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments!

transmogrifier/sources/xml/ead.py Show resolved Hide resolved
Comment on lines +499 to +501
collection_description = cls._get_collection_description(source_record)

for related_item_element in collection_description.find_all(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I've generally been iterating over a variable when a .find_all(...) or other call is needed for iteration , for related_item_element in collection_description.find_all() rather than for related_item_element in cls._get_collection_description(source_record).find_all()

transmogrifier/sources/xml/ead.py Outdated Show resolved Hide resolved
return related_items or None

@classmethod
def _get_related_material(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this should have an issue for later review, it's rough (coming from the likely author 🙃 who knows better now)

return related_items or None

@classmethod
def _get_related_material(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking another look though, could it skip the related_materials = [] and just return:

return [
                timdex.RelatedItem(description=related_item)
                for subelement in subelements
                if (
                    related_item := cls.create_string_from_mixed_value(
                        subelement, separator=" ", skipped_elements=["head"]
                    )
                )
            ]

Apologies if I'm going down the rabbit hole you already went down @ghukill !

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-4 branch from 556e5eb to 8e5e434 Compare June 7, 2024 18:35
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved! Thanks for the flexibility and revisions.

"<emph>stained</emph> glass artist whose work may be found "
"in cities all across the United States."
"""
Charles J. Connick was an American <emph>stained</emph> glass artist.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the <emph>! As we chatted about in our call, no real hard and fast rules about what to include, but the more we mimick stuff we might see, the better. Dig it.

</altformavail>
<separatedmaterial>
<head>Separated Materials</head>
<p>Issues of Twilight Zine were separated for library cataloging.</p>
</separatedmaterial>
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, ➕ to dropping the triple quotes, and sorry for missing on prior reviews. Glad we're establishing some patterns now as we continue into others (though as @ehanson8 pointed out, EAD is probably the most garish XML wise).

Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Add field methods and corresponding unit tests:
  related_items, rights, subjects

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-287
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-4 branch from 8e5e434 to 254036f Compare June 10, 2024 12:35
@jonavellecuerdo jonavellecuerdo merged commit 0372f2e into main Jun 10, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-287-ead-field-method-refactor-4 branch June 10, 2024 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants