-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Field method refactor for Ead transform #192
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing blocking from me, so setting as approved.
Did leave a couple comments around "related materials", but mostly just observations and discussion.
collection_description = cls._get_collection_description(source_record) | ||
|
||
for related_item_element in collection_description.find_all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't recall what the convention was in this file, but could optionally just iterate over the method call directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I've generally been iterating over a variable when a .find_all(...)
or other call is needed for iteration , for related_item_element in collection_description.find_all()
rather than for related_item_element in cls._get_collection_description(source_record).find_all()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes! I've been doing the same as @ehanson8. If it is followed by a .find
/.findall
, assign to a variable! A bit easier to read for me, personally!
@@ -557,6 +493,96 @@ def get_publishers(cls, source_record: Tag) -> list[timdex.Publisher] | None: | |||
return [timdex.Publisher(name=publication_value)] | |||
return None | |||
|
|||
@classmethod | |||
def get_related_items(cls, source_record: Tag) -> list[timdex.RelatedItem] | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading top down, you'll likely encounter this comment before the following which came first. But I must admit, this method broke my brain.
I had thought perhaps some of the if/else logic could be removed by targeting <relatedmaterial>
in a list comprehension, and then the others the same way, but looking at the XML, I see now that all values in the test test_get_related_items_transforms_correctly_with_related_materials
are coming from two <relatedmaterial>
elements.
I have no doubt it's a faithful copying of pre-existing logic, and probably not worth a rework here, but this was hard to parse for me. That said, maybe that's just a "me" problem, and kudos to the test for thwarting what I thought were some easy tweaks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured I’d respond in more detail in this comment!
It also took me a second to wrap my head around the derivation for timdex.RelatedItems
, but here is my understanding of the logic:
timdex.RelatedItem
can be derived from of three (3), optional EAD elements:altformavail
separatedmaterial
relatedmaterial
- From the logic, it appears that:
- There can be multiple
relatedmaterial
elements in a single EAD-formatted XML record. - The
relatedmaterial
element can comprise of a list of defined items (i.e.,<list><defitem><label><item>
) OR paragraphs (i.e.,<p>
).
- There can be multiple
- Unclear if
<altformavail>
and<separatedmaterial>
can appear multiple times in a single EAD-formatted XML record, but these elements seem to comprise of paragraphs (<p>
) - The values you’re retrieving from these elements are:
altformavail
andseparatedmaterial
= a singletimdex.RelatedItem
with the joined string of all the text from all the subelements that are not enclosed within a<head>
relatedmaterial
=- If list of defined items: a
timdex.RelatedItem
with the joined string of all text enclosed within<defitem>
descendants for every<defitem>
- If paragraph(s): a
timdex.RelatedItem
with the joined string of all text enclosed withinsubelements for every
<p>
- If list of defined items: a
Thoughts behind the updates introduced:
- I did retain most of the logic. I struggled to think of alternatives to some of the cleverly written functions in this transform (e.g.,
create_string_from_mixed_value
). - The main change was rather than create 2 for-loops, I included
“related material”
to the list of names passed to thecollection.description.find_all()
statement in a single for-loop. - I was considering creating a second private method for retrieving the values from the
altformavail
andseparatedmaterial
elements, but I didn’t because:- I couldn’t think of a good name to group these two elements (i.e.,
_get_related_items_from_altformavail_and_separatedmaterial
) - The logic for these two elements is similar to that in non-private field methods:
- I thought it would be confusing to create a private method for this condition (but to not store the logic in a private method for the other fields —content_type, contributor, languages, etc.) But maybe that’s just me.
- I couldn’t think of a good name to group these two elements (i.e.,
- I do think creating a separate test (
with_related_items
) might’ve contributed to the confusion, so I am thinking of making the following updates to test.- Include the
relatedmaterial
element derivations as part of theget_related_items
test. - Create separate test for
relatedmaterial
elements with multiple<defitem>
- Create separate test for
related material
elements with multiple<p>
- Include the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate the detailed breakdown! All tracks and makes sense.
At the end of the day, this block has appeared to do what it needs to do, so I'm definitely good with it as-is even. As dicussed here, and elsewhere, it's tricky given the variable nature of EAD.
If you did want to break out into more private, sub-methods, I might propose going pretty hard in that direction, like:
_get_related_from_relatedmaterial()
_get_related_from_altformavail()
_get_related_from_separatedmaterial()
Perhaps this is a throwback to a comment I made on one of @ehanson8's PRs about a variable name that smushed up two words. I'm appreciating now that it was probably because the element was named as such, so it feels accurate; and I agree. Apologies @ehanson8 if I misssed that nuance at the time! I feel as though, now, I would read these methods -- particularly when they were right next to one another when being called -- and understand their naming is reflective of the XML elements, e.g.
related_items = []
related_items.extend(_get_related_from_relatedmaterial())
related_items.extend(_get_related_from_altformavail())
related_items.extend(_get_related_from_separatedmaterial())
But... to your points @jonavellecuerdo about other similar methods not breaking it out into sub-methods like this, I think that's valid.
I've been appreciating during these PR reviews of the thoughtful consideration to the goal of the refactoring to fields methods vs general refactoring of logic that has not been revisited in awhile. I would vote for continuing to err on the side of not making too many sweeping changes during this first pass. Which is not to say changes that have been made were a bad idea, quite the contrary they all seemed good, just that if it gets thorny without a clear, swift improvement, maybe better to just leave for now. Just my two cents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see updated tests! Also decided to rename private method to _get_related_from_relatedmaterial
. Although long, I concur that it is a great naming convention that clearly describes the XML element source!
transmogrifier/sources/xml/ead.py
Outdated
return related_items or None | ||
|
||
@classmethod | ||
def _get_related_material( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comment on these related methods: I still don't know how, or if, there is a more straight-forward way to parse these from those wild semantic + display XML elements, but noting these feel like areas that could be cleaned up a bit.
I think, for me, this refactor work is nice because it exposes some tricky corners like this, but are likely tricky for good reason because the logic is intricate. I don't see any need to address now, and maybe not ever, but I'm appreciating having some insight into corners of these transformations I was not familiar with previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this should have an issue for later review, it's rough (coming from the likely author 🙃 who knows better now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking another look though, could it skip the related_materials = []
and just return:
return [
timdex.RelatedItem(description=related_item)
for subelement in subelements
if (
related_item := cls.create_string_from_mixed_value(
subelement, separator=" ", skipped_elements=["head"]
)
)
]
Apologies if I'm going down the rabbit hole you already went down @ghukill !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tis a rabbit hole indeed! When I saw the XML getting parsed, I immediately appreciated some of the logic here. It's not easy to extract.
Your proposal above makes sense to me! Does this mean that elements ["altformavail", "separatedmaterial"]
aren't needed? and if so, is this true across all EAD records?
As an aside, this is precisely where I think many-record analysis could be extremely helpful. If we could:
- run against 100 EADs, save output
- make change, run against same EADs, save output
- have a convenient way to see the difference for this particular field
Lots of approaches come to mind, but the many-record comparison pattern itself is what I'm interested in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on the utility of record analysis in a situation like this! I was just referring to updating the _get_related_material
method, I believe "altformavail
and separatedmaterial
would still be caught in the updated elif
block you recommended for get_related_items
. But this is also a sign that this should be flagged in an issue for later consideration since it's complicated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 Made the change to remove related_materials = []
from _get_related_material()
(and directly return the result from list comprehension instead.
@ghukill Many-record comparison sounds ideal! I think the attributes_and_subfield_variations
fixture tries to address this, but it takes looking at the XML file to see the different conditions it tries to trigger. I definitely think a convenient way to see the outputs across many records and comparing different runs would be helpful for complex transforms!
Also, I tried to provide more context regarding the updates to get_related_material
here #192 (comment)! A long read but hoped to provide more clarification!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments!
collection_description = cls._get_collection_description(source_record) | ||
|
||
for related_item_element in collection_description.find_all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I've generally been iterating over a variable when a .find_all(...)
or other call is needed for iteration , for related_item_element in collection_description.find_all()
rather than for related_item_element in cls._get_collection_description(source_record).find_all()
transmogrifier/sources/xml/ead.py
Outdated
return related_items or None | ||
|
||
@classmethod | ||
def _get_related_material( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this should have an issue for later review, it's rough (coming from the likely author 🙃 who knows better now)
transmogrifier/sources/xml/ead.py
Outdated
return related_items or None | ||
|
||
@classmethod | ||
def _get_related_material( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking another look though, could it skip the related_materials = []
and just return:
return [
timdex.RelatedItem(description=related_item)
for subelement in subelements
if (
related_item := cls.create_string_from_mixed_value(
subelement, separator=" ", skipped_elements=["head"]
)
)
]
Apologies if I'm going down the rabbit hole you already went down @ghukill !
556e5eb
to
8e5e434
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved! Thanks for the flexibility and revisions.
"<emph>stained</emph> glass artist whose work may be found " | ||
"in cities all across the United States." | ||
""" | ||
Charles J. Connick was an American <emph>stained</emph> glass artist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on the <emph>
! As we chatted about in our call, no real hard and fast rules about what to include, but the more we mimick stuff we might see, the better. Dig it.
tests/sources/xml/test_ead.py
Outdated
</altformavail> | ||
<separatedmaterial> | ||
<head>Separated Materials</head> | ||
<p>Issues of Twilight Zine were separated for library cataloging.</p> | ||
</separatedmaterial> | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, ➕ to dropping the triple quotes, and sorry for missing on prior reviews. Glad we're establishing some patterns now as we continue into others (though as @ehanson8 pointed out, EAD is probably the most garish XML wise).
Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Add field methods and corresponding unit tests: related_items, rights, subjects Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-287
8e5e434
to
254036f
Compare
Purpose and background context
Field method refactor for transform class Ead (Part 3).
rights
,related_items
, andsubjects
How can a reviewer manually see the effects of these changes?
Run
make test
and verify all unit tests are passing.Run CLI command
Output:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)