Timx 287 ead field method refactor 1 #185

jonavellecuerdo · 2024-05-30T16:37:29Z

Purpose and background context

Field method refactor for transform class Ead (Part 1).

Added private"methods for key elements in EAD-formatted XML records (<archdesc>, <archdesc><did>, <archdesc><controlaccess>
Added field methods and corresponding unit tests for alternate_titles, citation, content_type

Note: For the time-being, I've added calls to the private methods at the beginning of get_optional_fields(), which is required by field derivations that will eventually be moved into their own field methods.

How can a reviewer manually see the effects of these changes?

Run make test and verify all unit tests are passing.

Run CLI command

pipenv run transform -i tests/fixtures/ead/ead_record_all_fields.xml -o output/ead-transformed-records.json -s aspace

Output:

2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): Running transform for source aspace
2024-05-30 12:33:59,930 WARNING transmogrifier.sources.transformer.get_valid_title(): Record repositories/2/resources/1 has multiple titles. Using the first title from the following titles found: ['Charles J. Connick Stained Glass Foundation Collection', 'Title 2', 'Title 3']
2024-05-30 12:33:59,930 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
2024-05-30 12:33:59,930 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.016147

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-287

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

ghukill

Overall, I think it looks great! Left a request for docstrings in the helper methods, and a question, but overall looking good to me.

transmogrifier/sources/xml/ead.py

ghukill · 2024-05-30T17:35:57Z

transmogrifier/sources/xml/ead.py

@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None:
        return None

    @classmethod
-    def get_main_titles(cls, xml: Tag) -> list[str]:
+    def _get_collection_description(cls, source_record: Tag) -> Tag:


I would propose a small docstring here just to explain that it's widely used for TIMDEX field values, and therefore makes sense to define this.

Docstrings have been added!

ghukill · 2024-05-30T17:36:15Z

transmogrifier/sources/xml/ead.py

+        raise SkippedRecordEvent(message)
+
+    @classmethod
+    def _get_collection_description_did(cls, source_record: Tag) -> Tag:


Same here: small docstring could be helpful to explain why.

Docstrings have been added!

ehanson8

Very good, a few suggestions!

ehanson8 · 2024-05-30T18:00:55Z

tests/sources/xml/test_ead.py

@@ -1,10 +1,61 @@
+# ruff: noqa: E501


Perhaps just a personal opinion but I think it's preferable to not ignore E501 for a module (or even a line really). For the xml_string var in create_ead_source_record_stub, you can use spaces instead of tabs and you can break up the long citation strings like this:

assert Ead.get_citation(source_record) == ( "Charles J. Connick Stained Glass Foundation Collection, " "VC-0002, box X. Massachusetts Institute of Technology, " "Department of Distinctive Collections, Cambridge, Massachusetts." )

Good idea! On a somewhat related note (and is also why I hesitated to reformat the strings):

When I save on an open XML file, I have a tool in my IDE that reformats the XML. It results in newlines being added throughout the XML file, which then causes the tests to fail.

@ehanson8 @ghukill Do you think it's worth creating an issue to whitespace/newline-proof the XML transforms? 🤔

We'd probably want to preserve whitespace inside an element (its text), as this is often intentional by the record creator. I think it would fundamentally change the nature of the content if we introduce newline characters inside a long string of text for an element.

But reformatting the XML in a way that just moves elements around, but not element text, that seems okay / good.

I think I'll forego creating an issue for now as the newlines are resulting from me not turning off the XML formatter in my IDE 😅 .

Circling back to original comment, I removed the syntax to ignore E501 errors and made updates accordingly!

ehanson8 · 2024-05-30T18:05:55Z

tests/sources/xml/test_ead.py

    assert ("has a date that couldn't be parsed: 'undated'") in caplog.text
    assert ("has a later start date than end date: '2001', '1999'") in caplog.text


-def test_ead_record_correct_identifiers_from_multiple_unitid(caplog):
+def test_ead_transform_with_multiple_unitid_gets_valid_ids():


Ead has more of these edge case tests than the other sources we've done so far, we might consider redoing these with stub records but not sure if that's worth the effort

@ehanson8 For this PR, I'll stick to just renaming the test and will revisit when I work on the get_identifiers field method.

That's totally fine, just food for thought!

ehanson8 · 2024-05-30T18:07:52Z

tests/sources/xml/test_ead.py

+            <archdesc level="collection">
+                <did>
+                    <unittitle>
+                        Charles J. Connick Stained Glass
+                        <emph>
+                            Foundation
+                            <emph>Collection</emph>
+                        </emph>
+                        <num>VC.0002</num>
+                    </unittitle>
+                    <unittitle></unittitle>
+                    <unittitle></unittitle>
+                </did>
+            </archdesc>


This is duplicated a lot, you might refactor so top level elements like <archdesc> aren't repeated in every test

@ehanson8 @ghukill Let me know what you think of the updates I made to the create_ead_source_record_stub method. It was the only way I could think of simplifying some of the metadata_insert arguments throughout the testing module (removes the <archdesc> and <archdesc><did> elements from the string.

In general, we want to avoid complex functions in the testing module, but maybe this is a simple enough method that it isn't a problem? 🤔

tests/sources/xml/test_ead.py

transmogrifier/sources/xml/ead.py

ehanson8 · 2024-05-30T18:24:17Z

transmogrifier/sources/xml/ead.py

+            citation := cls.create_string_from_mixed_value(
+                citation_element, " ", ["head"]


We may want to start using named args for this

I was considering this but felt conflicted with @ghukill 's comment on a previous PR, but perhaps the guideline @ghukill is only applicable for "methods with single positional arguments" or "when testing"?

In any case, I will update to use named args! I think create_string_from_mixed_value is nifty but is difficult to understand just by the name alone.

To clarify my comment a bit...

If I'm reading correctly, this is the signature:

def create_string_from_mixed_value( cls, xml_element: Tag, separator: str = "", skipped_elements: list[str] | None = None, )...

So anytime called, I'd expect to see:

required, positional argument xml_element is not named, but just passed positionally

the rest of them, passed as name arguments, because named in the signature

unless the defaults are okay, and then okay to omit

Apologies if I over-complicated anywhere, but I think it really boils down to:

if positional argument, pass as positional

if named argument, passed as named

Does that mesh with the discussion here?

So in this case, I'd expect:

cls.create_string_from_mixed_value( citation_element, separator = " ", skipped_elements = ["head"] )

Open to pushback, but this convention feels like a pretty widely followed convention?

Hmm, I think my confusion was I took "named arguments" (or keyword arguments) literally. 🤔 Technically, all the arguments are positional since the signature isn't as follows:

def create_string_from_mixed_value( cls, xml_element: Tag, *, separator: str = "", skipped_elements: list[str] | None = None, )

I'm definitely on board with the proposed convention; just curious if it's more accurate to say:

use keyword arguments in the method call when (1) overriding positional arguments with default values and (2) there are multiple arguments in the function signature

?

If I'm reading it right, this is the actual function definition:

@classmethod def create_string_from_mixed_value( cls, xml_element: Tag, separator: str = "", skipped_elements: list[str] | None = None, ) -> str: ...

Apologies if I'm being imprecise with my words here or earlier, but I would characterize the arguments like:

cls: required, placeholder, can kind of ignore

xml_element: required, positional

separator: because it has equals sign =, it's named, optional, with default

skipped_elements: same as separator

I take your point that technically all the arguments are positional, good point. But I suppose my position is that conventionally -- and this is open to debate for sure -- if an argument has an equals sign = than it's treated as a named argument, and therefore optional with a default value.

This SO post breaks it down nicely, and exposes that I have been lazy with my terminology.

If we ever do deal with functions that use *args or **kwargs, then I think it gets more subtle (to your example above), but for most signatures, I feel like it's often positional (no defaults) and then named/keyword (with defaults). My vote is to:

not use the arg name for positionals

do use the arg name for named/keyword

Thoughts? or am I just making a mountain out of a mole hill 😅?

Thank you for clarifying! I accept / understand your definitions and we are on the same page! I do wonder whether we should start using * or / in our function signatures to more explicitly distinguish positional vs named/keyword arguments (but that's beyond the scope of this PR 🤓). If I remember correctly, Ruff throws errors for boolean positional arguments (FBT001).

Circling back to original comment: changes have been made.

That feels right to me but I honestly am learning some of these conventions in these PR conversations 🙃

This conversation is the first time I've seen the / operator for function/method signatures, and helped clarify for me what the * operator does without an immediate following variable name like *args.

This helped me understand it more tacitly:

In [1]: def foo(a, b, *, c="horse"): ...: print(a,b,c) ...: In [2]: foo(42, 101, 999, 888, c='zebra') --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[2], line 1 ----> 1 foo(42, 101, 999, 888, c='zebra') TypeError: foo() takes 2 positional arguments but 4 positional arguments (and 1 keyword-only argument) were given In [3]: def bar(a, b, *args, c="horse"): ...: print(a,b,args,c) ...: In [4]: bar(42, 101, 999, 888, c='zebra') 42 101 (999, 888) zebra

I'm definitely not opposed to using * and / in signatures if we find them useful. Here is an edge case where I could see it being useful, when you want to force keyword argument or no positional:

def baz(*, a=1): pass In [6]: baz() 1 In [7]: baz(42) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[7], line 1 ----> 1 baz(42) TypeError: baz() takes 0 positional arguments but 1 was given In [8]: baz(a=101) 101

Ineresting stuff! In cases without * or /, glad we're honing in on some conventions, and curious if we ever run into situations where it would be nice to use them. I read an article this morning that I found pretty convincing that the main appeal is often APIs, where you want to finely control how functions are called by users potentially not familiar with the library.

ehanson8

Looks good to me and I'm glad this PR led to some learning for all of us! 🙂

ehanson8 · 2024-05-31T13:57:09Z

tests/sources/xml/test_ead.py

-    header_insert: str = "", metadata_insert: str = ""
+    header_insert: str = "",
+    metadata_insert: str = "",
+    parent_element: Literal["archdesc", "did"] = None,  # noqa: RUF013


This isn't too complex in my mind and saves space in the unit tests but I defer to @ghukill for final approval of this

ghukill

Looks good to me!

Thanks again for good discussions. I'm glad in these early updates, we're hashing out some of these conventions and details.

ghukill · 2024-05-31T14:11:37Z

transmogrifier/sources/xml/ead.py

@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None:
        return None

    @classmethod
-    def get_main_titles(cls, xml: Tag) -> list[str]:
+    def _get_collection_description(cls, source_record: Tag) -> Tag:


ghukill · 2024-05-31T14:11:56Z

transmogrifier/sources/xml/ead.py

+        For EAD-formatted XML, all of the information about a collection
+        that is relevant to the TIMDEX data model is
+        encapsulated within the <archdesc level="collection"> element.
+        If this element is missing, it suggests that there is a structural
+        error in the record.


Perfect, thanks.

Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Rename param 'xml' to 'source_record' * Add private methods for retrieving key elements: <archdesc>, <archdesc><did>, <controlaccess> * Raise SkippedRecordEvent when <archdesc> and <archdesc><did> are missing * Add field methods for alternate_titles, citation, content_type * Use named args for 'create_list/string_from_mixed_value' * Update tests * Rename tests * Add method to create source records from EAD-formatted XML for unit tests * Add tests for field methods alternate_titles, main_titles, citation, content_type * Avoid linting error E501 Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-287

jonavellecuerdo requested review from ehanson8 and ghukill May 30, 2024 16:37

jonavellecuerdo self-assigned this May 30, 2024

jonavellecuerdo marked this pull request as ready for review May 30, 2024 16:37

ghukill requested changes May 30, 2024

View reviewed changes

ehanson8 reviewed May 30, 2024

View reviewed changes

jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-1 branch 3 times, most recently from b8adf19 to 2c8339b Compare May 31, 2024 13:26

jonavellecuerdo requested review from ghukill and ehanson8 May 31, 2024 13:28

ehanson8 approved these changes May 31, 2024

View reviewed changes

ghukill approved these changes May 31, 2024

View reviewed changes

jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-1 branch from 2c8339b to 714e2cc Compare May 31, 2024 15:21

jonavellecuerdo merged commit 2b499cb into main May 31, 2024
3 checks passed

jonavellecuerdo deleted the TIMX-287-ead-field-method-refactor-1 branch May 31, 2024 15:22

jonavellecuerdo mentioned this pull request May 31, 2024

Timx 287 ead field method refactor 2 #187

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timx 287 ead field method refactor 1 #185

Timx 287 ead field method refactor 1 #185

jonavellecuerdo commented May 30, 2024 •

edited

Loading

ghukill left a comment

ghukill May 30, 2024

jonavellecuerdo May 31, 2024

ghukill May 31, 2024

ghukill May 30, 2024

jonavellecuerdo May 31, 2024

ehanson8 left a comment

ehanson8 May 30, 2024

jonavellecuerdo May 30, 2024

ghukill May 30, 2024

jonavellecuerdo May 30, 2024

ehanson8 May 30, 2024

jonavellecuerdo May 31, 2024

ehanson8 May 31, 2024

ehanson8 May 30, 2024

jonavellecuerdo May 31, 2024 •

edited

Loading

ehanson8 May 30, 2024

jonavellecuerdo May 30, 2024

ghukill May 30, 2024 •

edited

Loading

ghukill May 30, 2024

jonavellecuerdo May 30, 2024 •

edited

Loading

ghukill May 30, 2024 •

edited

Loading

jonavellecuerdo May 30, 2024 •

edited

Loading

ehanson8 May 30, 2024

ghukill May 31, 2024

ehanson8 left a comment

ehanson8 May 31, 2024

ghukill left a comment

ghukill May 31, 2024

ghukill May 31, 2024

		citation := cls.create_string_from_mixed_value(
		citation_element, " ", ["head"]

Timx 287 ead field method refactor 1 #185

Timx 287 ead field method refactor 1 #185

Conversation

jonavellecuerdo commented May 30, 2024 • edited Loading

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

ghukill left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehanson8 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonavellecuerdo May 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghukill May 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonavellecuerdo May 30, 2024 • edited Loading

Choose a reason for hiding this comment

ghukill May 30, 2024 • edited Loading

Choose a reason for hiding this comment

jonavellecuerdo May 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehanson8 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghukill left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonavellecuerdo commented May 30, 2024 •

edited

Loading

jonavellecuerdo May 31, 2024 •

edited

Loading

ghukill May 30, 2024 •

edited

Loading

jonavellecuerdo May 30, 2024 •

edited

Loading

ghukill May 30, 2024 •

edited

Loading

jonavellecuerdo May 30, 2024 •

edited

Loading