-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 287 ead field method refactor 1 #185
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think it looks great! Left a request for docstrings in the helper methods, and a question, but overall looking good to me.
@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None: | |||
return None | |||
|
|||
@classmethod | |||
def get_main_titles(cls, xml: Tag) -> list[str]: | |||
def _get_collection_description(cls, source_record: Tag) -> Tag: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose a small docstring here just to explain that it's widely used for TIMDEX field values, and therefore makes sense to define this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstrings have been added!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
raise SkippedRecordEvent(message) | ||
|
||
@classmethod | ||
def _get_collection_description_did(cls, source_record: Tag) -> Tag: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: small docstring could be helpful to explain why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstrings have been added!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good, a few suggestions!
tests/sources/xml/test_ead.py
Outdated
@@ -1,10 +1,61 @@ | |||
# ruff: noqa: E501 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just a personal opinion but I think it's preferable to not ignore E501 for a module (or even a line really). For the xml_string
var in create_ead_source_record_stub
, you can use spaces instead of tabs and you can break up the long citation strings like this:
assert Ead.get_citation(source_record) == (
"Charles J. Connick Stained Glass Foundation Collection, "
"VC-0002, box X. Massachusetts Institute of Technology, "
"Department of Distinctive Collections, Cambridge, Massachusetts."
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! On a somewhat related note (and is also why I hesitated to reformat the strings):
When I save on an open XML file, I have a tool in my IDE that reformats the XML. It results in newlines being added throughout the XML file, which then causes the tests to fail.
@ehanson8 @ghukill Do you think it's worth creating an issue to whitespace/newline-proof the XML transforms? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd probably want to preserve whitespace inside an element (its text), as this is often intentional by the record creator. I think it would fundamentally change the nature of the content if we introduce newline characters inside a long string of text for an element.
But reformatting the XML in a way that just moves elements around, but not element text, that seems okay / good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll forego creating an issue for now as the newlines are resulting from me not turning off the XML formatter in my IDE 😅 .
Circling back to original comment, I removed the syntax to ignore E501
errors and made updates accordingly!
assert ("has a date that couldn't be parsed: 'undated'") in caplog.text | ||
assert ("has a later start date than end date: '2001', '1999'") in caplog.text | ||
|
||
|
||
def test_ead_record_correct_identifiers_from_multiple_unitid(caplog): | ||
def test_ead_transform_with_multiple_unitid_gets_valid_ids(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ead
has more of these edge case tests than the other sources we've done so far, we might consider redoing these with stub records but not sure if that's worth the effort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 For this PR, I'll stick to just renaming the test and will revisit when I work on the get_identifiers
field method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's totally fine, just food for thought!
tests/sources/xml/test_ead.py
Outdated
<archdesc level="collection"> | ||
<did> | ||
<unittitle> | ||
Charles J. Connick Stained Glass | ||
<emph> | ||
Foundation | ||
<emph>Collection</emph> | ||
</emph> | ||
<num>VC.0002</num> | ||
</unittitle> | ||
<unittitle></unittitle> | ||
<unittitle></unittitle> | ||
</did> | ||
</archdesc> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated a lot, you might refactor so top level elements like <archdesc>
aren't repeated in every test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 @ghukill Let me know what you think of the updates I made to the create_ead_source_record_stub
method. It was the only way I could think of simplifying some of the metadata_insert
arguments throughout the testing module (removes the <archdesc>
and <archdesc><did>
elements from the string.
In general, we want to avoid complex functions in the testing module, but maybe this is a simple enough method that it isn't a problem? 🤔
transmogrifier/sources/xml/ead.py
Outdated
citation := cls.create_string_from_mixed_value( | ||
citation_element, " ", ["head"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to start using named args for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was considering this but felt conflicted with @ghukill 's comment on a previous PR, but perhaps the guideline @ghukill is only applicable for "methods with single positional arguments" or "when testing"?
In any case, I will update to use named args! I think create_string_from_mixed_value
is nifty but is difficult to understand just by the name alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify my comment a bit...
If I'm reading correctly, this is the signature:
def create_string_from_mixed_value(
cls,
xml_element: Tag,
separator: str = "",
skipped_elements: list[str] | None = None,
)...
So anytime called, I'd expect to see:
- required, positional argument
xml_element
is not named, but just passed positionally - the rest of them, passed as name arguments, because named in the signature
- unless the defaults are okay, and then okay to omit
Apologies if I over-complicated anywhere, but I think it really boils down to:
- if positional argument, pass as positional
- if named argument, passed as named
Does that mesh with the discussion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this case, I'd expect:
cls.create_string_from_mixed_value(
citation_element,
separator = " ",
skipped_elements = ["head"]
)
Open to pushback, but this convention feels like a pretty widely followed convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think my confusion was I took "named arguments" (or keyword arguments) literally. 🤔 Technically, all the arguments are positional since the signature isn't as follows:
def create_string_from_mixed_value(
cls,
xml_element: Tag,
*,
separator: str = "",
skipped_elements: list[str] | None = None,
)
I'm definitely on board with the proposed convention; just curious if it's more accurate to say:
use keyword arguments in the method call when (1) overriding positional arguments with default values and (2) there are multiple arguments in the function signature
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading it right, this is the actual function definition:
@classmethod
def create_string_from_mixed_value(
cls,
xml_element: Tag,
separator: str = "",
skipped_elements: list[str] | None = None,
) -> str:
...
Apologies if I'm being imprecise with my words here or earlier, but I would characterize the arguments like:
cls
: required, placeholder, can kind of ignorexml_element
: required, positionalseparator
: because it has equals sign=
, it's named, optional, with defaultskipped_elements
: same as separator
I take your point that technically all the arguments are positional, good point. But I suppose my position is that conventionally -- and this is open to debate for sure -- if an argument has an equals sign =
than it's treated as a named argument, and therefore optional with a default value.
This SO post breaks it down nicely, and exposes that I have been lazy with my terminology.
If we ever do deal with functions that use *args
or **kwargs
, then I think it gets more subtle (to your example above), but for most signatures, I feel like it's often positional (no defaults) and then named/keyword (with defaults). My vote is to:
- not use the arg name for positionals
- do use the arg name for named/keyword
Thoughts? or am I just making a mountain out of a mole hill 😅?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for clarifying! I accept / understand your definitions and we are on the same page! I do wonder whether we should start using *
or /
in our function signatures to more explicitly distinguish positional vs named/keyword arguments (but that's beyond the scope of this PR 🤓). If I remember correctly, Ruff throws errors for boolean positional arguments (FBT001).
Circling back to original comment: changes have been made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That feels right to me but I honestly am learning some of these conventions in these PR conversations 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversation is the first time I've seen the /
operator for function/method signatures, and helped clarify for me what the *
operator does without an immediate following variable name like *args
.
This helped me understand it more tacitly:
In [1]: def foo(a, b, *, c="horse"):
...: print(a,b,c)
...:
In [2]: foo(42, 101, 999, 888, c='zebra')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 foo(42, 101, 999, 888, c='zebra')
TypeError: foo() takes 2 positional arguments but 4 positional arguments (and 1 keyword-only argument) were given
In [3]: def bar(a, b, *args, c="horse"):
...: print(a,b,args,c)
...:
In [4]: bar(42, 101, 999, 888, c='zebra')
42 101 (999, 888) zebra
I'm definitely not opposed to using *
and /
in signatures if we find them useful. Here is an edge case where I could see it being useful, when you want to force keyword argument or no positional:
def baz(*, a=1):
pass
In [6]: baz()
1
In [7]: baz(42)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 baz(42)
TypeError: baz() takes 0 positional arguments but 1 was given
In [8]: baz(a=101)
101
Ineresting stuff! In cases without *
or /
, glad we're honing in on some conventions, and curious if we ever run into situations where it would be nice to use them. I read an article this morning that I found pretty convincing that the main appeal is often APIs, where you want to finely control how functions are called by users potentially not familiar with the library.
b8adf19
to
2c8339b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me and I'm glad this PR led to some learning for all of us! 🙂
header_insert: str = "", metadata_insert: str = "" | ||
header_insert: str = "", | ||
metadata_insert: str = "", | ||
parent_element: Literal["archdesc", "did"] = None, # noqa: RUF013 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't too complex in my mind and saves space in the unit tests but I defer to @ghukill for final approval of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Thanks again for good discussions. I'm glad in these early updates, we're hashing out some of these conventions and details.
@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None: | |||
return None | |||
|
|||
@classmethod | |||
def get_main_titles(cls, xml: Tag) -> list[str]: | |||
def _get_collection_description(cls, source_record: Tag) -> Tag: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
For EAD-formatted XML, all of the information about a collection | ||
that is relevant to the TIMDEX data model is | ||
encapsulated within the <archdesc level="collection"> element. | ||
If this element is missing, it suggests that there is a structural | ||
error in the record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks.
Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Rename param 'xml' to 'source_record' * Add private methods for retrieving key elements: <archdesc>, <archdesc><did>, <controlaccess> * Raise SkippedRecordEvent when <archdesc> and <archdesc><did> are missing * Add field methods for alternate_titles, citation, content_type * Use named args for 'create_list/string_from_mixed_value' * Update tests * Rename tests * Add method to create source records from EAD-formatted XML for unit tests * Add tests for field methods alternate_titles, main_titles, citation, content_type * Avoid linting error E501 Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-287
2c8339b
to
714e2cc
Compare
Purpose and background context
Field method refactor for transform class Ead (Part 1).
<archdesc>
,<archdesc><did>
,<archdesc><controlaccess>
alternate_titles
,citation
,content_type
Note: For the time-being, I've added calls to the private methods at the beginning of
get_optional_fields()
, which is required by field derivations that will eventually be moved into their own field methods.How can a reviewer manually see the effects of these changes?
make test
and verify all unit tests are passing.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)