-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 283 oaidc field method refactor #177
Timx 283 oaidc field method refactor #177
Conversation
tests/sources/xml/test_oai_dc.py
Outdated
xml = next(transformer_instance.source_records) | ||
assert transformer_instance.get_dates("test_source_record_id", xml) == [ | ||
timdex.Date(kind="Unknown", note=None, range=None, value="2008-06-19T17:55:27") | ||
assert OaiDc("cool-repo", source_records).get_content_type() == ["cool-repo"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the content_type
is derived from the source
instance variable, OaiDc
must be instantiated for the test.
tests/sources/xml/test_oai_dc.py
Outdated
def test_get_identifiers_transforms_correctly_if_fields_blank( | ||
oai_dc_record_optional_fields_blank, | ||
): | ||
oai_dc_record_optional_fields_blank.header.identifier.clear() | ||
assert OaiDc.get_identifiers(oai_dc_record_optional_fields_blank) == [] | ||
|
||
|
||
def test_get_identifiers_transforms_correctly_if_fields_missing( | ||
oai_dc_record_optional_fields_missing, | ||
): | ||
oai_dc_record_optional_fields_missing.header.identifier.decompose() | ||
assert OaiDc.get_identifiers(oai_dc_record_optional_fields_missing) == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what you think of this setup. 🤔 The identifiers are derived from a child of the header
element, so using the stub (as it is written) or the oai_dc_record_*
fixtures directly was not possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, good call out. This is probably one of the inherent issues with the stubs.
Though I agree this works, another option could be modifying the function that creates the stub to one that allows both header overrides and actual content?
Maybe that's a strong argument for keeping these stubs in the test suite for that source itself vs the higher level abstraction proposed in this ticket.
Example:
def create_oaidc_source_record_stub(
header_identifier:str = "oai:libguides.com:guides/123456", #<------- new
xml_insert: str = ""
) -> BeautifulSoup:
xml_str = f"""
<records>
<record>
<header>
<identifier>{header_identifier}</identifier> #<---------- templated here
<datestamp>2023-05-31T19:49:21Z</datestamp>
<setSpec>guides</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
{xml_insert}
</oai_dc:dc>
</metadata>
</record>
<records
"""
return BeautifulSoup(xml_str, "xml")
Just a thought exercise though, as I agree the method used above works well too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update from @ghukill : Whole header
element should be editable via param to stub method.
cd5569a
to
dd69c7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some very minor suggestions
tests/sources/xml/test_oai_dc.py
Outdated
def test_get_identifiers_transforms_correctly_if_fields_blank( | ||
oai_dc_record_optional_fields_blank, | ||
): | ||
oai_dc_record_optional_fields_blank.header.identifier.clear() | ||
assert OaiDc.get_identifiers(oai_dc_record_optional_fields_blank) == [] | ||
|
||
|
||
def test_get_identifiers_transforms_correctly_if_fields_missing( | ||
oai_dc_record_optional_fields_missing, | ||
): | ||
oai_dc_record_optional_fields_missing.header.identifier.decompose() | ||
assert OaiDc.get_identifiers(oai_dc_record_optional_fields_missing) == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that works
And let's resolve the |
a3ee3cf
to
45a75bc
Compare
return str(xml.find("dc:identifier").string) | ||
if source_link := source_record.find("dc:identifier", string=True): | ||
return str(source_link.string) | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the method would throw an attribute error if dc:identifier
did not appear in the XML:
AttributeError: 'NoneType' object has no attribute 'string'
Given that source_link
is a required field, raising an error seems to be the right approach, but is there a cleaner way to handle this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your instinct to use SkippedRecordEvent
here is correct, @ghukill your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. If we can't pull a dc:identifier
value, then we don't have a source link to send people to, and it feels like it's not worth keeping the record (punting the awareness of this to later observability/metrics discussions).
So yes, raising a SkippedRecordEvent
feels like the right move!
@ghukill @ehanson8 The PR has now includes required updates to the Note(s):
|
45a75bc
to
a05fc00
Compare
tests/sources/xml/test_oai_dc.py
Outdated
from bs4 import BeautifulSoup | ||
|
||
import transmogrifier.models as timdex | ||
from transmogrifier.sources.xml.oaidc import OaiDc | ||
|
||
|
||
def create_oai_dc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: | ||
def create_oaidc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the default, I'll update my PR with that. Actually, let's just use the default with the new global stub function that will be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had agreed with the comment until the strikethrough, lolz!
See this comment about how maybe keeping the function closer to the tests is handy, given we can modify the signature to set values as needed in the stub. Maybe good fodder for our discussion today.
To me, feels like either approach is still kind of on the table.
) | ||
|
||
|
||
def test_get_dates_transforms_correctly_and_logs_error_if_date_invalid( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect example of an edge case test!
@@ -51,16 +52,16 @@ def get_dates(self, source_record_id: str, xml: Tag) -> list[timdex.Date]: | |||
) | |||
return dates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in the other PR, this return
can be updated with or None
return str(xml.find("dc:identifier").string) | ||
if source_link := source_record.find("dc:identifier", string=True): | ||
return str(source_link.string) | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your instinct to use SkippedRecordEvent
here is correct, @ghukill your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting some comments now for our discussion, with the intent of completing a further review later.
return str(xml.find("dc:identifier").string) | ||
if source_link := source_record.find("dc:identifier", string=True): | ||
return str(source_link.string) | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. If we can't pull a dc:identifier
value, then we don't have a source link to send people to, and it feels like it's not worth keeping the record (punting the awareness of this to later observability/metrics discussions).
So yes, raising a SkippedRecordEvent
feels like the right move!
tests/sources/xml/test_oai_dc.py
Outdated
from bs4 import BeautifulSoup | ||
|
||
import transmogrifier.models as timdex | ||
from transmogrifier.sources.xml.oaidc import OaiDc | ||
|
||
|
||
def create_oai_dc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: | ||
def create_oaidc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had agreed with the comment until the strikethrough, lolz!
See this comment about how maybe keeping the function closer to the tests is handy, given we can modify the signature to set values as needed in the stub. Maybe good fodder for our discussion today.
To me, feels like either approach is still kind of on the table.
transmogrifier/sources/xml/oaidc.py
Outdated
|
||
# languages: not set in this transformation | ||
|
||
# links | ||
fields["links"] = self.get_links(source_record_id, xml) or None | ||
fields["links"] = self.get_links(source_record, source_record_id) or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to pass source_record_id
, if the source_record
contains it?
And/or, could the method get_links()
just call the other method get_source_record_id()
again if not?
Feeling like maybe this is touched on in other comments. Full disclaimer: I wrote these transforms fairly early on, and recall they had requirements that deviated from previous Transmog sources (as I think you've commented on as well @jonavellecuerdo, like the identifier vs source link). So it feels possible that now is a good time to revisit some of those awkward edges. Silver lining: these are not yet live, so I think we're fairly free to modify them somewhat heavily if needed.
a05fc00
to
8b946ff
Compare
@ghukill @ehanson8 Please see the latest three (3) commits for updates based on our sync yesterday! See Summary of changes via Field Method Refactor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, just a few more suggestions!
""" | ||
<dc:creator>Ye Li</dc:creator> | ||
""" | ||
metadata_insert=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on named args!
tests/sources/xml/test_oai_dc.py
Outdated
): | ||
oai_dc_record_all_fields.header.identifier.clear() | ||
assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == "" | ||
def test_get_source_record_id_transforms_properly_if_fields_blank(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to indicate it's raising the exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_blank()
.
tests/sources/xml/test_oai_dc.py
Outdated
): | ||
oai_dc_record_all_fields.header.identifier.decompose() | ||
assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == "" | ||
def test_get_source_record_id_transforms_properly_if_fields_missing(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_missing
.
"The 'identifier' was either missing from the header element or blank." | ||
), | ||
): | ||
SpringshareOaiDc.get_source_link("", "", source_record=source_record) | ||
|
||
|
||
def test_get_source_link_transforms_correctly_if_required_fields_missing(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another rename to describe exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_link_raises_skipped_record_event_if_required_fields_blank
and test_get_source_link_raises_skipped_record_event_if_required_fields_missing
.
tests/sources/xml/test_oai_dc.py
Outdated
""" | ||
), | ||
) | ||
assert OaiDc.get_dates(source_record=source_record) == [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I think named args for a single param method might be overkill 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My rule of thumb -- like anything, open to exceptions -- is use positional when the function signature is positional, and use named when the function signature is named. Keeps it simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill I'll follow this convention moving forward. I updated the function calls in tests to use positional arguments to match function signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, thinking it's going well! Appreciating that OaiDc
and Springshare
are kind of finicky, given there is heavy inheritance implications.
Left a couple of comments / questions.
message = ( | ||
"Record skipped because 'source_record_id' could not be derived. " | ||
"The 'identifier' was either missing from the header element or blank." | ||
) | ||
raise SkippedRecordEvent(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether or not we end up layering in more granular exceptions, I think this is a great exception message. Ignoring when or how this is logged, this does explain why this record would be skipped.
One thing we might want to consider is how to control these messages so we could group them, even across sources. I think that's where more granular Exception
types could come in. Kind of a balance/art between the Exception
type, and the message. Will be easy to group and log and store metrics by exception type, but the string itself becomes most helpful likely for deep debugging a single record.
@@ -69,15 +73,20 @@ def get_links(self, source_record_id: str, xml: Tag) -> list[timdex.Link] | None | |||
url=str(identifier.string), | |||
) | |||
) | |||
|
|||
# [TODO]: Message is logged even if the condition is met. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - was this intended to be left in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. I think this is one of the logging messages that this issue refers to. I'll add a reference to this line of code as part of the issue and remove the comment to avoid linting errors.
|
||
def get_links(self, source_record_id: str, xml: Tag) -> list[timdex.Link] | None: | ||
def get_links(self, source_record: Tag) -> list[timdex.Link] | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be an instance method because SpringshareOaiDc.get_links
uses the instance attribute source_name
in the derivation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - thanks for the explanation! It's altogether possible that I introduced that when I first worked on these. I feel like there is a bit of code smell there, and I'm probably to blame, but feels like maybe it's worth revisiting during the 2nd orchestration phase of this work.
header_insert=( | ||
""" | ||
<identifier>oai:libguides.com:guides/175846</identifier> | ||
""" | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem harmful here, but I'm wondering if the header_insert
is needed? Wouldn't that only be needed for tests that are looking to pull the identifier from it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is required by the logging that uses source_record_id
; OaiDc.get_source_record_id()
derives a value that pulls from the 'identifier' subelement of the 'header' element.
32ce150
to
dde85aa
Compare
dde85aa
to
445448f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, but remove references to source_record_id
from the docstrings
source_record_id: Source record id | ||
xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
source_record_id: Source record id | ||
xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
transmogrifier/sources/xml/oaidc.py
Outdated
source_record_id: Source record id | ||
xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
445448f
to
3534a01
Compare
Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Refactor base transform class OaiDc to contain get_* methods for optional fields * Make 'source_record' the first argument in OaiDc field methods * Move 'or None' to the return statements of field methods * Remove 'source_record_id' as param, replace with call inside field method instead * Raise SkippedRecordEvent * OaiDc.get_source_record_id * SpringshareOaiDc.get_source_link * Update unit tests * Remove OaiDc and SpringshareOaiDc record_* fixtures Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-283
3534a01
to
537f6ec
Compare
Purpose and background context
Field method refactor for base transform class
OaiDc
.Note: Linters are failing for
transmogrifiers.sources.xml.springshare
, which will be addressed in a later PR!How can a reviewer manually see the effects of these changes?
make test
and verify all tests are passing.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)