Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 1st set of DspaceDim field methods #194

Merged
merged 2 commits into from
Jun 18, 2024
Merged

Conversation

ehanson8
Copy link
Contributor

@ehanson8 ehanson8 commented Jun 17, 2024

Purpose and background context

Refactors the first set of DspaceDim fields as separate field methods.

How can a reviewer manually see the effects of these changes?

Run the following command to see that the DspaceDim transform still transforms a source file and skips an invalid content type according to the Whoas transform:

pipenv run transform -i tests/fixtures/dspace/dspace_dim_records.xml -o output/dspace-transformed-records.json -s whoas

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Refactor DspaceDim to use field methods

How this addresses that need:
* Add create_dspace_dim_source_record_stub function to DspaceDim test module
* Rename param xml > source_record
* Add field methods and associated private methods for alternate_titles, citation, and content_type
* Refactor Whoas.get_content_type method to raise exception for more explicit flow control
* Add unit tests for new field methods

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-282
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all, I think it looks great! Left a couple comments about yielding + list extending, but I'm finding these really easy to parse and reason about. Nice work on the changes, and the PR chunking.

Comment on lines +251 to +254
yield timdex.Contributor(
value=str(creator.string),
kind="Creator",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious (and same for _get_contributors...() below), why the yield?

On one hand, I kind of like it: no need for list comprehensions, just yield Contributor objects as they are found. On the other hand: feels a bit different than other places?

FWIW, I do kind of like the yield? Wonder -- not pressing by any means -- if opportunity to move in this direction for other methods as we encounter them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually copying the pattern from Datacite, I believe we've used it in a few places but we could check for consistency when this refactor is done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those might make for a couple of nice PRs when the first pass is over: "Normalize yield..", "Normalize BS4 string casting...", etc. Though I think by and large things are pretty good.

transmogrifier/sources/xml/dspace_dim.py Outdated Show resolved Hide resolved
Comment on lines +189 to +198
alternate_titles = [
timdex.AlternateTitle(
value=str(alternate_title.string),
kind=alternate_title["qualifier"],
)
for alternate_title in source_record.find_all(
"dim:field", element="title", string=True
)
if alternate_title.get("qualifier")
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we ultimately keep BS4, or move into Xpath/lxml, I'm finding this pattern increasingly easy to read.

kind="coverage",
)
else:
yield timdex.Date(note=coverage_value, kind="coverage")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, more yields! Maybe I just missed them in other transforms. Digging this pattern.

transmogrifier/sources/xml/dspace_dim.py Outdated Show resolved Hide resolved
Comment on lines +372 to +381
timdex.Contributor(value="Jamerson, James", kind="Creator"),
timdex.Contributor(
value="LaFountain, James R.",
kind="author",
),
timdex.Contributor(
value="Oldenbourg, Rudolf",
kind="author",
),
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting now, but don't think we should make these changes in this refactor, but this feels like a good opportunity at some point to normalized kind values. Seeing Creator (title cased) and author (lowercase). I know author comes from the XML itself, and it's always risky to programatically convert case.

Maybe normalization is some kind of post-processing / post record transform? Again, not needed now, just noting for the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was almost certain we had a Jira ticket for this but we don't, I'll create an issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating the ticket. Maybe when this field method refactor is done, before we launch into orchestration, we could meet and survey the tickets created? Get a feel for anything we may want to also tackle during the orchestration? or after?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before sounds like a good idea!

if cls.valid_content_types(content_types):
return content_types
message = f'Record skipped based on content type: "{content_types}"'
raise SkippedRecordEvent(message, cls.get_source_record_id(source_record))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

* Remove list() from .extend calls
@ehanson8 ehanson8 requested a review from ghukill June 18, 2024 13:41
@ehanson8
Copy link
Contributor Author

@ghukill Had to add some type hinting in get_contributors after removing `list(...)

@ghukill
Copy link
Contributor

ghukill commented Jun 18, 2024

@ghukill Had to add some type hinting in get_contributors after removing `list(...)

Ahhh, didn't think of that. Please feel free to disregard the removal of list(...). As mentioned, really not super consequential, just feeling like a time to comment on patterns given the refactoring work. But both make sense, and maybe the linting / typing advantages outweigh anything else!

@ehanson8
Copy link
Contributor Author

ehanson8 commented Jun 18, 2024

I prefer it without list(...) so I'm fine with a occasional inter-method type hinting as a trade-off. I'll clean up the Datacite class in a separate commit in my next PR

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍


@classmethod
def get_contributors(cls, source_record: Tag) -> list[timdex.Contributor] | None:
contributors: list[timdex.Contributor] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. I'm also a little surprised that mypy needed this to understand what what was coming from those methods which are both clearly typed.

Comment on lines +251 to +254
yield timdex.Contributor(
value=str(creator.string),
kind="Creator",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those might make for a couple of nice PRs when the first pass is over: "Normalize yield..", "Normalize BS4 string casting...", etc. Though I think by and large things are pretty good.

@ehanson8 ehanson8 merged commit 00d2024 into main Jun 18, 2024
3 of 5 checks passed
@ehanson8 ehanson8 deleted the TIMX-282-dspacedim-fmr-origins branch June 18, 2024 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants