-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 232 springshare ids #101
Conversation
Why these changes are being introduced: After some analysis of transformed Springshare records, it was discovered that some timdex_record_ids were undesirable / poorly formed. For a subset of records, the dc:identifier value was a PHP URL pattern, that while a functional link, did not result in a good timdex record identifier. Previously, the source record id and the source base url were tightly coupled in the transformer class, and could not be overridden by transformations. This meant identifiers for the record must be concate-able with the base url. In this situation, this was not possible for all records in the set. How this addresses that need: * added two new methods in the base transformer class, get_source_link() and get_timdex_record_id() * update springshare transform to utilize the get_source_link() to generate a valid URL * update springshare transform to fallback on the more widely used pattern of using part of the OAI identifier as the timdex record id Side effects of this change: Two new OPTIONAL methods are available for transformers, get_source_link() and get_timdex_record_id(). Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-232
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two comments
|
||
@classmethod | ||
def get_source_link( | ||
cls, source_base_url: str, source_record_id: str, xml: Tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need xml
as a parameter here if it's not being used? Couldn't the method just be overridden in the source transforms with an xml
parameter when it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, and I had also paused on this. I think if these are getting decoupled with the understanding transformers may override them, it's plausible they might need the document XML, so I included here.
Because the base transform calls this method, it will always use the same signature, so a sub-class transformer couldn't add xml:Tag
on their own.
We could adopt a kwargs
approach, where the sub-class transformer could pass anything they want as named arguments, but we'd lose the typing this provides. Other than xml:Tag
, it's hard to guess what they may want to pass?
See pros and cons both ways. My thinking was to model these after other methods that are designed to overridden, like get_valid_title()
or get_source_record_id()
which both pass xml:Tag
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if helpful, we could add the docstring comment:
xml: A BeautifulSoup Tag representing a single XML record.
- not used by default implementation, but could be useful for subclass
overrides
which get_timdex_record_id()
contains below. I had actually intended to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just add the comment, I did some more reading and see your point on losing the type hinting
source_link = self.get_source_link(self.source_base_url, source_record_id, xml) | ||
timdex_record_id = self.get_timdex_record_id(self.source, source_record_id, xml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: you could call these directly in the dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change if you'd like, and I considered that too. But at the risk of a being a bit more verbose, I thought it helped break it up a bit. There is some important stuff happening here -- establishing record title, primary timdex id, link, etc. -- and I found it helpful to see those steps as distinct statements vs embedded in a dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional is always just a suggestion 🙂
What does this PR do?
This PR adds two new OPTIONAL methods to the base transform:
get_source_link()
get_timdex_record_id()
For transformations that require it, this allows for decoupling the logic of constructing the timdex record id from the source link. This directly supports these new sources, and it's believed will also be helpful for future sources where this tight coupling may not be ideal.
Dependencies are also updated, with slight type check updates for new
flake8
requirements inead
transformation.Helpful background context
While analyzing the transformed records from new
libguides
andresearchdatabases
sources, it was found that some timdex record ids were poorly formed, e.g.libguides:c.php?g=175847
.The original approach was splitting the
dc:identifier
and using that for the timdex record id and concatenating with the source base url. However, in a small subset, this resulted in poorly formed ids like the one above.Given the terse nature of these source records, there is not a meaningful identifier in the record that could serve both the timdex record id and the source link.
How can a reviewer manually see the effects of these changes?
For the following fixture (showing subset of fields here):
Run a
libguides
transformation:Note the following fields are set:
Demonstrating the source link, due to the overriden method, is decoupled from the timdex id.
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/TIMX-232
Developer
Code Reviewer
(not just this pull request message)
Includes new or updated dependencies?
YES