Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timx 232 springshare ids #101

Merged
merged 3 commits into from
Aug 7, 2023
Merged

Timx 232 springshare ids #101

merged 3 commits into from
Aug 7, 2023

Conversation

ghukill
Copy link
Contributor

@ghukill ghukill commented Aug 4, 2023

What does this PR do?

This PR adds two new OPTIONAL methods to the base transform:

  • get_source_link()
  • get_timdex_record_id()

For transformations that require it, this allows for decoupling the logic of constructing the timdex record id from the source link. This directly supports these new sources, and it's believed will also be helpful for future sources where this tight coupling may not be ideal.

Dependencies are also updated, with slight type check updates for new flake8 requirements in ead transformation.

Helpful background context

While analyzing the transformed records from new libguides and researchdatabases sources, it was found that some timdex record ids were poorly formed, e.g. libguides:c.php?g=175847.

The original approach was splitting the dc:identifier and using that for the timdex record id and concatenating with the source base url. However, in a small subset, this resulted in poorly formed ids like the one above.

Given the terse nature of these source records, there is not a meaningful identifier in the record that could serve both the timdex record id and the source link.

How can a reviewer manually see the effects of these changes?

For the following fixture (showing subset of fields here):

<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record>
    <header>
      <identifier>oai:libguides.com:guides/175846</identifier>  <!-- OAI identifier -->
      ...
    </header>
    <metadata>
      <oai_dc:dc>
        ...
        <dc:identifier>https://libguides.mit.edu/materials</dc:identifier>  <!-- DC identifier -->
        ...
      </oai_dc:dc>
    </metadata>
  </record>
</records>

Run a libguides transformation:

pipenv run transform -s libguides -i "tests/fixtures/oai_dc/springshare/libguides/libguides_record_all_fields.xml" -o "output/libguides.json"

Note the following fields are set:

timdex_record_id="libguides:guides-175846"
source_link="https://libguides.mit.edu/materials",

Demonstrating the source link, due to the overriden method, is decoupled from the timdex id.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-232

Developer

  • All new ENV is documented in README
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Includes new or updated dependencies?

YES

Why these changes are being introduced:
After some analysis of transformed Springshare records, it was discovered that some timdex_record_ids were undesirable / poorly formed.  For a subset of records, the dc:identifier value was a PHP URL pattern, that while a functional link, did not result in a good timdex record identifier.

Previously, the source record id and the source base url were tightly coupled in the transformer class, and could not be overridden by transformations.  This meant identifiers for the record must be concate-able with the base url.  In this situation, this was not possible for all records in the set.

How this addresses that need:
  * added two new methods in the base transformer class, get_source_link() and get_timdex_record_id()
  * update springshare transform to utilize the get_source_link() to generate a valid URL
  * update springshare transform to fallback on the more widely used pattern of using part of the OAI identifier as the timdex record id

Side effects of this change:
Two new OPTIONAL methods are available for transformers, get_source_link() and get_timdex_record_id().

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-232
@ghukill ghukill requested a review from ehanson8 August 4, 2023 13:28
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments


@classmethod
def get_source_link(
cls, source_base_url: str, source_record_id: str, xml: Tag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need xml as a parameter here if it's not being used? Couldn't the method just be overridden in the source transforms with an xml parameter when it's needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, and I had also paused on this. I think if these are getting decoupled with the understanding transformers may override them, it's plausible they might need the document XML, so I included here.

Because the base transform calls this method, it will always use the same signature, so a sub-class transformer couldn't add xml:Tag on their own.

We could adopt a kwargs approach, where the sub-class transformer could pass anything they want as named arguments, but we'd lose the typing this provides. Other than xml:Tag, it's hard to guess what they may want to pass?

See pros and cons both ways. My thinking was to model these after other methods that are designed to overridden, like get_valid_title() or get_source_record_id() which both pass xml:Tag.

Copy link
Contributor Author

@ghukill ghukill Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if helpful, we could add the docstring comment:

xml: A BeautifulSoup Tag representing a single XML record.
    - not used by default implementation, but could be useful for subclass
      overrides

which get_timdex_record_id() contains below. I had actually intended to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just add the comment, I did some more reading and see your point on losing the type hinting

Comment on lines +121 to +122
source_link = self.get_source_link(self.source_base_url, source_record_id, xml)
timdex_record_id = self.get_timdex_record_id(self.source, source_record_id, xml)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: you could call these directly in the dict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change if you'd like, and I considered that too. But at the risk of a being a bit more verbose, I thought it helped break it up a bit. There is some important stuff happening here -- establishing record title, primary timdex id, link, etc. -- and I found it helpful to see those steps as distinct statements vs embedded in a dictionary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional is always just a suggestion 🙂

@ghukill ghukill requested a review from ehanson8 August 4, 2023 20:46
@ghukill ghukill merged commit 4fcb617 into main Aug 7, 2023
3 checks passed
@ghukill ghukill deleted the TIMX-232-springshare-ids branch August 18, 2023 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants