Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More MITAardvark methods #110

Merged
merged 4 commits into from
Jan 5, 2024
Merged

More MITAardvark methods #110

merged 4 commits into from
Jan 5, 2024

Conversation

ehanson8
Copy link
Contributor

Helpful background context

This completes the MITAardvark class methods however, there will undoubtedly be updates as we begin testing the records in the full TIMDEX pipeline.

How can a reviewer manually see the effects of these changes?

Updates to the CLI are needed before this can be tested manually.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Includes new or updated dependencies?

NO

Why these changes are being introduced:
* More methods are needed to populate all of the TIMDEX fields used by the MITAardvark class

How this addresses that need:
* Add parse_geodata_string, parse_solr_date_range_string functions to helpers module with corresponding unit tests
* Add get_dates, get_identifiers, get_links, and get_locations methods along with calls in get_optional_fields and corresponding unit tests
* Update aardvark_record_all_fields fixture to include new fields
* Update Location.value field to add None default given it is an optional field

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-54
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looking good! I'm continuing to like the general patterns, and I find it really easy to follow along.

My comments and questions are mostly around syntax and style, as I was able to run the tests and confirm the outputs seem correct.

As you stated in the PR, I imagine we'll be revisiting both this and the GeoHarvester transforms some during a 2nd QA pass, so readability and ease of maintenence seems to be the name of the game at this point.

But I'm digging it! Continues to read well to me.

transmogrifier/sources/json/aardvark.py Outdated Show resolved Hide resolved
transmogrifier/sources/json/aardvark.py Show resolved Hide resolved
transmogrifier/sources/json/aardvark.py Outdated Show resolved Hide resolved
transmogrifier/sources/json/aardvark.py Outdated Show resolved Hide resolved
transmogrifier/helpers.py Outdated Show resolved Hide resolved
transmogrifier/helpers.py Outdated Show resolved Hide resolved
transmogrifier/helpers.py Show resolved Hide resolved
transmogrifier/helpers.py Show resolved Hide resolved
* Update aardvark_records fixture to reflect required fields and corresponding unit test
* Add gismit and gisogm sources to config.py
* Refactor parse_source_file methods to use smart_open
* Add get_timdex_record_id method for more specific ID processing
* Refactor get_dates to use new private methods for each date type
* Refactor get_locations to improve readability and error processing
* Shift parse_solr_date_range function to MITAardvark class method as well as corresponding unit tests
@ehanson8
Copy link
Contributor Author

ehanson8 commented Jan 3, 2024

@ghukill @jonavellecuerdo Pushed a new commit

@ghukill ghukill self-requested a review January 4, 2024 18:06
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Eric! I think this is looking good! Here are some initial comments. I'm probably going to do a second pass to get better acquainted with this repo! 😅

transmogrifier/sources/json/aardvark.py Outdated Show resolved Hide resolved
transmogrifier/helpers.py Show resolved Hide resolved
@@ -70,6 +91,7 @@ def get_optional_fields(self, source_record: dict) -> dict | None:
fields["contributors"] = self.get_contributors(source_record) or None

# dates
fields["dates"] = self.get_dates(source_record, source_record_id) or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what is the definition for the dates field of a TIMDEX record?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Date class is defined here. In general, it's for any type of date, with a kind qualifier strongly preferred

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, changes looking real good. Addressed previous comments and discussions for me.

Requested a couple of new changes inline where possible.

One more that doesn't lend itself to inline comment is around the TimdexRecord.source_link field.

If I'm reading correctly, this is set by the JSONTransformer (as the Aardvark sub-class does not override) to be:

return source_base_url + source_record_id

As discussed, this is the first instance in TIMDEX where the "source link" will likely actually point to the item page in the TIMDEX UI. While we may eventually have something like geoweb.libraries.mit.edu we do currently have search.libraries.mit.edu. I would propose that we use that + the identifier to create this source link. That would provide a functional URL for the TIMDEX record, pointing back to the item page in the UI, ultimately something along the lines of:

https://search.libraries.mit.edu/record/gismit:abc123

The only wrinkle here: that base URL changes between Dev1/Stage/Prod, so defining once in the source config dictionary might not be an option. For the time being, perhaps the base-url in those configurations could be https://search.libraries.mit.edu/record/?

UPDATE: there are environment specific search UIs we could use:

  • Dev1: https://timdex-ui-pipeline-dev.herokuapp.com/
  • Stage: https://timdex-ui-pipeline-stage.herokuapp.com/
  • Prod: https://search.libraries.mit.edu/

UPDATE 2: given some discussions on slack, maybe just hardcoding the prod URL in the source config dictionary is a good solutuion for now?

transmogrifier/sources/json/aardvark.py Outdated Show resolved Hide resolved
transmogrifier/sources/transformer.py Outdated Show resolved Hide resolved
transmogrifier/sources/json/aardvark.py Show resolved Hide resolved
* Update get_source_record_id method and docstring
* Update open calls to smart_open in transformer.py
* Update base-url for gis source in config.py
@ehanson8 ehanson8 requested a review from ghukill January 4, 2024 19:46
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Comment on lines +106 to +115
"gismit": {
"name": "MIT GIS Resources",
"base-url": "https://search.libraries.mit.edu/record/",
"transform-class": "transmogrifier.sources.json.aardvark.MITAardvark",
},
"gisogm": {
"name": "OpenGeoMetadata GIS Resources",
"base-url": "https://search.libraries.mit.edu/record/",
"transform-class": "transmogrifier.sources.json.aardvark.OGMAardvark",
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

@ehanson8 ehanson8 force-pushed the GDT-54-more-MITAardvark-methods branch from b990ab4 to f5c8331 Compare January 4, 2024 22:08
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Comment on lines +51 to +53
return source_base_url + cls.get_timdex_record_id(
"gismit", source_record_id, source_record
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. Had thought we don't need to call another method, as we know the logic there, but this was ensures it's tied to that logic should it change.

@ehanson8 ehanson8 merged commit a941359 into main Jan 5, 2024
3 of 5 checks passed
@ehanson8 ehanson8 deleted the GDT-54-more-MITAardvark-methods branch April 23, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants