-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More MITAardvark methods #110
Conversation
Why these changes are being introduced: * More methods are needed to populate all of the TIMDEX fields used by the MITAardvark class How this addresses that need: * Add parse_geodata_string, parse_solr_date_range_string functions to helpers module with corresponding unit tests * Add get_dates, get_identifiers, get_links, and get_locations methods along with calls in get_optional_fields and corresponding unit tests * Update aardvark_record_all_fields fixture to include new fields * Update Location.value field to add None default given it is an optional field Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-54
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looking good! I'm continuing to like the general patterns, and I find it really easy to follow along.
My comments and questions are mostly around syntax and style, as I was able to run the tests and confirm the outputs seem correct.
As you stated in the PR, I imagine we'll be revisiting both this and the GeoHarvester transforms some during a 2nd QA pass, so readability and ease of maintenence seems to be the name of the game at this point.
But I'm digging it! Continues to read well to me.
* Update aardvark_records fixture to reflect required fields and corresponding unit test * Add gismit and gisogm sources to config.py * Refactor parse_source_file methods to use smart_open * Add get_timdex_record_id method for more specific ID processing * Refactor get_dates to use new private methods for each date type * Refactor get_locations to improve readability and error processing * Shift parse_solr_date_range function to MITAardvark class method as well as corresponding unit tests
@ghukill @jonavellecuerdo Pushed a new commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Eric! I think this is looking good! Here are some initial comments. I'm probably going to do a second pass to get better acquainted with this repo! 😅
@@ -70,6 +91,7 @@ def get_optional_fields(self, source_record: dict) -> dict | None: | |||
fields["contributors"] = self.get_contributors(source_record) or None | |||
|
|||
# dates | |||
fields["dates"] = self.get_dates(source_record, source_record_id) or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what is the definition for the dates
field of a TIMDEX record?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Date
class is defined here. In general, it's for any type of date, with a kind
qualifier strongly preferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, changes looking real good. Addressed previous comments and discussions for me.
Requested a couple of new changes inline where possible.
One more that doesn't lend itself to inline comment is around the TimdexRecord.source_link
field.
If I'm reading correctly, this is set by the JSONTransformer
(as the Aardvark
sub-class does not override) to be:
return source_base_url + source_record_id
As discussed, this is the first instance in TIMDEX where the "source link" will likely actually point to the item page in the TIMDEX UI. While we may eventually have something like geoweb.libraries.mit.edu
we do currently have search.libraries.mit.edu
. I would propose that we use that + the identifier to create this source link. That would provide a functional URL for the TIMDEX record, pointing back to the item page in the UI, ultimately something along the lines of:
https://search.libraries.mit.edu/record/gismit:abc123
The only wrinkle here: that base URL changes between Dev1/Stage/Prod, so defining once in the source config dictionary might not be an option. For the time being, perhaps the base-url
in those configurations could be https://search.libraries.mit.edu/record/
?
UPDATE: there are environment specific search UIs we could use:
- Dev1:
https://timdex-ui-pipeline-dev.herokuapp.com/
- Stage:
https://timdex-ui-pipeline-stage.herokuapp.com/
- Prod:
https://search.libraries.mit.edu/
UPDATE 2: given some discussions on slack, maybe just hardcoding the prod URL in the source config dictionary is a good solutuion for now?
* Update get_source_record_id method and docstring * Update open calls to smart_open in transformer.py * Update base-url for gis source in config.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
"gismit": { | ||
"name": "MIT GIS Resources", | ||
"base-url": "https://search.libraries.mit.edu/record/", | ||
"transform-class": "transmogrifier.sources.json.aardvark.MITAardvark", | ||
}, | ||
"gisogm": { | ||
"name": "OpenGeoMetadata GIS Resources", | ||
"base-url": "https://search.libraries.mit.edu/record/", | ||
"transform-class": "transmogrifier.sources.json.aardvark.OGMAardvark", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
b990ab4
to
f5c8331
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
return source_base_url + cls.get_timdex_record_id( | ||
"gismit", source_record_id, source_record | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach. Had thought we don't need to call another method, as we know the logic there, but this was ensures it's tied to that logic should it change.
Helpful background context
This completes the
MITAardvark
class methods however, there will undoubtedly be updates as we begin testing the records in the full TIMDEX pipeline.How can a reviewer manually see the effects of these changes?
Updates to the CLI are needed before this can be tested manually.
What are the relevant tickets?
Developer
Code Reviewer
(not just this pull request message)
Includes new or updated dependencies?
NO