Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timx 232 springshare ids #101

Merged
merged 3 commits into from
Aug 7, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 29 additions & 29 deletions Pipfile.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions tests/test_springshare.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
LIBGUIDES_BLANK_OR_MISSING_OPTIONAL_FIELDS_TIMDEX = timdex.TimdexRecord(
source="LibGuides",
source_link="https://libguides.mit.edu/materials",
timdex_record_id="libguides:materials",
timdex_record_id="libguides:guides-175846",
title="Materials Science & Engineering",
citation="Materials Science & Engineering. libguides. "
"https://libguides.mit.edu/materials",
Expand All @@ -33,7 +33,7 @@
RESEARCHDATABASES_BLANK_OR_MISSING_OPTIONAL_FIELDS_TIMDEX = timdex.TimdexRecord(
source="Research Databases",
source_link="https://libguides.mit.edu/llba",
timdex_record_id="researchdatabases:llba",
timdex_record_id="researchdatabases:az-65257807",
title="Linguistics and Language Behavior Abstracts (LLBA)",
citation="Linguistics and Language Behavior Abstracts (LLBA). researchdatabases. "
"https://libguides.mit.edu/llba",
Expand Down Expand Up @@ -94,7 +94,7 @@ def test_libguide_transform_with_all_fields_transforms_correctly():
assert next(output_records) == timdex.TimdexRecord(
source="LibGuides",
source_link="https://libguides.mit.edu/materials",
timdex_record_id="libguides:materials",
timdex_record_id="libguides:guides-175846",
title="Materials Science & Engineering",
citation="Ye Li. Materials Science & Engineering. MIT Libraries. libguides. "
"https://libguides.mit.edu/materials",
Expand Down Expand Up @@ -154,7 +154,7 @@ def test_research_databases_transform_with_all_fields_transforms_correctly():
assert next(output_records) == timdex.TimdexRecord(
source="Research Databases",
source_link="https://libguides.mit.edu/llba",
timdex_record_id="researchdatabases:llba",
timdex_record_id="researchdatabases:az-65257807",
title="Linguistics and Language Behavior Abstracts (LLBA)",
citation="Linguistics and Language Behavior Abstracts (LLBA). "
"researchdatabases. https://libguides.mit.edu/llba",
Expand Down
4 changes: 2 additions & 2 deletions transmogrifier/sources/ead.py
Original file line number Diff line number Diff line change
Expand Up @@ -445,8 +445,8 @@ def parse_mixed_value(
"""
if skipped_elements is None:
skipped_elements = []
if type(item) == NavigableString and item.strip():
if isinstance(item, NavigableString) and item.strip():
yield str(item.strip())
elif type(item) == Tag and item.name not in skipped_elements:
elif isinstance(item, Tag) and item.name not in skipped_elements:
for child in item.children:
yield from cls.parse_mixed_value(child, skipped_elements)
34 changes: 20 additions & 14 deletions transmogrifier/sources/springshare.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,25 +80,31 @@ def get_links(self, source_record_id: str, xml: Tag) -> Optional[List[timdex.Lin
]

@classmethod
def get_source_record_id(cls, xml: Tag) -> str:
def get_source_link(
cls, source_base_url: str, source_record_id: str, xml: Tag
) -> str:
"""
Get the source record ID from a Springshare OAI DC XML record.
Override for default source_link behavior.

Overrides metaclass get_source_record_id() method.
Springshare resources contain the source link in their dc:identifier fields.
However, this cannot be reliably split and combined with the source base url,
as this either provides poorly formed timdex record ids OR source links that
do not work.

The URL path of the Springshare resource is used as the source record id, which
results in a timdex record id like "libguides:materials" or
"researchdatabases:llba". This is preferred over the OAI-PMH identifier, a
numeric value, which cannot be used to construct an accessible source link.
Example libguides OAI identifier and <dc:identifier>:
- oai:libguides.com:guides/175846, https://libguides.mit.edu/materials
- oai:libguides.com:guides/175847, https://libguides.mit.edu/c.php?g=175847

Libguides example:
"https://libguides.mit.edu/materials" -> "materials"
Example researchdatabases OAI identifier and <dc:identifier>:
- oai:libguides.com:az/65257807, https://libguides.mit.edu/llba

AZ (Research Database) example:
"https://libguides.mit.edu/llba" -> "llba"
It is preferable to split the OAI header identifier and use this as the TIMDEX
record id, but then take the dc:identifier wholesale and use this for the source
link.

Args:
xml: A BeautifulSoup Tag representing a single Springshare OAI DC XML record.
source_base_url: Source base URL.
source_record_id: Record identifier for the source record.
xml: A BeautifulSoup Tag representing a single XML record.
"""

return str(xml.find("dc:identifier").string).split("/")[-1]
return str(xml.find("dc:identifier").string)
49 changes: 46 additions & 3 deletions transmogrifier/sources/transformer.py
Original file line number Diff line number Diff line change
Expand Up @@ -117,11 +117,16 @@ def get_required_fields(self, xml: Tag) -> dict:
xml: A BeautifulSoup Tag representing a single OAI-PMH XML record.
"""
source_record_id = self.get_source_record_id(xml)

# run methods to generate required fields
source_link = self.get_source_link(self.source_base_url, source_record_id, xml)
timdex_record_id = self.get_timdex_record_id(self.source, source_record_id, xml)
Comment on lines +122 to +123
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: you could call these directly in the dict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change if you'd like, and I considered that too. But at the risk of a being a bit more verbose, I thought it helped break it up a bit. There is some important stuff happening here -- establishing record title, primary timdex id, link, etc. -- and I found it helpful to see those steps as distinct statements vs embedded in a dictionary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional is always just a suggestion 🙂

title = self.get_valid_title(source_record_id, xml)

return {
"source": self.source_name,
"source_link": self.source_base_url + source_record_id,
"timdex_record_id": f"{self.source}:{source_record_id.replace('/', '-')}",
"source_link": source_link,
"timdex_record_id": timdex_record_id,
"title": title,
}

Expand Down Expand Up @@ -180,7 +185,7 @@ def get_valid_title(cls, source_record_id: str, xml: Tag) -> str:
source_record_id,
all_titles,
)
if all_titles and type(all_titles[0]) == str:
if all_titles and isinstance(all_titles[0], str):
title = all_titles[0]
elif all_titles and all_titles[0].string:
title = all_titles[0].string
Expand All @@ -191,3 +196,41 @@ def get_valid_title(cls, source_record_id: str, xml: Tag) -> str:
)
title = "Title not provided"
return title

@classmethod
def get_source_link(
cls, source_base_url: str, source_record_id: str, xml: Tag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need xml as a parameter here if it's not being used? Couldn't the method just be overridden in the source transforms with an xml parameter when it's needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, and I had also paused on this. I think if these are getting decoupled with the understanding transformers may override them, it's plausible they might need the document XML, so I included here.

Because the base transform calls this method, it will always use the same signature, so a sub-class transformer couldn't add xml:Tag on their own.

We could adopt a kwargs approach, where the sub-class transformer could pass anything they want as named arguments, but we'd lose the typing this provides. Other than xml:Tag, it's hard to guess what they may want to pass?

See pros and cons both ways. My thinking was to model these after other methods that are designed to overridden, like get_valid_title() or get_source_record_id() which both pass xml:Tag.

Copy link
Contributor Author

@ghukill ghukill Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if helpful, we could add the docstring comment:

xml: A BeautifulSoup Tag representing a single XML record.
    - not used by default implementation, but could be useful for subclass
      overrides

which get_timdex_record_id() contains below. I had actually intended to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just add the comment, I did some more reading and see your point on losing the type hinting

) -> str:
"""
Class method to set the source link for the item.

May be overridden by source subclasses if needed.

Default behavior is to concatenate the source base URL + source record id.

Args:
source_base_url: Source base URL.
source_record_id: Record identifier for the source record.
xml: A BeautifulSoup Tag representing a single XML record.
- not used by default implementation, but could be useful for subclass
overrides
"""
return source_base_url + source_record_id

@classmethod
def get_timdex_record_id(cls, source: str, source_record_id: str, xml: Tag) -> str:
"""
Class method to set the TIMDEX record id.

May be overridden by source subclasses if needed.

Default behavior is to concatenate the source name + source record id.

Args:
source: Source name.
source_record_id: Record identifier for the source record.
xml: A BeautifulSoup Tag representing a single XML record.
- not used by default implementation, but could be useful for subclass
overrides
"""
return f"{source}:{source_record_id.replace('/', '-')}"
Loading