Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Metropolitan Museum of Art to use ProviderDataIngester #674

Merged
merged 30 commits into from
Aug 30, 2022
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
c7e1903
very first draft refactor with additional tags
rwidom Aug 16, 2022
eca69f5
linting
rwidom Aug 16, 2022
f1988ce
update dag docs
rwidom Aug 17, 2022
62a02ff
address issue 641 missing titles
rwidom Aug 17, 2022
0de8ae7
update notes in dag docs
rwidom Aug 17, 2022
db7a861
csv handling for data quality analysis
rwidom Aug 17, 2022
c42c8bc
artist name, title, and commenting
rwidom Aug 17, 2022
75aaf10
Use the image store for filetype
rwidom Aug 19, 2022
16d51e0
add creator url options to comments
rwidom Aug 19, 2022
ae99654
use set not list with in
rwidom Aug 19, 2022
6ee7f87
move api url commentary to the doc string
rwidom Aug 19, 2022
f431932
keep default delay
rwidom Aug 19, 2022
67f6673
more concise if no title, use object name
rwidom Aug 19, 2022
8ac2941
remove unnecessary code from debugging
rwidom Aug 19, 2022
b54f605
use zip to make loops more legible
rwidom Aug 19, 2022
bc55683
use if/else to skip falsey titles
rwidom Aug 19, 2022
5cca07b
keep production start date
rwidom Aug 19, 2022
bb98bdd
Merge branch 'main' into issue/588-refactor-met-museum-to-ingester
rwidom Aug 19, 2022
57ec40e
regenerate dag docs
rwidom Aug 19, 2022
1c2e0f5
adding none and {} tests
rwidom Aug 19, 2022
e441b79
add test IDs for complex paramters
rwidom Aug 19, 2022
e78b6d9
use or to make json navigation more concise
rwidom Aug 19, 2022
8bec026
add args for class param
rwidom Aug 20, 2022
248b1a1
name config parameters explicitly
rwidom Aug 22, 2022
90d3727
spell out metropolitan in default provider const.
rwidom Aug 22, 2022
3658dc4
Merge branch 'main' into issue/588-refactor-met-museum-to-ingester
rwidom Aug 24, 2022
1dcc93d
clarify objects vs records
rwidom Aug 24, 2022
a1670e5
default to empty dict for query params if no date
rwidom Aug 24, 2022
3f7765e
only one batch per dag run
rwidom Aug 25, 2022
17d0aff
additional pytest parameter set names
rwidom Aug 26, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 17 additions & 2 deletions DAGs.md
Original file line number Diff line number Diff line change
Expand Up @@ -349,8 +349,23 @@ ETL Process: Use the API to identify all CC0 artworks.
Output: TSV file containing the image, their respective
meta-data.

Notes: https://metmuseum.github.io/
No rate limit specified.
Notes: https://metmuseum.github.io/#search
"Please limit requests to 80 requests per second." May need to
bump up the delay (e.g. to 3 seconds), to avoid of blocking
during local development testing.

Some analysis to improve data quality was conducted using a
separate csv file here: https://github.com/metmuseum/openaccess

Get a list of object IDs:
https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10
Get a specific object:
https://collectionapi.metmuseum.org/public/collection/v1/objects/1027
The search functionality requires a specific query (term search)
in addition to date and public domain. It seems like it won't
connect with just date and license.
https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07




Expand Down
1 change: 1 addition & 0 deletions openverse_catalog/dags/common/loader/provider_details.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
SMITHSONIAN_DEFAULT_PROVIDER = "smithsonian"
BROOKLYN_DEFAULT_PROVIDER = "brooklynmuseum"
CLEVELAND_DEFAULT_PROVIDER = "clevelandmuseum"
METROPOLITAN_MUSEUM_DEFAULT_PROVIDER = "met"
VICTORIA_DEFAULT_PROVIDER = "museumsvictoria"
NYPL_DEFAULT_PROVIDER = "nypl"
RAWPIXEL_DEFAULT_PROVIDER = "rawpixel"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,138 +6,167 @@
Output: TSV file containing the image, their respective
meta-data.

Notes: https://metmuseum.github.io/
No rate limit specified.
Notes: https://metmuseum.github.io/#search
"Please limit requests to 80 requests per second." May need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for always adding excellent notes ✨

bump up the delay (e.g. to 3 seconds), to avoid of blocking
during local development testing.

Some analysis to improve data quality was conducted using a
separate csv file here: https://github.com/metmuseum/openaccess

Get a list of object IDs:
https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10
Get a specific object:
https://collectionapi.metmuseum.org/public/collection/v1/objects/1027
The search functionality requires a specific query (term search)
in addition to date and public domain. It seems like it won't
connect with just date and license.
https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07

"""

import argparse
import logging

from common.licenses import get_license_info
from common.requester import DelayedRequester
from common.storage.image import ImageStore

from common.loader import provider_details as prov
from provider_data_ingester import ProviderDataIngester

DELAY = 1.0 # time delay (in seconds)
PROVIDER = "met"
ENDPOINT = "https://collectionapi.metmuseum.org/public/collection/v1/objects"
DEFAULT_LICENSE_INFO = get_license_info(license_="cc0", license_version="1.0")

logging.basicConfig(
format="%(asctime)s - %(name)s - %(levelname)s: %(message)s", level=logging.INFO
)
logger = logging.getLogger(__name__)

delayed_requester = DelayedRequester(DELAY)
image_store = ImageStore(provider=PROVIDER)


def main(date=None):
"""
This script pulls the data for a given date from the Metropolitan
Museum of Art API, and writes it into a .TSV file to be eventually
read into our DB.

Required Arguments:

date: Date String in the form YYYY-MM-DD. This is the date for
which running the script will pull data.
"""

logger.info(f"Begin: Met Museum API requests for date: {date}")

fetch_the_object_id = _get_object_ids(date)
if fetch_the_object_id:
logger.info(f"Total object found {fetch_the_object_id[0]}")
_extract_the_data(fetch_the_object_id[1])

total_images = image_store.commit()
logger.info(f"Total CC0 images received {total_images}")


def _get_object_ids(date, endpoint=ENDPOINT):
query_params = ""
if date:
query_params = {"metadataDate": date}

response = _get_response_json(query_params, endpoint)

if response:
total_object_ids = response["total"]
object_ids = response["objectIDs"]
else:
logger.warning("No content available")
return None
return [total_object_ids, object_ids]


def _get_response_json(
query_params,
endpoint,
retries=5,
):
response_json = delayed_requester.get_response_json(
endpoint, query_params=query_params, retries=retries
)

return response_json


def _extract_the_data(object_ids):
for i in object_ids:
_get_data_for_image(i)


def _get_data_for_image(object_id):
object_json = _get_and_validate_object_json(object_id)
if not object_json:
logger.warning(f"Could not retrieve object_json for object_id: {object_id}")
return

main_image = object_json.get("primaryImage")
other_images = object_json.get("additionalImages", [])
image_list = [main_image] + other_images

meta_data = _create_meta_data(object_json)

for img in image_list:
foreign_id = _build_foreign_id(object_id, img)
image_store.add_item(
foreign_landing_url=object_json.get("objectURL"),
image_url=img,
license_info=DEFAULT_LICENSE_INFO,
foreign_identifier=foreign_id,
creator=object_json.get("artistDisplayName"),
title=object_json.get("title"),
meta_data=meta_data,
class MetMuseumDataIngester(ProviderDataIngester):
providers = {"image": prov.METROPOLITAN_MUSEUM_DEFAULT_PROVIDER}
endpoint = "https://collectionapi.metmuseum.org/public/collection/v1/objects"
DEFAULT_LICENSE_INFO = get_license_info(license_="cc0", license_version="1.0")

# adding args for automatically generated parameters from generate_tsv_filenames
def __init__(self, conf: dict = None, date: str = None):
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in python 3 this can just be:

Suggested change
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date)
super().__init__(conf=conf, date=date)

We don't need to specify the parent class unless the Module Resolution Order needs to be specifically navigated:

The parameterless call to super() is recommended and sufficient for most use cases, and needing to change the search hierarchy regularly could be indicative of a larger design issue. 1

Footnotes

  1. https://realpython.com/python-super/#a-super-deep-dive

self.retries = 5

self.query_param = None
if date:
self.query_param = {"metadataDate": date}

# this seems like useful information to track for context on the existing load
# metrics, but just adding them to the log in aggregate for now rather than
# logging each record individually or doing something fancier in airflow.
Comment on lines +58 to +60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

self.object_ids_retrieved = 0 # total object IDs based on date
self.non_cc0_objects = 0 # number checked and ignored because of licensing

def get_next_query_params(self, prev_query_params=None):
return self.query_param

def get_batch_data(self, response_json):
if response_json:
self.object_ids_retrieved = response_json["total"]
logger.info(f"Total objects found {self.object_ids_retrieved}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't incremented, so this will always be the number of ids retrieved in that batch, is that right? Would it make more sense to add to the total each time, or else change the wording of the log message?

Copy link
Collaborator Author

@rwidom rwidom Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The met museum returns a json with two things: a list of object ids, and the length of that list. The additional API calls are just to get the details about any given object ID, not to get another list of object IDs. So, I think it's ok to just take this from the source once for each dag run.

Copy link
Contributor

@stacimc stacimc Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is reporting once for every batch, indicating the number of ids in that batch as opposed to the total number ingested by the run. I see it multiple times in the logs for a particular dagrun. Is that the intention, or am I wrong about what that number means?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Maybe I'm confused about what a batch means and how it's functioning here. My sense is that there should only be one batch per dag-run. I'll take another look.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this enforces the one batch per dagrun thing! 3f7765e

return response_json["objectIDs"]
else:
logger.warning("No content available")
return None

def get_record_data(self, object_id):
object_endpoint = f"{self.endpoint}/{object_id}"
object_json = self.delayed_requester.get_response_json(
object_endpoint, self.retries
)


def _get_and_validate_object_json(object_id, endpoint=ENDPOINT):
object_endpoint = f"{endpoint}/{object_id}"
object_json = _get_response_json(None, object_endpoint)
if not object_json.get("isPublicDomain"):
logger.warning("CC0 license not detected")
object_json = None
return object_json


def _build_foreign_id(object_id, image_url):
unique_identifier = image_url.split("/")[-1].split(".")[0]
return f"{object_id}-{unique_identifier}"


def _create_meta_data(object_json):
meta_data = {
"accession_number": object_json.get("accessionNumber"),
"classification": object_json.get("classification"),
"culture": object_json.get("culture"),
"date": object_json.get("objectDate"),
"medium": object_json.get("medium"),
"credit_line": object_json.get("creditLine"),
}
meta_data = {k: v for k, v in meta_data.items() if v is not None}
return meta_data
if object_json.get("isPublicDomain") is False:
self.non_cc0_objects += 1
if self.non_cc0_objects % self.batch_limit == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool :) What do you think about also reporting the final number at the end of ingestion?

logger.info(f"Retrieved {self.non_cc0_objects} non-CC0 records.")
return None

main_image = object_json.get("primaryImage")
other_images = object_json.get("additionalImages", [])
image_list = [main_image] + other_images

meta_data = self._get_meta_data(object_json)
raw_tags = self._get_tag_list(object_json)
title = self._get_title(object_json)
artist = self._get_artist_name(object_json)

# We aren't currently populating creator_url. In theory we could url encode
# f"https://collectionapi.metmuseum.org/public/collection/v1/search?artistOrCulture={artist}"
# per API guide here: https://metmuseum.github.io/#search
# but it seems fairly buggy (i.e. nonresponsive), at least when tested with
# "Chelsea Porcelain Manufactory" and "Minton(s)" and "Jean Pucelle"
# Should we use artistWikidata_URL or artistULAN_URL? They're populated approx
# 65% of the time.

return [
{
"foreign_landing_url": object_json.get("objectURL"),
"image_url": img,
"license_info": self.DEFAULT_LICENSE_INFO,
"foreign_identifier": self._get_foreign_id(object_id, img),
"creator": artist,
"title": title,
"meta_data": meta_data,
"raw_tags": raw_tags,
}
for img in image_list
]

def _get_foreign_id(self, object_id: int, image_url: str):
unique_identifier = image_url.split("/")[-1].split(".")[0]
return f"{object_id}-{unique_identifier}"

def _get_meta_data(self, object_json):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but this and other methods that don't access self can be declared static.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, what are the benefits of making that declaration? The main one I'm familiar with is being able to call them from outside a specific instance of the class. Would that make testing more efficient in this case? Are there other benefits I'm not thinking of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually a good practice to make functions that you know won't need to access or modify the class's state via self, staticmethods. This makes the scope clear to folks reading the function, can make testing easier in certain cases (since you don't need to instantiate a class in order to test that specific function), and prevents cases where a function that seems like it should be static but is not explicitly set as such has access to self and can thus modify class state (either intentionally or unintentionally).

if object_json is None:
return
if object_json.get("accessionNumber"):
return {
"accession_number": object_json.get("accessionNumber"),
}

def _get_tag_list(self, object_json):
if object_json is None:
return
tag_list = [
tag
for tag in [
object_json.get("department"),
object_json.get("medium"),
object_json.get("culture"),
object_json.get("objectName"),
self._get_artist_name(object_json),
object_json.get("classification"),
object_json.get("objectDate"),
object_json.get("creditLine"),
object_json.get("period"),
]
if tag
]
if object_json.get("tags"):
tag_list += [tag["term"] for tag in object_json.get("tags")]
return tag_list
stacimc marked this conversation as resolved.
Show resolved Hide resolved

def _get_title(self, record):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

if record is not None:
# Use or to skip false-y (empty) titles: ""
return record.get("title") or record.get("objectName")

def _get_artist_name(self, record):
stacimc marked this conversation as resolved.
Show resolved Hide resolved
if record is None:
return
return record.get("artistDisplayName")

def get_media_type(self, record):
# This provider only supports Images.
return "image"


def main(date: str):
logger.info("Begin: Metropolitan Museum data ingestion")
ingester = MetMuseumDataIngester(date)
ingester.ingest_records()


if __name__ == "__main__":
Expand Down
2 changes: 2 additions & 0 deletions openverse_catalog/dags/providers/provider_workflows.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
from providers.provider_api_scripts.cleveland_museum import ClevelandDataIngester
from providers.provider_api_scripts.finnish_museums import FinnishMuseumsDataIngester
from providers.provider_api_scripts.inaturalist import INaturalistDataIngester
from providers.provider_api_scripts.metropolitan_museum import MetMuseumDataIngester
from providers.provider_api_scripts.museum_victoria import VictoriaDataIngester
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester
from providers.provider_api_scripts.science_museum import ScienceMuseumDataIngester
Expand Down Expand Up @@ -132,6 +133,7 @@ def __post_init__(self):
),
ProviderWorkflow(
provider_script="metropolitan_museum",
ingester_class=MetMuseumDataIngester,
start_date=datetime(2016, 9, 1),
schedule_string="@daily",
dated=True,
Expand Down
Loading