-
Notifications
You must be signed in to change notification settings - Fork 54
Refactor Metropolitan Museum of Art to use ProviderDataIngester #674
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice and clean!
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
tag_list += [tag["term"] for tag in object_json.get("tags")] | ||
return tag_list | ||
|
||
def _get_title(self, record): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
# Treating "unidentified" the same as missing, but maybe it would be useful in | ||
# in search? Maybe in the art world it has a more specific outsider art | ||
# connotation? | ||
if artist in ["Unidentified", "Unidentified artist"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Madison usually says that it's better to use a set
when checking with in
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 Yes, that makes total sense, I'll fix it tonight. Though hopefully for a static list of two distinct items it wouldn't make a huge difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it actually does make a noticeable difference! Who knew? ~7%!
(openverse) % python3 -m timeit '[n in [1, 2] for n in range(4)]'
1000000 loops, best of 5: 390 nsec per loop
(openverse) % python3 -m timeit '[n in {1, 2} for n in range(4)]'
1000000 loops, best of 5: 360 nsec per loop
@@ -0,0 +1,158 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, @rwidom!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much!
Co-authored-by: Olga Bulat <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic, thank you for tackling it!
# get a list of object IDs for this week: | ||
# https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10 | ||
# get a specific object: | ||
# https://collectionapi.metmuseum.org/public/collection/v1/objects/1027 | ||
# The search functionality requires a specific query (term search) in addition to date | ||
# and public domain. It seems like it won't connect with just date and license. | ||
# https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be great to go into the docstring!
def __init__(self, date: str = None): | ||
super(MetMuseumDataIngester, self).__init__(date=date) | ||
self.retries = 5 | ||
self.delayed_requester._DELAY = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're comfortable with it, I think it'd be okay to leave this at the 1 second default rather than bumping it up to 3. We run this DAG every day and haven't encountered any issues or rate limiting. I'd love to keep your comments in the docstring though, just so we know if we do encounter rate limiting going forward we have a known way to address it.
Also, I'll make an issue for changing the delay from a private to a public attribute. There are definitely cases where we'll want to bump it up or adjust it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, I made https://github.com/WordPress/openverse-catalog/issues/687 as well!
# this seems like useful information to track for context on the existing load | ||
# metrics, but just adding them to the log in aggregate for now rather than | ||
# logging each record individually or doing something fancier in airflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
tests/dags/providers/provider_api_scripts/test_metropolitan_museum.py
Outdated
Show resolved
Hide resolved
"case_name, response_json, expected", | ||
[ | ||
( | ||
"single image", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be reasonable to add the empty case for this and the next 3 tests? E.g. response_json
is {}
or None
"case_name, response_json, expected", | ||
[ | ||
( | ||
"single image", | ||
single_object_response, | ||
single_expected_data[0].get("meta_data"), | ||
), | ||
("multi images", full_object_response, full_expected_data[0].get("meta_data")), | ||
], | ||
) | ||
def test_get_meta_data(case_name, response_json, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like in these tests that case_name
isn't actually used. If you're looking to document the particular test cases, you can add comments to each parameter set:
"case_name, response_json, expected", | |
[ | |
( | |
"single image", | |
single_object_response, | |
single_expected_data[0].get("meta_data"), | |
), | |
("multi images", full_object_response, full_expected_data[0].get("meta_data")), | |
], | |
) | |
def test_get_meta_data(case_name, response_json, expected): | |
"response_json, expected", | |
[ | |
# Single image response | |
( | |
single_object_response, | |
single_expected_data[0].get("meta_data"), | |
), | |
# Multi-image response | |
(images", full_object_response, full_expected_data[0].get("meta_data")), | |
], | |
) | |
def test_get_meta_data(response_json, expected): |
Here's another example:
openverse-catalog/tests/dags/providers/test_factory_utils.py
Lines 97 to 116 in ee474f2
################################################## | |
# Function-based | |
################################################## | |
# Happy path | |
(fake_provider_module.main, ["image"], [fake_provider_module.image_store]), | |
# Multiple types | |
( | |
fake_provider_module.main, | |
["image", "audio"], | |
[fake_provider_module.image_store, fake_provider_module.audio_store], | |
), | |
# Empty case, no media types provided | |
(fake_provider_module.main, [], []), | |
################################################## | |
# Class-based | |
################################################## | |
# Happy path | |
(FakeDataIngesterClass, ["image", "audio"], list(fdi.media_stores.values())), | |
# No media types provided, ingester class still supplies stores | |
(FakeDataIngesterClass, 2, list(fdi.media_stores.values())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to get the test case description to appear in the results if the test failed, but I think comments would get dropped in the abbreviation. I'll try it, or is there a better way to accomplish that? Maybe it's just making the values themselves describe the case, where there is a string value... ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! If you're looking to edit the test IDs specifically and how they show up in the node names during testing, here's a few ways to do that: https://docs.pytest.org/en/7.1.x/example/parametrize.html#different-options-for-test-ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much!
tests/dags/providers/provider_api_scripts/test_metropolitan_museum.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
return tag_list | ||
|
||
def _get_title(self, record): | ||
return record.get("title", record.get("objectName")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AetherUnbound , I should have tested before committing the suggestion here. It looks like this really only works for None
, where ""
evaluates to False
as well. :/ Some of the API responses have all of the fields, but then an empty string, and I want to be able to handle that situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! That totally makes sense, you could potentially do return record.get("title") or record.get("objectName")
. I was going to suggest that at first but thought that using the default might be better 😅 The or
syntax will appropriately handle empty strings I believe!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new changes look great! I tried running this locally (with the same altered start date you originally committed) but every run I attempt has skipped 🤔 Does that happen on your end? This should be pulling at least a little bit of data every day.
Yes. It wasn't happening before, but things got weird after I merged in the changes from main. See 8bec026 which is there to fix an issue with the file name generation step. I'm going to put back the delay and the more recent start date and see if I can figure out what's going on. |
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
Notes: https://metmuseum.github.io/ | ||
No rate limit specified. | ||
Notes: https://metmuseum.github.io/#search | ||
"Please limit requests to 80 requests per second." May need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for always adding excellent notes ✨
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
Big thanks to @stacimc, this seems to be all set now. I just needed to properly add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran this locally with success! Looking great! Only one extra comment.
|
||
# adding args for automatically generated parameters from generate_tsv_filenames | ||
def __init__(self, conf: dict = None, date: str = None): | ||
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW in python 3 this can just be:
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date) | |
super().__init__(conf=conf, date=date) |
We don't need to specify the parent class unless the Module Resolution Order needs to be specifically navigated:
The parameterless call to super() is recommended and sufficient for most use cases, and needing to change the search hierarchy regularly could be indicative of a larger design issue. 1
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking so good! 🥳 Just a few notes mostly the unexpected behavior with the tags. Edit: The tags look fine, my bad 😄
I love the improvements to the data quality, although I'm a tiny bit hesitant to make too many extra changes while refactoring for fear of having to revert all of it if something unexpected happens. Since Metropolitan is dated
, we should make a note that the changes won't be backfilled until reingestion workflows are up again. I think it should be fine in this case though 👍
I'd just like to confirm a couple of things, maybe with some of the more frontend-knowledgeable folks:
- Is it okay to remove the information that was previously in
meta_data
and move it toraw_tags
? I'm not 100% sure how that metadata was being used. - Are there any concerns with the number and type of tags being added? I'd certainly imagine the more relevant tags the merrier, but I'm pretty sure this would be a big change so may be worth mentioning. 🤞
return meta_data | ||
if object_json.get("isPublicDomain") is False: | ||
self.non_cc0_objects += 1 | ||
if self.non_cc0_objects % self.batch_limit == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool :) What do you think about also reporting the final number at the end of ingestion?
def get_batch_data(self, response_json): | ||
if response_json: | ||
self.object_ids_retrieved = response_json["total"] | ||
logger.info(f"Total objects found {self.object_ids_retrieved}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't incremented, so this will always be the number of ids retrieved in that batch, is that right? Would it make more sense to add to the total each time, or else change the wording of the log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The met museum returns a json with two things: a list of object ids, and the length of that list. The additional API calls are just to get the details about any given object ID, not to get another list of object IDs. So, I think it's ok to just take this from the source once for each dag run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is reporting once for every batch, indicating the number of ids in that batch as opposed to the total number ingested by the run. I see it multiple times in the logs for a particular dagrun. Is that the intention, or am I wrong about what that number means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... Maybe I'm confused about what a batch means and how it's functioning here. My sense is that there should only be one batch per dag-run. I'll take another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this enforces the one batch per dagrun thing! 3f7765e
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Outdated
Show resolved
Hide resolved
unique_identifier = image_url.split("/")[-1].split(".")[0] | ||
return f"{object_id}-{unique_identifier}" | ||
|
||
def _get_meta_data(self, object_json): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but this and other methods that don't access self
can be declared static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, what are the benefits of making that declaration? The main one I'm familiar with is being able to call them from outside a specific instance of the class. Would that make testing more efficient in this case? Are there other benefits I'm not thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's usually a good practice to make functions that you know won't need to access or modify the class's state via self
, staticmethod
s. This makes the scope clear to folks reading the function, can make testing easier in certain cases (since you don't need to instantiate a class in order to test that specific function), and prevents cases where a function that seems like it should be static but is not explicitly set as such has access to self
and can thus modify class state (either intentionally or unintentionally).
openverse_catalog/dags/providers/provider_api_scripts/metropolitan_museum.py
Show resolved
Hide resolved
@rwidom I take back the only blocking suggestion I had, as it was a misunderstanding on my part. I'm just going to try to verify my two questions about moving tags out of |
Yes, absolutely. I would love to know the answers here as well. I somehow got the impression that metadata does not make it to elastic search, that title and tags are the default information in any search, but I could be totally wrong and it's definitely something I want to learn more about, @stacimc . |
Looking at the infrastructure repo, it doesn't seem like we do any indexing on the $ rg -g "*.py" meta_data
ingestion_server/authority.py
12:The authority can be set from the catalog layer through the meta_data field
ingestion_server/elasticsearch_models.py
93: meta = row[schema["meta_data"]]
135: def get_license_url(meta_data):
140: if meta_data and "license_url" in meta_data:
141: return meta_data["license_url"]
146: def get_maturity(meta_data, api_maturity_flag):
149: :param meta_data: The metadata column, which may have a 'mature'
153: we will ignore the meta_data column and mark the work 'mature'.
157: if meta_data and "mature" in meta_data:
158: _mature = meta_data["mature"]
164: def get_authority_boost(meta_data, source):
166: if meta_data and "authority_boost" in meta_data:
168: authority_boost = float(meta_data["authority_boost"])
241: meta = row[schema["meta_data"]]
323: meta = row[schema["meta_data"]] There are zero references to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the changes from meta_data
to tags
should be okay -- and thanks @AetherUnbound for checking as well. Very exciting!
We'll keep an eye on this to be safe, but since this is a dated DAG and reingestion is not yet enabled, it will naturally roll out only to new data for now. 🤞 I hope this is the beginning of much needed relevancy improvements across more of our providers, thanks @rwidom!
Fixes
Fixes WordPress/openverse#1487 by @obulat
Fixes WordPress/openverse#1519 by @stacimc
Description
This refactors the metropolitan museum to use the provider data ingester class, and hopefully improve search results by populating title and tags more reliably.
Testing Instructions
When I was testing the dag in airflow, I kept hitting non-responses, until I set the delay at 3 seconds, rather than the standard 1 second. I'm not sure how sustainable that will be for production, since we need to hit the API once for each and every record, including non-CC0 records that we do not load. So, I do recommend testing out the dag in airflow, but with the rate limit caveat.
And then go to
db-shell
and check that title and tags are always populated, try clicking through some of the urls, and so on. For example...Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin