Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Metropolitan Museum of Art to use ProviderDataIngester #674

Merged
merged 30 commits into from
Aug 30, 2022

Conversation

rwidom
Copy link
Collaborator

@rwidom rwidom commented Aug 18, 2022

Fixes

Fixes WordPress/openverse#1487 by @obulat
Fixes WordPress/openverse#1519 by @stacimc

Description

This refactors the metropolitan museum to use the provider data ingester class, and hopefully improve search results by populating title and tags more reliably.

Testing Instructions

When I was testing the dag in airflow, I kept hitting non-responses, until I set the delay at 3 seconds, rather than the standard 1 second. I'm not sure how sustainable that will be for production, since we need to hit the API once for each and every record, including non-CC0 records that we do not load. So, I do recommend testing out the dag in airflow, but with the rate limit caveat.

image

*Provider*: `metropolitan_museum`
*Date range*: 2022-08-08 -> 2022-08-09
*Duration of data pull task*: _No data_
*Number of records upserted per media type*:
  - `image`: 9,857 _(985 duplicate foreign IDs, 2,758 duplicate URLs)_

[2022-08-17, 17:46:31 UTC] {taskinstance.py:1395} INFO - Marking task as SUCCESS. dag_id=metropolitan_museum_workflow, task_id=report_load_completion, execution_date=20220808T000000, start_date=20220817T174631, end_date=20220817T174631
[2022-08-17, 17:46:31 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-08-17, 17:46:31 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check

And then go to db-shell and check that title and tags are always populated, try clicking through some of the urls, and so on. For example...

image

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@rwidom rwidom requested a review from a team as a code owner August 18, 2022 00:09
@rwidom rwidom requested review from krysal and AetherUnbound August 18, 2022 00:09
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟧 priority: high Stalls work on the project or its dependents labels Aug 18, 2022
Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice and clean!

tag_list += [tag["term"] for tag in object_json.get("tags")]
return tag_list

def _get_title(self, record):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

# Treating "unidentified" the same as missing, but maybe it would be useful in
# in search? Maybe in the art world it has a more specific outsider art
# connotation?
if artist in ["Unidentified", "Unidentified artist"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Madison usually says that it's better to use a set when checking with in :)

Copy link
Collaborator Author

@rwidom rwidom Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Yes, that makes total sense, I'll fix it tonight. Though hopefully for a static list of two distinct items it wouldn't make a huge difference.

Copy link
Contributor

@AetherUnbound AetherUnbound Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, you know me too well @obulat 😁 And you're right @rwidom, it won't make a huge difference here but it's a good habit to get into because it can have significant performance benefits when there are tons of iterations happening!

Copy link
Collaborator Author

@rwidom rwidom Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it actually does make a noticeable difference! Who knew? ~7%!

(openverse) % python3 -m timeit '[n in [1, 2] for n in range(4)]'
1000000 loops, best of 5: 390 nsec per loop
(openverse) % python3 -m timeit '[n in {1, 2} for n in range(4)]'
1000000 loops, best of 5: 360 nsec per loop

@@ -0,0 +1,158 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, @rwidom!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much!

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic, thank you for tackling it!

Comment on lines 31 to 37
# get a list of object IDs for this week:
# https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10
# get a specific object:
# https://collectionapi.metmuseum.org/public/collection/v1/objects/1027
# The search functionality requires a specific query (term search) in addition to date
# and public domain. It seems like it won't connect with just date and license.
# https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be great to go into the docstring!

def __init__(self, date: str = None):
super(MetMuseumDataIngester, self).__init__(date=date)
self.retries = 5
self.delayed_requester._DELAY = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're comfortable with it, I think it'd be okay to leave this at the 1 second default rather than bumping it up to 3. We run this DAG every day and haven't encountered any issues or rate limiting. I'd love to keep your comments in the docstring though, just so we know if we do encounter rate limiting going forward we have a known way to address it.

Also, I'll make an issue for changing the delay from a private to a public attribute. There are definitely cases where we'll want to bump it up or adjust it!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally. Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +54 to +56
# this seems like useful information to track for context on the existing load
# metrics, but just adding them to the log in aggregate for now rather than
# logging each record individually or doing something fancier in airflow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

openverse_catalog/dags/providers/provider_workflows.py Outdated Show resolved Hide resolved
"case_name, response_json, expected",
[
(
"single image",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be reasonable to add the empty case for this and the next 3 tests? E.g. response_json is {} or None

Comment on lines 65 to 75
"case_name, response_json, expected",
[
(
"single image",
single_object_response,
single_expected_data[0].get("meta_data"),
),
("multi images", full_object_response, full_expected_data[0].get("meta_data")),
],
)
def test_get_meta_data(case_name, response_json, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like in these tests that case_name isn't actually used. If you're looking to document the particular test cases, you can add comments to each parameter set:

Suggested change
"case_name, response_json, expected",
[
(
"single image",
single_object_response,
single_expected_data[0].get("meta_data"),
),
("multi images", full_object_response, full_expected_data[0].get("meta_data")),
],
)
def test_get_meta_data(case_name, response_json, expected):
"response_json, expected",
[
# Single image response
(
single_object_response,
single_expected_data[0].get("meta_data"),
),
# Multi-image response
(images", full_object_response, full_expected_data[0].get("meta_data")),
],
)
def test_get_meta_data(response_json, expected):

Here's another example:

##################################################
# Function-based
##################################################
# Happy path
(fake_provider_module.main, ["image"], [fake_provider_module.image_store]),
# Multiple types
(
fake_provider_module.main,
["image", "audio"],
[fake_provider_module.image_store, fake_provider_module.audio_store],
),
# Empty case, no media types provided
(fake_provider_module.main, [], []),
##################################################
# Class-based
##################################################
# Happy path
(FakeDataIngesterClass, ["image", "audio"], list(fdi.media_stores.values())),
# No media types provided, ingester class still supplies stores
(FakeDataIngesterClass, 2, list(fdi.media_stores.values())),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to get the test case description to appear in the results if the test failed, but I think comments would get dropped in the abbreviation. I'll try it, or is there a better way to accomplish that? Maybe it's just making the values themselves describe the case, where there is a string value... ??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! If you're looking to edit the test IDs specifically and how they show up in the node names during testing, here's a few ways to do that: https://docs.pytest.org/en/7.1.x/example/parametrize.html#different-options-for-test-ids

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much!

return tag_list

def _get_title(self, record):
return record.get("title", record.get("objectName"))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AetherUnbound , I should have tested before committing the suggestion here. It looks like this really only works for None, where "" evaluates to False as well. :/ Some of the API responses have all of the fields, but then an empty string, and I want to be able to handle that situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! That totally makes sense, you could potentially do return record.get("title") or record.get("objectName"). I was going to suggest that at first but thought that using the default might be better 😅 The or syntax will appropriately handle empty strings I believe!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Thank you.

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new changes look great! I tried running this locally (with the same altered start date you originally committed) but every run I attempt has skipped 🤔 Does that happen on your end? This should be pulling at least a little bit of data every day.

@rwidom
Copy link
Collaborator Author

rwidom commented Aug 20, 2022

The new changes look great! I tried running this locally (with the same altered start date you originally committed) but every run I attempt has skipped 🤔 Does that happen on your end? This should be pulling at least a little bit of data every day.

Yes. It wasn't happening before, but things got weird after I merged in the changes from main. See 8bec026 which is there to fix an issue with the file name generation step. I'm going to put back the delay and the more recent start date and see if I can figure out what's going on.

@rwidom rwidom marked this pull request as draft August 20, 2022 10:19
openverse_catalog/dags/common/loader/provider_details.py Outdated Show resolved Hide resolved
Notes: https://metmuseum.github.io/
No rate limit specified.
Notes: https://metmuseum.github.io/#search
"Please limit requests to 80 requests per second." May need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for always adding excellent notes ✨

@rwidom rwidom marked this pull request as ready for review August 22, 2022 21:45
@rwidom
Copy link
Collaborator Author

rwidom commented Aug 22, 2022

The new changes look great! I tried running this locally (with the same altered start date you originally committed) but every run I attempt has skipped 🤔 Does that happen on your end? This should be pulling at least a little bit of data every day.

Big thanks to @stacimc, this seems to be all set now. I just needed to properly add conf parameters from #650

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran this locally with success! Looking great! Only one extra comment.


# adding args for automatically generated parameters from generate_tsv_filenames
def __init__(self, conf: dict = None, date: str = None):
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in python 3 this can just be:

Suggested change
super(MetMuseumDataIngester, self).__init__(conf=conf, date=date)
super().__init__(conf=conf, date=date)

We don't need to specify the parent class unless the Module Resolution Order needs to be specifically navigated:

The parameterless call to super() is recommended and sufficient for most use cases, and needing to change the search hierarchy regularly could be indicative of a larger design issue. 1

Footnotes

  1. https://realpython.com/python-super/#a-super-deep-dive

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking so good! 🥳 Just a few notes mostly the unexpected behavior with the tags. Edit: The tags look fine, my bad 😄

I love the improvements to the data quality, although I'm a tiny bit hesitant to make too many extra changes while refactoring for fear of having to revert all of it if something unexpected happens. Since Metropolitan is dated, we should make a note that the changes won't be backfilled until reingestion workflows are up again. I think it should be fine in this case though 👍

I'd just like to confirm a couple of things, maybe with some of the more frontend-knowledgeable folks:

  • Is it okay to remove the information that was previously in meta_data and move it to raw_tags? I'm not 100% sure how that metadata was being used.
  • Are there any concerns with the number and type of tags being added? I'd certainly imagine the more relevant tags the merrier, but I'm pretty sure this would be a big change so may be worth mentioning. 🤞

return meta_data
if object_json.get("isPublicDomain") is False:
self.non_cc0_objects += 1
if self.non_cc0_objects % self.batch_limit == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool :) What do you think about also reporting the final number at the end of ingestion?

def get_batch_data(self, response_json):
if response_json:
self.object_ids_retrieved = response_json["total"]
logger.info(f"Total objects found {self.object_ids_retrieved}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't incremented, so this will always be the number of ids retrieved in that batch, is that right? Would it make more sense to add to the total each time, or else change the wording of the log message?

Copy link
Collaborator Author

@rwidom rwidom Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The met museum returns a json with two things: a list of object ids, and the length of that list. The additional API calls are just to get the details about any given object ID, not to get another list of object IDs. So, I think it's ok to just take this from the source once for each dag run.

Copy link
Contributor

@stacimc stacimc Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is reporting once for every batch, indicating the number of ids in that batch as opposed to the total number ingested by the run. I see it multiple times in the logs for a particular dagrun. Is that the intention, or am I wrong about what that number means?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Maybe I'm confused about what a batch means and how it's functioning here. My sense is that there should only be one batch per dag-run. I'll take another look.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this enforces the one batch per dagrun thing! 3f7765e

unique_identifier = image_url.split("/")[-1].split(".")[0]
return f"{object_id}-{unique_identifier}"

def _get_meta_data(self, object_json):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but this and other methods that don't access self can be declared static.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, what are the benefits of making that declaration? The main one I'm familiar with is being able to call them from outside a specific instance of the class. Would that make testing more efficient in this case? Are there other benefits I'm not thinking of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually a good practice to make functions that you know won't need to access or modify the class's state via self, staticmethods. This makes the scope clear to folks reading the function, can make testing easier in certain cases (since you don't need to instantiate a class in order to test that specific function), and prevents cases where a function that seems like it should be static but is not explicitly set as such has access to self and can thus modify class state (either intentionally or unintentionally).

@stacimc
Copy link
Contributor

stacimc commented Aug 24, 2022

@rwidom I take back the only blocking suggestion I had, as it was a misunderstanding on my part. I'm just going to try to verify my two questions about moving tags out of meta_data and populating a large list, and then this will LGTM! Thanks for your patience as always, I would be so excited to get these improvements in ✨

@rwidom rwidom requested a review from a team as a code owner August 24, 2022 20:57
@rwidom
Copy link
Collaborator Author

rwidom commented Aug 24, 2022

I'd just like to confirm a couple of things, maybe with some of the more frontend-knowledgeable folks:

* Is it okay to remove the information that was previously in `meta_data` and move it to `raw_tags`? I'm not 100% sure how that metadata was being used.

* Are there any concerns with the number and type of tags being added? I'd certainly _imagine_ the more relevant tags the merrier, but I'm pretty sure this would be a big change so may be worth mentioning. 🤞

Yes, absolutely. I would love to know the answers here as well. I somehow got the impression that metadata does not make it to elastic search, that title and tags are the default information in any search, but I could be totally wrong and it's definitely something I want to learn more about, @stacimc .

@AetherUnbound
Copy link
Contributor

Is it okay to remove the information that was previously in meta_data and move it to raw_tags? I'm not 100% sure how that metadata was being used.

Looking at the infrastructure repo, it doesn't seem like we do any indexing on the meta_data and AFAICT the only thing it's used for is the authority boost:

$ rg -g "*.py" meta_data
ingestion_server/authority.py
12:The authority can be set from the catalog layer through the meta_data field

ingestion_server/elasticsearch_models.py
93:        meta = row[schema["meta_data"]]
135:    def get_license_url(meta_data):
140:        if meta_data and "license_url" in meta_data:
141:            return meta_data["license_url"]
146:    def get_maturity(meta_data, api_maturity_flag):
149:        :param meta_data: The metadata column, which may have a 'mature'
153:        we will ignore the meta_data column and mark the work 'mature'.
157:        if meta_data and "mature" in meta_data:
158:            _mature = meta_data["mature"]
164:    def get_authority_boost(meta_data, source):
166:        if meta_data and "authority_boost" in meta_data:
168:                authority_boost = float(meta_data["authority_boost"])
241:        meta = row[schema["meta_data"]]
323:        meta = row[schema["meta_data"]]

There are zero references to authority_boost in the catalog repo, so I suspect that's not something we're using at all currently. I think moving that info into the tags, if you believe it would help with search relevancy, is a good idea 🙂

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the changes from meta_data to tags should be okay -- and thanks @AetherUnbound for checking as well. Very exciting!

We'll keep an eye on this to be safe, but since this is a dated DAG and reingestion is not yet enabled, it will naturally roll out only to new data for now. 🤞 I hope this is the beginning of much needed relevancy improvements across more of our providers, thanks @rwidom!

@rwidom rwidom merged commit 8ac257c into main Aug 30, 2022
@rwidom rwidom deleted the issue/588-refactor-met-museum-to-ingester branch August 30, 2022 10:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟧 priority: high Stalls work on the project or its dependents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Metropolitan Museum to use ProviderDataIngester Correct title extraction for Metropolitan museum
5 participants