This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Europeana to use ProviderDataIngester #821

Merged
merged 9 commits on Oct 27, 2022

Conversation

sarayourfriend
Contributor

Fixes

Fixes WordPress/openverse#1523 by @stacimc

Description

  1. Refactors Europeana's provider script to use the ProviderDataIngester base class.
  2. Adds a simple default implementation for get_media_type to the base class to remove boilerplate for providers that only have a single media type. This seemed like an easy improvement to me but there may be a reason why this isn't a good default implementation.

Testing Instructions

Check out the unit tests. Many of them had to be removed and new ones added, because they either tested functionality that no longer exists (it is abstracted away) or needed to cover new functionality. I tried to keep the existing tests for the data transformation as close as possible to the originals; doing so caught some bugs in my refactor, which was very helpful! I did not add any new data transformation tests, as the existing ones seemed comprehensive enough to me.

Finally, run the DAG. Follow the instructions in the README to run Airflow locally and then run the DAG. You can run it without an API key for testing. Currently this doesn't work though, at least not for me locally, due to an issue with the filename generation task. cc @stacimc for help debugging this. Here's the log output from the task:

[2022-10-24, 04:38:11 UTC] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 193, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/providers/factory_utils.py", line 66, in generate_tsv_filenames
    stores = _load_provider_script_stores(ingestion_callable, media_types)
  File "/usr/local/airflow/openverse_catalog/dags/providers/factory_utils.py", line 37, in _load_provider_script_stores
    raise ValueError(
ValueError: Expected stores in providers.provider_api_scripts.europeana were missing: ['image']

I haven't dug into this yet but it seems like probably a configuration issue on my end.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested a review from a team as a code owner October 24, 2022 04:40
@sarayourfriend sarayourfriend changed the title Refactor/europeana provider base class Refactor Europeana provider script to use ingester base class Oct 24, 2022
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Oct 24, 2022
logger = logging.getLogger(__name__)
logging.getLogger(common.urls.__name__).setLevel(logging.WARNING)
Contributor

Should we set the urls logging to WARNING in all provider scripts?

@stacimc
Contributor

stacimc commented Oct 24, 2022

Currently this doesn't work though, at least not for me locally, due to an issue with the filename generation task. cc @stacimc for help debugging this.

I think you need to update the ProviderWorkflow configuration for Europeana, here by adding your new class as the ingestion_callable. That should take care of it!

Since Europeana is dated, we'll also need to add it to the ProviderReingestionWorkflow configuration here. That should be all you need to make the reingestion workflow work as well, but if there's any issue that can always be a separate PR!
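For testers following along, the configuration change might look roughly like this. This is a sketch using a stand-in dataclass; the real `ProviderWorkflow` class lives in the catalog's provider workflow module and has more fields, so treat the names here as illustrative:

```python
from dataclasses import dataclass


# Stand-in for the catalog's ProviderWorkflow configuration class
# (illustrative only; the real one has additional fields and defaults).
@dataclass
class ProviderWorkflow:
    provider_script: str
    ingestion_callable: type
    media_types: tuple = ("image",)


class EuropeanaDataIngester:
    """Stand-in for the refactored ingester class added in this PR."""


# Registering the new class as the ingestion_callable is the change
# suggested above for the "Expected stores ... were missing" error.
EUROPEANA_WORKFLOW = ProviderWorkflow(
    provider_script="europeana",
    ingestion_callable=EuropeanaDataIngester,
)
```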

@AetherUnbound AetherUnbound changed the title Refactor Europeana provider script to use ingester base class Refactor Europeana to use ProviderDataIngester Oct 24, 2022
Contributor

@AetherUnbound AetherUnbound left a comment

This is great, thanks for your additional contributions to the base class & testing mocks! I have a few small comments. I think the big required changes are the ones Staci has mentioned 🙂

Comment on lines +377 to +386
etc.) If a provider only supports a single media type, this method defaults
to returning the only media type defined in the ``providers`` attribute.
"""
-        pass
+        if len(self.providers) == 1:
+            return list(self.providers.keys())[0]
+
+        raise NotImplementedError(
+            "Provider scripts that support multiple media types "
+            "must provide an override for ``get_media_type``."
+        )
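In runnable form, the default behavior quoted above works like this (the class and provider names below are illustrative, not the catalog's actual ones):

```python
class ProviderDataIngester:
    # Maps media type -> provider name; subclasses define this.
    providers: dict = {}

    def get_media_type(self, record: dict) -> str:
        # Default: a provider with exactly one configured media type
        # always returns that type.
        if len(self.providers) == 1:
            return list(self.providers.keys())[0]
        raise NotImplementedError(
            "Provider scripts that support multiple media types "
            "must provide an override for ``get_media_type``."
        )


class SingleTypeIngester(ProviderDataIngester):
    providers = {"image": "europeana"}


class MultiTypeIngester(ProviderDataIngester):
    providers = {"image": "wikimedia", "audio": "wikimedia"}
```

`SingleTypeIngester().get_media_type({})` returns `"image"`, while `MultiTypeIngester` must override the method or it raises `NotImplementedError`.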
Contributor

Great idea!

Comment on lines 35 to 38
"image_url": data.get("edmIsShownBy")[0],
"foreign_identifier": data.get("id"),
"meta_data": self.get_meta_data_dict(data),
"title": data.get("title")[0],
Contributor

For both of these get(<x>)[0] calls, it's possible get could return None, which would cause the indexing into None to raise an error. Could we add defaults or extra checking steps for those two fields here?

Contributor Author

I considered this when writing it, but if either title or image_url are missing for something, wouldn't we want it to fail out? Is there a different way to fail it out that works better with Airflow's expectations?

Contributor

I think we'd want to skip this record in that case, but not necessarily fail the entire ingestion run. A lot of our providers operate under the assumption that attributes missing from a particular record, even if necessary, would just be ignored and processing would continue. I feel like that might be the best approach here: log that something which was expected to be present was missing, then return None from the get_record_data function.

Contributor Author

Right on, makes sense! I wonder if we'd want some kind of "incomplete record store" or something, but that's a discussion for a different day 🙂

Do you happen to know the complete list of fields that are required for a particular record to be valid? Maybe this is something the base class could handle instead by verifying that all required fields are truthy (not None and != "", I suppose)?

Contributor

@obulat responded well to this in another comment on this PR, and I added a note there too about recently added documentation 😄

Comment on lines 158 to 160
if response_json.get("success") != "True":
    logger.warning('Request failed with ``success = "False"``')
    return False
Contributor

I think this logic might be better suited in get_batch_data, since get_should_continue is typically concerned with the token/cursor iteration rather than error handling. @stacimc what do you think?

Contributor Author

@sarayourfriend sarayourfriend Oct 24, 2022

What would you suggest should change here, specifically? Would get_batch_data just return an empty list in this case?

I don't personally see a big difference here... but I like that all the "is this response usable and can we move forward" logic is encapsulated in a single method rather than spread across two methods. But there may be an issue with making sure these methods are called in an order that makes sense for each check. If "success" is False then we can't process the batch at all (in addition to not continuing, to maintain behaviour parity). 🤔

Contributor

Yes, I would suggest moving L158 and L159 here to get_batch_data, and adding a return None within the condition rather than return False (although False would work too, because the check that follows only tests falsiness anyway). Here are some examples from other providers:

My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data should return None. The get_should_continue mechanism is for more peculiar cases (like this one, or Wikimedia) where the API additionally provides some token or other mechanism which tells us whether we should continue processing.

Contributor Author

My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data should return None

Interesting! This makes sense, but then my instinct is that get_batch_data should raise an error in these cases that is caught and handled differently. Returning None seems ambiguous to me (though this is just my first time working on a provider script, so take that with a grain of salt).

I'll make the changes you suggested, thanks, Madison.

Contributor

For context on this, these scripts have historically operated in a manner that prioritizes gathering available records and preventing errors as much as possible. This was because (previously) an error condition meant that however many hours of processing had already occurred had to be done again if we wanted to re-run a given day. We'd still get all the data we had gathered so far, but we didn't have a mechanism until #650 (with further improvement coming via #702) to pick up at a specific set of query params.

I think maintaining that resiliency within the scripts can be a good thing! However, particularly as we finally get everything back up and running, we can start operating more under the principle of letting things break, because that tells us our assumptions about the API responses were wrong.

Contributor

@stacimc stacimc left a comment

This is fantastic 🥳 I like your handling of the cursor and the RecordBuilder. The default implementation of get_media_type is 💯 as well!

After updating the ProviderWorkflow configuration for Europeana, I was able to test this locally and compare it to the output of the legacy implementation. Everything looked good except for missing description in the metadata, which I mentioned in one of the comments. If you want to reproduce that, I was running the DAG for 2022-11-03.

This doesn't necessarily need to change in this PR, but a note for testers: the start date for Europeana is 2011-09-01, which means it starts backfilling from 2011 and does not get any data. (My fault, I set overly cautious start dates a while back.) I adjusted this locally to a more reasonable 2020-11-01, which starts the backfill from around the time the script was last run.

)
def get_next_query_params(self, prev_query_params) -> dict:
    if not prev_query_params:
        return self.base_request_body
Contributor

From looking at the original code, do you not need "cursor": "*" in the initial request? It seems like it's working fine as is, though 🤔

Contributor Author

Added back in 1ea1a50
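Restoring the initial wildcard cursor could look roughly like the sketch below. The `base_request_body` attribute comes from this PR's diff; the `cursor` attribute and its bookkeeping are illustrative assumptions about how the ingester tracks the value returned by each response:

```python
class EuropeanaIngesterSketch:
    # Illustrative base request body; the real one carries the API key,
    # query, and other fixed parameters.
    base_request_body = {"wskey": "test", "query": "*"}

    def __init__(self):
        # Updated from each response's nextCursor while processing.
        self.cursor = None

    def get_next_query_params(self, prev_query_params):
        if not prev_query_params:
            # Initial request: the Europeana API starts cursor-based
            # pagination with cursor="*".
            return {**self.base_request_body, "cursor": "*"}
        # Subsequent requests reuse the params with the latest cursor.
        return {**prev_query_params, "cursor": self.cursor}
```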

def get_record_data(self, data: dict) -> dict:
    record = {
        "foreign_landing_url": self.get_foreign_landing_url(data),
        "image_url": data.get("edmIsShownBy")[0],
Contributor

Is it possible for this (or foreign_landing_url, foreign_id, or license_info) to be None? We typically get the required fields first and return None early from get_record_data if they cannot be extracted.

Contributor Author

Are those fields allowed to be None? 😮

Contributor

@obulat obulat Oct 25, 2022

In other scripts, if a required value is None, we stop the current item's step early by using either return None or continue/break inside a loop. Here is an example:

foreign_id = data.get("id")
if foreign_id is None:
    return None

or

for image in image_info:
    foreign_id = image.get("id")
    if foreign_id is None:
        continue

Contributor Author

@sarayourfriend sarayourfriend Oct 25, 2022

I'd love a list of the required fields, maybe on the base class, to reference at the end of the get_record_data method... or potentially as something that the base class itself checks. With that list we could iterate through the key/values and if any required fields are empty (None or "") we could debug log information about the record and skip it. These kinds of checks seem like boilerplate and distributed, undocumented knowledge that the base class would be perfect for.

@stacimc since you're the original architect here, is that something that would make sense to add? Anything to keep in mind if we do so or any good reasons not to that should be documented?

Contributor

The list of required fields is in the ImageStore.add_item method's docs (the same is available in the AudioStore):

Required Arguments:
    foreign_landing_url: URL of page where the image lives on the
        source website.
    image_url:           Direct link to the image file
    license_info:        LicenseInfo object that has
        - the URL of the license for the image,
        - string representation of the license,
        - version of the license,
        - raw license URL provided by the provider,
          if different from the canonical URL

Do you think there's a better place for this list?

The checks are boilerplate, but not boilerplate enough to extract them because all API responses are different, and the required pieces of information can be located in different stages of the script. The idea behind returning None is to fail early and skip the unnecessary work of extracting the metadata (which can be quite tedious for some API responses) if the item does not have the required fields and will be discarded anyways.

Contributor

Staci also recently added documentation around this exact thing as part of #790, which can be found here: https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/docs/data_models.md

It's also a goal of the Data Normalization project (I believe) to improve on this understanding and ensure that we have documentation around it.

Contributor

@stacimc stacimc Oct 25, 2022

@obulat's summary is exactly right -- these fields being None shouldn't cause errors. I believe the MediaStore should successfully discard them (if I'm wrong and that's not the case I agree we should add this behavior in the Data normalization project). The point of the None checks is mostly to halt processing of a record early if it's just going to be discarded anyway, as a general best practice.

We could update the base class to, instead of having just an abstract get_record_data, have abstract get_required_record_data and get_optional_record_data methods. Or even individual abstract methods for fetching each required field (get_license_info, get_foreign_identifier...). Then we could make get_record_data on the base class do the None checks for us.

I'll make an issue.

Contributor

#834

Also update record builder tests to not call individual methods, removing the assumption that the record builder methods operate individually. Making all methods other than `get_record_data` on the record builder class private solidifies this.

def _get_license_url(self, license_field) -> str | None:
    if len(license_field) > 1:
        logger.warning("More than one license field found")
Contributor

@obulat obulat Oct 25, 2022

Not sure if it should be done in this PR, but I think this log can be downgraded to debug.

@sarayourfriend
Contributor Author

@stacimc @AetherUnbound @obulat please re-review at your leisure. I've added code to short circuit record building when certain fields are empty.

image_list, next_cursor, total_number_of_images = _get_image_list(
    start_timestamp, end_timestamp, cursor
)
def get_record_data(self, data: dict) -> dict:
Contributor

Suggested change
- def get_record_data(self, data: dict) -> dict:
+ def get_record_data(self, data: dict) -> dict | None:

def _get_foreign_identifier(self, data: dict) -> str | None:
    return data.get("id")

@raise_if_empty
Contributor

A title is not required for a media item, some of them have "" as the title. This might not be needed for Europeana, but is necessary for some providers who have user-provided items (such as Flickr).

Contributor

Agreed -- I'm not sure if it's absolutely necessary to remove this check for Europeana or if we can assume all records will have a title, but since this will halt ingestion entirely if it raises I'd prefer to remove this check.

Contributor

@obulat obulat left a comment

I'm approving because I wouldn't want to block a working script, but I'm not sure I like the current implementation of short-circuiting.

I don't know if it's just because I'm more used to the old way, but it seems like the checks for None are much shorter than the decorated functions that replace them here. Overall, it seems more difficult to read than the old 'check for None and return early' way.

Comment on lines 12 to 15
RESOURCES = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), "resources/europeana"
)

Contributor

Suggested change
- RESOURCES = os.path.join(
-     os.path.abspath(os.path.dirname(__file__)), "resources/europeana"
- )
+ RESOURCES = Path(__file__).parent / "resources/europeana"

@@ -1,22 +1,18 @@
import json
import logging
import os
Contributor

Suggested change
- import os
+ from pathlib import Path

    format="%(asctime)s - %(name)s - %(levelname)s: %(message)s",
    level=logging.DEBUG,
)


def _get_resource_json(json_name):
    with open(os.path.join(RESOURCES, json_name)) as f:
Contributor

Suggested change
- with open(os.path.join(RESOURCES, json_name)) as f:
+ with open(RESOURCES / json_name) as f:

@sarayourfriend
Contributor Author

I don't know if it's just because I'm more used to the old way, but it seems like the checks for None are much shorter than the decorated functions that replace them here. Overall, it seems more difficult to read than the old 'check for None and return early' way.

That's fair. I guess I was trying to avoid this kind of runaround (just a pseudo-code example):

foreign_identifier = data.get("guid")
if not foreign_identifier:
    logger.warning(f"foreign_identifier was empty: {foreign_identifier}")
    return None

# ... etc
# then much later in the function

return {
    "foreign_identifier": foreign_identifier,
    # etc... repeat x3? x4?
}

The decorator approach feels like it gets rid of the boilerplate checks a bit and also automatically adds logging around it. It's definitely not as straightforward, so I can appreciate the sense you have about it being more complicated and harder to follow. As a pattern I feel like it might be an easier to manage approach, or at least one that ties into one of the suggestions @stacimc made of having the base class require methods like get_foreign_identifier and then allow the base class to check these things...

I could very well be over-engineering this though and am totally open to changing it if others are on the same page as you, Olga.
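For readers without the diff handy, the decorator pattern under discussion is roughly the following. This is a reconstruction based on the `@raise_if_empty` name visible in the diff, not the exact code from this PR:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def raise_if_empty(fn):
    # Wrap a field getter so an empty result fails loudly instead of
    # silently producing a partial record.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        value = fn(*args, **kwargs)
        if value is None or value == "":
            logger.warning("%s returned an empty value", fn.__name__)
            raise ValueError(f"{fn.__name__} returned an empty value")
        return value
    return wrapper


@raise_if_empty
def get_title(data: dict) -> str:
    return data.get("title", "")
```

As the reviewers note below the diff, raising halts ingestion entirely, which is why they suggest dropping this check for fields like title that can legitimately be empty.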

@stacimc
Contributor

stacimc commented Oct 27, 2022

Regarding the decorators -- I like them personally, but I understand having some reservations especially about consistency among providers and the possibility that it might be more confusing to implement for a newcomer to the project.

As a pattern I feel like it might be an easier to manage approach, or at least one that ties into one of the suggestions @stacimc made of having the base class require methods like get_foreign_identifier and then allow the base class to check these things...

This is exactly what I'm thinking -- I would love to use this approach in the base class as part of WordPress/openverse#1378. If the pattern needs to be re-implemented in each individual provider as we're doing currently, then I favor recommending the more straightforward old approach.

I'm personally fine with merging this into Europeana for now, though. How about making a suggestion on WordPress/openverse#1378 to extract the logic from Europeana when we work on that issue? I'll add some notes there.

Contributor

@stacimc stacimc left a comment

This works great! I'd prefer to remove the exception on empty titles but from looking at the API docs I don't believe it should be an issue for Europeana in the same way that it would for more user-generated content, so I don't think anything here should block merge. Really nice work, I love the additions to the base class 😍

Contributor

@AetherUnbound AetherUnbound left a comment

I think I'm with the general consensus here - I do appreciate the decorator approach, but doing so in just this provider could make it more confusing and less approachable for folks (especially since all of the other providers follow a different pattern). That said, I'd also be okay merging this as-is.

My only concern is there seems to be an output file that was erroneously committed!

Contributor

@AetherUnbound AetherUnbound left a comment

🚀 Awesome work!!
