This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Europeana to use ProviderDataIngester #821

Merged
merged 9 commits on Oct 27, 2022

Conversation

sarayourfriend
Contributor

Fixes

Fixes WordPress/openverse#1523 by @stacimc

Description

  1. Refactors Europeana's provider script to use the ProviderDataIngester base class.
  2. Adds a simple default implementation for get_media_type to the base class to remove boilerplate for providers that only have a single media type. This seemed like an easy improvement to me but there may be a reason why this isn't a good default implementation.

Testing Instructions

Check out the unit tests. Many of them had to be removed and new ones added, because they either tested functionality that no longer exists (it is abstracted away) or needed to cover new functionality. I tried to keep the existing tests for the data transformation as close as possible to the originals; doing so caught some bugs in my refactor, which was very helpful! I did not add any new data transformation tests, as the existing ones seemed comprehensive enough to me.

Finally, run the DAG. Follow the instructions in the README to run Airflow locally and then run the DAG. You can run it without an API key for testing. Currently this doesn't work though, at least not for me locally, due to an issue with the filename generation task. cc @stacimc for help debugging this. Here's the log output from the task:

[2022-10-24, 04:38:11 UTC] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 193, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/providers/factory_utils.py", line 66, in generate_tsv_filenames
    stores = _load_provider_script_stores(ingestion_callable, media_types)
  File "/usr/local/airflow/openverse_catalog/dags/providers/factory_utils.py", line 37, in _load_provider_script_stores
    raise ValueError(
ValueError: Expected stores in providers.provider_api_scripts.europeana were missing: ['image']

I haven't dug into this yet but it seems like probably a configuration issue on my end.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested a review from a team as a code owner October 24, 2022 04:40
@sarayourfriend sarayourfriend changed the title Refactor/europeana provider base class Refactor Europeana provider script to use ingester base class Oct 24, 2022
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Oct 24, 2022
logger = logging.getLogger(__name__)
logging.getLogger(common.urls.__name__).setLevel(logging.WARNING)
Contributor

Should we set the urls logging to WARNING in all provider scripts?

@stacimc
Contributor

stacimc commented Oct 24, 2022

Currently this doesn't work though, at least not for me locally, due to an issue with the filename generation task. cc @stacimc for help debugging this.

I think you need to update the ProviderWorkflow configuration for Europeana, here by adding your new class as the ingestion_callable. That should take care of it!

Since Europeana is dated, we'll also need to add it to the ProviderReingestionWorkflow configuration here. That should be all you need to make the reingestion workflow work as well, but if there's any issue that can always be a separate PR!
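For testers following along, the configuration change might look roughly like this. This is a sketch using a stand-in dataclass; the real `ProviderWorkflow` class lives in the catalog's provider workflow module and has more fields, so treat the names here as illustrative:

```python
from dataclasses import dataclass


# Stand-in for the catalog's ProviderWorkflow configuration class
# (illustrative only; the real one has additional fields and defaults).
@dataclass
class ProviderWorkflow:
    provider_script: str
    ingestion_callable: type
    media_types: tuple = ("image",)


class EuropeanaDataIngester:
    """Stand-in for the refactored ingester class added in this PR."""


# Registering the new class as the ingestion_callable is the change
# suggested above for the "Expected stores ... were missing" error.
EUROPEANA_WORKFLOW = ProviderWorkflow(
    provider_script="europeana",
    ingestion_callable=EuropeanaDataIngester,
)
```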

@AetherUnbound AetherUnbound changed the title Refactor Europeana provider script to use ingester base class Refactor Europeana to use ProviderDataIngester Oct 24, 2022
Contributor

@AetherUnbound AetherUnbound left a comment

This is great, thanks for your additional contributions to the base class & testing mocks! I have a few small comments. I think the big required changes are the ones Staci has mentioned 🙂

Comment on lines +377 to +386
etc.) If a provider only supports a single media type, this method defaults
to returning the only media type defined in the ``providers`` attribute.
"""
-        pass
+        if len(self.providers) == 1:
+            return list(self.providers.keys())[0]
+
+        raise NotImplementedError(
+            "Provider scripts that support multiple media types "
+            "must provide an override for ``get_media_type``."
+        )
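In runnable form, the default behavior quoted above works like this (the class and provider names below are illustrative, not the catalog's actual ones):

```python
class ProviderDataIngester:
    # Maps media type -> provider name; subclasses define this.
    providers: dict = {}

    def get_media_type(self, record: dict) -> str:
        # Default: a provider with exactly one configured media type
        # always returns that type.
        if len(self.providers) == 1:
            return list(self.providers.keys())[0]
        raise NotImplementedError(
            "Provider scripts that support multiple media types "
            "must provide an override for ``get_media_type``."
        )


class SingleTypeIngester(ProviderDataIngester):
    providers = {"image": "europeana"}


class MultiTypeIngester(ProviderDataIngester):
    providers = {"image": "wikimedia", "audio": "wikimedia"}
```

`SingleTypeIngester().get_media_type({})` returns `"image"`, while `MultiTypeIngester` must override the method or it raises `NotImplementedError`.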
Contributor

Great idea!

Comment on lines 35 to 38
"image_url": data.get("edmIsShownBy")[0],
"foreign_identifier": data.get("id"),
"meta_data": self.get_meta_data_dict(data),
"title": data.get("title")[0],
Contributor

For both of these get(<x>)[0] calls, it's possible get could return None, which would cause the indexing into None to raise an error. Could we add defaults or extra checking steps for those two fields here?

Contributor Author

I considered this when writing it, but if either title or image_url are missing for something, wouldn't we want it to fail out? Is there a different way to fail it out that works better with Airflow's expectations?

Contributor

I think we'd want to skip this record in that case, but not necessarily fail the entire ingestion run. A lot of our providers operate under the assumption that attributes missing from a particular record, even if necessary, would just be ignored and processing would continue. I feel like that might be the best approach here: log that something which was expected to be present was missing, then return None from the get_record_data function.

Contributor Author

Right on, makes sense! I wonder if we'd want some kind of "incomplete record store" or something, but that's a discussion for a different day 🙂

Do you happen to know the complete list of fields that are required for a particular record to be valid? Maybe this is something the base class could handle instead by verifying that all required fields are truthy (not None and != "", I suppose)?

Contributor

@obulat responded well to this in another comment on this PR, and I added a note there too about recently added documentation 😄

Comment on lines 158 to 160
if response_json.get("success") != "True":
    logger.warning('Request failed with ``success = "False"``')
    return False
Contributor

I think this logic might be better suited in get_batch_data, since get_should_continue is typically concerned with the token/cursor iteration rather than error handling. @stacimc what do you think?

Contributor Author

@sarayourfriend sarayourfriend Oct 24, 2022

What would you suggest should change here, specifically? Would get_batch_data just return an empty list in this case?

I don't personally see a big difference here... but I like that all the "is this response usable and can we move forward" logic is encapsulated in a single method rather than spread across two methods. But there may be an issue with making sure these methods are called in an order that makes sense for each check. If "success" is False then we can't process the batch at all (in addition to not continuing, to maintain behaviour parity). 🤔

Contributor

Yes, I would suggest moving L158 and L159 here to get_batch_data, and adding a return None within the condition rather than return False (although False would work too, because the check that follows only tests falsiness anyway). Here are some examples from other providers:

My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data should return None. The get_should_continue mechanism is for more peculiar cases (like this one, or Wikimedia) where the API additionally provides some token or other mechanism which tells us whether we should continue processing.

Contributor Author

My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data should return None

Interesting! This makes sense, but then my instinct is that get_batch_data should raise an error in these cases that is caught and handled differently. Returning None seems ambiguous to me (though this is just my first time working on a provider script, so take that with a grain of salt).

I'll make the changes you suggested, thanks, Madison.

Contributor

For context on this, these scripts have historically operated in a manner that prioritizes gathering available records and preventing errors as much as possible. This was because (previously) an error condition meant that however many hours of processing had already occurred had to be done again if we wanted to re-run a given day. We'd still get all the data we had gathered so far, but we didn't have a mechanism until #650 (with further improvement coming via #702) to pick up at a specific set of query params.

I think maintaining that resiliency within the scripts can be a good thing! However, particularly as we finally get everything back up and running, we can start operating more under the principle of letting things break, because that tells us our assumptions about the API responses were wrong.

Contributor

@stacimc stacimc left a comment

This is fantastic 🥳 I like your handling of the cursor and the RecordBuilder. The default implementation of get_media_type is 💯 as well!

After updating the ProviderWorkflow configuration for Europeana, I was able to test this locally and compare it to the output of the legacy implementation. Everything looked good except for missing description in the metadata, which I mentioned in one of the comments. If you want to reproduce that, I was running the DAG for 2022-11-03.

This doesn't necessarily need to change in this PR, but a note for testers: the start date for Europeana is 2011-09-01, which means it starts backfilling from 2011 and does not get any data. (My fault, I set overly cautious start dates a while back.) I adjusted this locally to a more reasonable 2020-11-01, which starts the backfill from around the time the script was last run.

)
def get_next_query_params(self, prev_query_params) -> dict:
    if not prev_query_params:
        return self.base_request_body
Contributor

From looking at the original code, do you not need "cursor": "*" in the initial request? It seems like it's working fine as is, though 🤔

Contributor Author

Added back in 1ea1a50
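Restoring the initial wildcard cursor could look roughly like the sketch below. The `base_request_body` attribute comes from this PR's diff; the `cursor` attribute and its bookkeeping are illustrative assumptions about how the ingester tracks the value returned by each response:

```python
class EuropeanaIngesterSketch:
    # Illustrative base request body; the real one carries the API key,
    # query, and other fixed parameters.
    base_request_body = {"wskey": "test", "query": "*"}

    def __init__(self):
        # Updated from each response's nextCursor while processing.
        self.cursor = None

    def get_next_query_params(self, prev_query_params):
        if not prev_query_params:
            # Initial request: the Europeana API starts cursor-based
            # pagination with cursor="*".
            return {**self.base_request_body, "cursor": "*"}
        # Subsequent requests reuse the params with the latest cursor.
        return {**prev_query_params, "cursor": self.cursor}
```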

def get_record_data(self, data: dict) -> dict:
    record = {
        "foreign_landing_url": self.get_foreign_landing_url(data),
        "image_url": data.get("edmIsShownBy")[0],
Contributor

Is it possible for this (or foreign_landing_url, foreign_id, or license_info) to be None? We typically get the required fields first and return None early from get_record_data if they cannot be extracted.

Contributor Author

Are those fields allowed to be None? 😮

Contributor

@obulat obulat Oct 25, 2022

In other scripts, if a required value is None, we stop the current item's step early by using either return None or continue/break inside a loop. Here is an example:

foreign_id = data.get("id")
if foreign_id is None:
    return None

or

for image in image_info:
    foreign_id = image.get("id")
    if foreign_id is None:
        continue

Contributor Author

@sarayourfriend sarayourfriend Oct 25, 2022

I'd love a list of the required fields, maybe on the base class, to reference at the end of the get_record_data method... or potentially as something that the base class itself checks. With that list we could iterate through the key/values and if any required fields are empty (None or "") we could debug log information about the record and skip it. These kinds of checks seem like boilerplate and distributed, undocumented knowledge that the base class would be perfect for.

@stacimc since you're the original architect here, is that something that would make sense to add? Anything to keep in mind if we do so or any good reasons not to that should be documented?

Contributor

The list of required fields is in the ImageStore.add_item method's docs (the same is available in the AudioStore):

Required Arguments:
    foreign_landing_url: URL of page where the image lives on the
        source website.
    image_url:           Direct link to the image file
    license_info:        LicenseInfo object that has
        - the URL of the license for the image,
        - string representation of the license,
        - version of the license,
        - raw license URL provided by the provider,
          if different from the canonical URL

Do you think there's a better place for this list?

The checks are boilerplate, but not boilerplate enough to extract them because all API responses are different, and the required pieces of information can be located in different stages of the script. The idea behind returning None is to fail early and skip the unnecessary work of extracting the metadata (which can be quite tedious for some API responses) if the item does not have the required fields and will be discarded anyways.

Contributor

Staci also recently added documentation around this exact thing as part of #790, which can be found here: https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/docs/data_models.md

It's also a goal of the Data Normalization project (I believe) to improve on this understanding and ensure that we have documentation around it.

Contributor

@stacimc stacimc Oct 25, 2022

@obulat's summary is exactly right -- these fields being None shouldn't cause errors. I believe the MediaStore should successfully discard them (if I'm wrong and that's not the case I agree we should add this behavior in the Data normalization project). The point of the None checks is mostly to halt processing of a record early if it's just going to be discarded anyway, as a general best practice.

We could update the base class to, instead of having just an abstract get_record_data, have abstract get_required_record_data and get_optional_record_data methods. Or even individual abstract methods for fetching each required field (get_license_info, get_foreign_identifier...). Then we could make get_record_data on the base class do the None checks for us.

I'll make an issue.

Contributor

#834

Also update record builder tests to not call individual methods, removing the assumption that the record builder methods operate individually. Making all methods other than `get_record_data` on the record builder class private solidifies this.

def _get_license_url(self, license_field) -> str | None:
    if len(license_field) > 1:
        logger.warning("More than one license field found")
Contributor

@obulat obulat Oct 25, 2022

Not sure if it should be done in this PR, but I think this log can be downgraded to debug.

@sarayourfriend
Contributor Author

@stacimc @AetherUnbound @obulat please re-review at your leisure. I've added code to short circuit record building when certain fields are empty.

image_list, next_cursor, total_number_of_images = _get_image_list(
    start_timestamp, end_timestamp, cursor
)
def get_record_data(self, data: dict) -> dict:
Contributor

Suggested change
- def get_record_data(self, data: dict) -> dict:
+ def get_record_data(self, data: dict) -> dict | None:

def _get_foreign_identifier(self, data: dict) -> str | None:
    return data.get("id")

@raise_if_empty
Contributor

A title is not required for a media item, some of them have "" as the title. This might not be needed for Europeana, but is necessary for some providers who have user-provided items (such as Flickr).

Contributor

Agreed -- I'm not sure if it's absolutely necessary to remove this check for Europeana or if we can assume all records will have a title, but since this will halt ingestion entirely if it raises I'd prefer to remove this check.

Contributor

@obulat obulat left a comment

I'm approving because I wouldn't want to block a working script, but I'm not sure I like the current implementation of short-circuiting.

I don't know if it's just because I'm more used to the old way, but it seems like the checks for None are much shorter than the decorated functions that replace them here. Overall, it seems more difficult to read than the old 'check for None and return early' way.

Comment on lines 12 to 15
RESOURCES = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), "resources/europeana"
)

Contributor

Suggested change
- RESOURCES = os.path.join(
-     os.path.abspath(os.path.dirname(__file__)), "resources/europeana"
- )
+ RESOURCES = Path(__file__).parent / "resources/europeana"

@@ -1,22 +1,18 @@
import json
import logging
import os
Contributor

Suggested change
- import os
+ from pathlib import Path

    format="%(asctime)s - %(name)s - %(levelname)s: %(message)s",
    level=logging.DEBUG,
)


def _get_resource_json(json_name):
    with open(os.path.join(RESOURCES, json_name)) as f:
Contributor

Suggested change
- with open(os.path.join(RESOURCES, json_name)) as f:
+ with open(RESOURCES / json_name) as f:

@sarayourfriend
Contributor Author

I don't know if it's just because I'm more used to the old way, but it seems like the checks for None are much shorter than the decorated functions that replace them here. Overall, it seems more difficult to read than the old 'check for None and return early' way.

That's fair. I guess I was trying to avoid this kind of runaround (just a pseudo-code example):

foreign_identifier = data.get("guid")
if not foreign_identifier:
    logger.warning(f"foreign_identifier was empty: {foreign_identifier}")
    return None

# ... etc
# then much later in the function

return {
    "foreign_identifier": foreign_identifier,
    # etc... repeat x3? x4?
}

The decorator approach feels like it gets rid of the boilerplate checks a bit and also automatically adds logging around it. It's definitely not as straightforward, so I can appreciate the sense you have about it being more complicated and harder to follow. As a pattern I feel like it might be an easier to manage approach, or at least one that ties into one of the suggestions @stacimc made of having the base class require methods like get_foreign_identifier and then allow the base class to check these things...

I could very well be over-engineering this though and am totally open to changing it if others are on the same page as you, Olga.
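For readers without the diff handy, the decorator pattern under discussion is roughly the following. This is a reconstruction based on the `@raise_if_empty` name visible in the diff, not the exact code from this PR:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def raise_if_empty(fn):
    # Wrap a field getter so an empty result fails loudly instead of
    # silently producing a partial record.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        value = fn(*args, **kwargs)
        if value is None or value == "":
            logger.warning("%s returned an empty value", fn.__name__)
            raise ValueError(f"{fn.__name__} returned an empty value")
        return value
    return wrapper


@raise_if_empty
def get_title(data: dict) -> str:
    return data.get("title", "")
```

As the reviewers note below the diff, raising halts ingestion entirely, which is why they suggest dropping this check for fields like title that can legitimately be empty.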

@stacimc
Contributor

stacimc commented Oct 27, 2022

Regarding the decorators -- I like them personally, but I understand having some reservations especially about consistency among providers and the possibility that it might be more confusing to implement for a newcomer to the project.

As a pattern I feel like it might be an easier to manage approach, or at least one that ties into one of the suggestions @stacimc made of having the base class require methods like get_foreign_identifier and then allow the base class to check these things...

This is exactly what I'm thinking -- I would love to use this approach in the base class as part of WordPress/openverse#1378. If the pattern needs to be re-implemented in each individual provider as we're doing currently, then I favor recommending the more straightforward old approach.

I'm personally fine with merging this into Europeana for now, though. How about making a suggestion on WordPress/openverse#1378 to extract the logic from Europeana when we work on that issue? I'll add some notes there.

Contributor

@stacimc stacimc left a comment

This works great! I'd prefer to remove the exception on empty titles but from looking at the API docs I don't believe it should be an issue for Europeana in the same way that it would for more user-generated content, so I don't think anything here should block merge. Really nice work, I love the additions to the base class 😍

Contributor

@AetherUnbound AetherUnbound left a comment

I think I'm with the general consensus here - I do appreciate the decorator approach, but doing so in just this provider could make it more confusing and less approachable for folks (especially since all of the other providers follow a different pattern). That said, I'd also be okay merging this as-is.

My only concern is there seems to be an output file that was erroneously committed!

Contributor

@AetherUnbound AetherUnbound left a comment

🚀 Awesome work!!
