-
Notifications
You must be signed in to change notification settings - Fork 54
Refactor Europeana to use ProviderDataIngester
#821
Conversation
logger = logging.getLogger(__name__) | ||
logging.getLogger(common.urls.__name__).setLevel(logging.WARNING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set the urls
logging to WARNING in all provider scripts?
I think you need to update the Since Europeana is dated, we'll also need to add it to the |
ProviderDataIngester
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for your additional contributions to the base class & testing mocks! I have a few small comments. I think the big required changes are the ones Staci has mentioned 🙂
etc.) If a provider only supports a single media type, this method defaults | ||
to returning the only media type defined in the ``providers`` attribute. | ||
""" | ||
pass | ||
if len(self.providers) == 1: | ||
return list(self.providers.keys())[0] | ||
|
||
raise NotImplementedError( | ||
"Provider scripts that support multiple media types " | ||
"must provide an override for ``get_media_type``." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea!
"image_url": data.get("edmIsShownBy")[0], | ||
"foreign_identifier": data.get("id"), | ||
"meta_data": self.get_meta_data_dict(data), | ||
"title": data.get("title")[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both of these get(<x>)[0]
calls, it's possible get
could return None
, which would cause the indexing into None
to raise an error. Could we add defaults or extra checking steps for those two fields here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this when writing it, but if either title
or image_url
are missing for something, wouldn't we want it to fail out? Is there a different way to fail it out that works better with Airflow's expectations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd want to skip this record in that case, but not necessarily fail the entire ingestion run. A lot of our providers operate under the assumption that attributes missing from a particular record, even if necessary, would just be ignored and processing would continue. I feel like that might be the best approach here: log that something which was expected to be present was missing, then return None
from the get_record_data
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right on, makes sense! I wonder if we'd want some kind of "incomplete record store" or something, but a discussion for the a different day 🙂
Do you happen to know the complete list of fields that are required for a particular record to be valid? Maybe this is something the base class could handle instead by verifying that all required fields are truthy (not None
and != ""
, I suppose)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@obulat responded well to this in another comment on this PR, and I added a note there too about recently added documentation 😄
openverse_catalog/dags/providers/provider_api_scripts/europeana.py
Outdated
Show resolved
Hide resolved
if response_json.get("success") != "True": | ||
logger.warning('Request failed with ``success = "False"``') | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic might be better suited in get_batch_data
, since get_should_continue
is typically concerned with the token/cursor iteration rather than error handling. @stacimc what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you suggest should change here, specifically? Would get_batch_data
just return an empty list in this case?
I don't personally see a big difference here... but I like that all the "is this response usable and can we move forward" is encapsulated in a single method rather than spread across two methods. But there may be an issue with making sure these methods are called in an order that makes sense for each check. If "success" is False then we can't process the batch at all (in addition to not continuing, to maintain parity behaviour). 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would suggest moving L158 and L159 here to get_batch_data
, and add a return None
within the condition rather than return False
(although False
would work too because the check that follows is falsy anyways). Here are some examples from other providers:
My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data
should return None
. The get_should_continue
mechanism is for more peculiar cases (like this one, or Wikimedia) where the API additionally provides some token or other mechanism which tells us whether we should continue processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My interpretation of these methods is that if we try to get the batch and there's an error, there's functionally no batch to process (as you mention) and get_batch_data should return None
Interesting! This makes sense, but then my instinct is that get_batch_data
should raise an error in these cases that is caught and handled differently. Returning None
seems ambiguous to me (though this is just my first time working on a provider script, so take that with a grain of salt).
I'll make the changes you suggested, thanks, Madison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context on this, these scripts have historically operated in a manner that prioritizes gathering available records and preventing errors as much as possible. This was because (previously) an error condition meant that however x many hours of processing that had already occurred had to be done again if we wanted to re-run a given day. We'd still get all the data we had gotten so far, but we hadn't had a mechanism until #650 (and with further improvement coming via #702) to pick up at a specific set of query params.
I think maintaining that resiliency within the scripts can be a good thing! However, particularly as we finally get everything back up and running we can start operating more under the principal of letting things break, because that tells us our assumptions about the API responses were wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic 🥳 I like your handling of the cursor and the RecordBuilder. The default implementation of get_media_type
is 💯 as well!
After updating the ProviderWorkflow
configuration for Europeana, I was able to test this locally and compare it to the output of the legacy implementation. Everything looked good except for missing description
in the metadata, which I mentioned in one of the comments. If you want to reproduce that, I was running the DAG for 2022-11-03.
This doesn't necessarily need to change in this PR but a note for testers: the start date for Europeana is 2011-09-01
which means it starts backfilling from 2011 and does not get any data. (My fault, I set overly cautious start dates awhile back.) I adjusted this locally to a more reasonable 2020-11-01
, which starts the backfill from around the time the script was last ran.
openverse_catalog/dags/providers/provider_api_scripts/europeana.py
Outdated
Show resolved
Hide resolved
) | ||
def get_next_query_params(self, prev_query_params) -> dict: | ||
if not prev_query_params: | ||
return self.base_request_body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From looking at the original code, do you not need "cursor": "*"
in the initial request? It seems like it's working fine as is, though 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added back in 1ea1a50
def get_record_data(self, data: dict) -> dict: | ||
record = { | ||
"foreign_landing_url": self.get_foreign_landing_url(data), | ||
"image_url": data.get("edmIsShownBy")[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for this (or foreign_landing_url
, foreign_id
, or license_info
) to be None
? We typically get the required fields first and return None
early from get_record_data
if they cannot be extracted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those fields allows to be None? 😮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other scripts, if a required value is None, we stop the current item's step early by using either return None
or continue
/break
inside a loop. Here is an example:
openverse-catalog/openverse_catalog/dags/providers/provider_api_scripts/cleveland_museum.py
Lines 46 to 48 in 1506d6d
foreign_id = data.get("id") | |
if foreign_id is None: | |
return None |
or
openverse-catalog/openverse_catalog/dags/providers/provider_api_scripts/brooklyn_museum.py
Lines 117 to 120 in 1506d6d
for image in image_info: | |
foreign_id = image.get("id") | |
if foreign_id is None: | |
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love a list of the required fields, maybe on the base class, to reference at the end of the get_record_data
method... or potentially as something that the base class itself checks. With that list we could iterate through the key/values and if any required fields are empty (None or "") we could debug log information about the record and skip it. These kinds of checks seem like boilerplate and distributed, undocumented knowledge that the base class would be perfect for.
@stacimc since you're the original architect here, is that something that would make sense to add? Anything to keep in mind if we do so or any good reasons not to that should be documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of required fields is in the ImageStore.add_item
method's docs (the same is available in the AudioStore):
openverse-catalog/openverse_catalog/dags/common/storage/image.py
Lines 67 to 77 in 1506d6d
Required Arguments: | |
foreign_landing_url: URL of page where the image lives on the | |
source website. | |
image_url: Direct link to the image file | |
license_info: LicenseInfo object that has | |
- the URL of the license for the image, | |
- string representation of the license, | |
- version of the license, | |
- raw license URL that was by provider, | |
if different from canonical URL |
Do you think there's a better place for this list?
The checks are boilerplate, but not boilerplate enough to extract them because all API responses are different, and the required pieces of information can be located in different stages of the script. The idea behind returning None
is to fail early and skip the unnecessary work of extracting the metadata (which can be quite tedious for some API responses) if the item does not have the required fields and will be discarded anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Staci also recently added documentation around this exact thing as part of #790, which can be found here: https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/docs/data_models.md
It's also a goal of the Data Normalization project (I believe) to improve on this understanding and ensure that we have documentation around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@obulat's summary is exactly right -- these fields being None shouldn't cause errors. I believe the MediaStore should successfully discard them (if I'm wrong and that's not the case I agree we should add this behavior in the Data normalization project). The point of the None checks is mostly to halt processing of a record early if it's just going to be discarded anyway, as a general best practice.
We could update the base class to, instead of having just an abstract get_record_data
, have abstract get_required_record_data
and get_optional_record_data
methods. Or even individual abstract methods for fetching each required field (get_license_info
, get_foreign_identifier
...). Then we could make get_record_data
on the base class do the None checks for us.
I'll make an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#834
openverse_catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Show resolved
Hide resolved
Also update record builder tests to not call individual methods, removing the assumption that the record builder methods operate individually. Making all methods other than `get_record_data` on the record builder class solidifies this.
|
||
def _get_license_url(self, license_field) -> str | None: | ||
if len(license_field) > 1: | ||
logger.warning("More than one license field found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it should be done in this PR, but I think this log can be downgraded to debug
.
Co-authored-by: Madison Swain-Bowden <[email protected]>
@stacimc @AetherUnbound @obulat please re-review at your leisure. I've added code to short circuit record building when certain fields are empty. |
image_list, next_cursor, total_number_of_images = _get_image_list( | ||
start_timestamp, end_timestamp, cursor | ||
) | ||
def get_record_data(self, data: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_record_data(self, data: dict) -> dict: | |
def get_record_data(self, data: dict) -> dict | None: |
def _get_foreign_identifier(self, data: dict) -> str | None: | ||
return data.get("id") | ||
|
||
@raise_if_empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A title is not required for a media item, some of them have ""
as the title. This might not be needed for Europeana, but is necessary for some providers who have user-provided items (such as Flickr).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- I'm not sure if it's absolutely necessary to remove this check for Europeana or if we can assume all records will have a title, but since this will halt ingestion entirely if it raises I'd prefer to remove this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving because I wouldn't want to block a working script, but I'm not sure I like the current implementation of short-circuiting.
I don't know if it's just because I'm more used to the old way, but it seems like the checks for None are much shorter than the decorated functions that replace them here. Overall, it seems more difficult to read than the old 'check for None and return early' way.
RESOURCES = os.path.join( | ||
os.path.abspath(os.path.dirname(__file__)), "resources/europeana" | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RESOURCES = os.path.join( | |
os.path.abspath(os.path.dirname(__file__)), "resources/europeana" | |
) | |
RESOURCES = Path(__file__).parent / "resources/europeana" | |
@@ -1,22 +1,18 @@ | |||
import json | |||
import logging | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import os | |
from pathlib import Path |
format="%(asctime)s - %(name)s - %(levelname)s: %(message)s", | ||
level=logging.DEBUG, | ||
) | ||
|
||
|
||
def _get_resource_json(json_name): | ||
with open(os.path.join(RESOURCES, json_name)) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with open(os.path.join(RESOURCES, json_name)) as f: | |
with open(RESOURCES / json_name) as f: |
That's fair. I guess I was trying to avoid doing this run around (just a psuedo-code kind of example): foreign_identifier = data.get("guid")
if not foreign_identifier:
logger.warning(f"foreign_identifier was empty: {foreign_identifier}")
return None
# ... etc
# then much later in the function
return {
"foreign_identifier": foreign_identifier,
# etc... repeat x3? x4?
} The decorator approach feels like it gets rid of the boilerplate checks a bit and also automatically adds logging around it. It's definitely not as straightforward, so I can appreciate the sense you have about it being more complicated and harder to follow. As a pattern I feel like it might be an easier to manage approach, or at least one that ties into one of the suggestions @stacimc made of having the base class require methods like I could very well be over-engineering this though and am totally open to changing it if others are on the same page as you, Olga. |
Regarding the decorators -- I like them personally, but I understand having some reservations especially about consistency among providers and the possibility that it might be more confusing to implement for a newcomer to the project.
This is exactly what I'm thinking -- I would love to use this approach in the base class as part of WordPress/openverse#1378. If the pattern needs to be re-implemented in each individual provider as we're doing currently, then I favor recommending the more straightforward old approach. I'm personally fine with merging this into Europeana for now, though. How about making a suggestion on WordPress/openverse#1378 to extract the logic from Europeana when we work on that issue? I'll add some notes there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works great! I'd prefer to remove the exception on empty titles but from looking at the API docs I don't believe it should be an issue for Europeana in the same way that it would for more user-generated content, so I don't think anything here should block merge. Really nice work, I love the additions to the base class 😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm with the general consensus here - I do appreciate the decorator approach, but doing so in just this provider could make it more confusing and less approachable for folks (especially since all of the other providers follow a different pattern). That said, I'd also be okay merging this as-is.
My only concern is there seems to be an output file that was erroneously committed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 Awesome work!!
Fixes
Fixes WordPress/openverse#1523 by @stacimc
Description
get_media_type
to the base class to remove boilerplate for providers that only have a single media type. This seemed like an easy improvement to me but there may be a reason why this isn't a good default implementation.Testing Instructions
Checkout the unit tests. Many of them had to be removed and new ones added because they were testing functionality that doesn't exist anymore (is abstracted away) or had to test new functionality. The existing tests that pertained to the data transformation I tried to keep as close as possible to the original tests. I caught some bugs in my refactor this way, which was very helpful! I did not add any new data transformation tests as the existing ones seemed comprehensive enough to me.
Finally, run the DAG. Follow the instructions in the README to run Airflow locally and then run the DAG. You can run it without an API key for testing. Currently this doesn't work though, at least not for me locally, due to an issue with the filename generation task. cc @stacimc for help debugging this. Here's the log output from the task:
I haven't dug into this yet but it seems like probably a configuration issue on my end.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin