This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
Add a Nappy provider DAG using ProviderDataIngester #796
Merged (+509 −15)
Changes from 10 commits
37 commits
6cb6d7f _-prefix methods that should not be overridden (stacimc)
a507cd7 Initial template (stacimc)
98fef3c Add initial docs (stacimc)
277ecab Update template, add test template file (stacimc)
6e5fd04 Add script to generate template files (stacimc)
a208913 Update docs to reference script (stacimc)
b073485 Moving more documentation into the code (stacimc)
eeaebfd Reformat docs (stacimc)
8f1ad6d Small tweaks (stacimc)
889d61d Remove unused 'license_url' from nappy and comment out unused test im… (zackkrida)
471c266 Remove unused 'license_url' from nappy and comment out unused test im… (zackkrida)
8794a0a Add width, height, and filesize (zackkrida)
ff3713c write small helper fn for filesizes (zackkrida)
35fd929 Add UA string header (zackkrida)
9815feb move thumbnail_url to metadata for now (zackkrida)
7e4c21d rename thumbnail_url metadata field to thumbnail (zackkrida)
aea5a20 Merge branch 'main' into nappy-provider-dag (rwidom)
6d82d3e add dag start date (rwidom)
5f2677d no header in next params & add thumbnail_url (rwidom)
43810a0 add tests and test resources (rwidom)
c38d1fe remove questionable tag from test image (rwidom)
b0685b3 update docs (rwidom)
896aac1 add popularity metrics to metadata (rwidom)
2b9ae7e Add url to source docs (rwidom)
e4d4138 remove template comment from next query params (rwidom)
9b89294 remove template comment on optional fields (rwidom)
1f01065 remove template comment on get batch (rwidom)
09ae681 remove template comment from main (rwidom)
46c037e remove template comment from get_record_data (rwidom)
26bff3b pass batch_limit to the API (rwidom)
f160dda tests for batch limit API parameter (rwidom)
751ad04 point to popularity metrics (rwidom)
5e02316 template test directory fix (rwidom)
3871691 make license info a class variable (rwidom)
c24d729 Remove outdated/duplicated template creation files (AetherUnbound)
a81d523 Update DAG documentation (AetherUnbound)
53eb8f9 fortify and test convert filesize (rwidom)
111 changes: 111 additions & 0 deletions
openverse_catalog/dags/providers/provider_api_scripts/nappy.py
@@ -0,0 +1,111 @@
"""
Content Provider: Nappy

ETL Process: Use the API to identify all CC0-licensed images.

Output: TSV file containing the image meta-data.

Notes: This API was written specially for Openverse.
       There are no known limits or restrictions.
"""
import logging

from common import constants
from common.licenses import get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester


logger = logging.getLogger(__name__)


class NappyDataIngester(ProviderDataIngester):
    providers = {"image": prov.NAPPY_DEFAULT_PROVIDER}
    endpoint = "https://api.nappy.co/v1/openverse/images"
    # TODO The following are set to their default values. Remove them if the defaults
    # are acceptable, or override them.
    delay = 1
    retries = 3
    headers = {"Accept": "application/json"}

    def get_next_query_params(self, prev_query_params: dict | None, **kwargs) -> dict:
        # On the first request, `prev_query_params` will be `None`. We can detect this
        # and return our default params.
        if not prev_query_params:
            return {
                "page": 1,
            }
        else:
            return {
                **prev_query_params,
                "page": prev_query_params["page"] + 1,
            }
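
To illustrate the paging behavior of `get_next_query_params`, here is a minimal standalone sketch. It copies the same branching into a free function (the name is reused only for readability) and calls it three times in a row; the extra `"extra"` key in the last check is a hypothetical parameter, included only to show that existing keys are carried forward.

```python
def get_next_query_params(prev_query_params):
    # Same branching as the method in the diff, written as a free function.
    if not prev_query_params:
        return {"page": 1}
    return {**prev_query_params, "page": prev_query_params["page"] + 1}


# Simulate three consecutive requests, starting with no previous params.
params = None
pages = []
for _ in range(3):
    params = get_next_query_params(params)
    pages.append(params["page"])

# Any other keys already present are preserved; only "page" is bumped.
carried = get_next_query_params({"page": 4, "extra": "x"})
```

Because the first call receives `None`, the page counter starts at 1 and increases by one on each later call.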

    def get_batch_data(self, response_json):
        # Takes the raw API response from calling `get` on the endpoint, and returns
        # the list of records to process.
        if response_json:
            return response_json.get("images")
        return None

    def get_should_continue(self, response_json):
        return bool(response_json.get("next_page"))
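
The interplay between `get_batch_data` and `get_should_continue` can be sketched with canned responses. This is only an illustration: the driver loop, the `fetch` helper, and the `FAKE_PAGES` payloads are hypothetical, and the real loop lives in `ProviderDataIngester`, which this diff does not show. The only payload keys taken from the diff are `images` and `next_page`; the type of the `next_page` value is assumed.

```python
def get_batch_data(response_json):
    # Mirrors the method in the diff: an empty or missing response yields no batch.
    if response_json:
        return response_json.get("images")
    return None


def get_should_continue(response_json):
    # Mirrors the method in the diff: stop once `next_page` is falsy.
    return bool(response_json.get("next_page"))


# Hypothetical canned responses keyed by page number.
FAKE_PAGES = {
    1: {"images": [{"title": "a"}, {"title": "b"}], "next_page": 2},
    2: {"images": [{"title": "c"}], "next_page": None},
}


def fetch(page):
    """Hypothetical stand-in for the HTTP request."""
    return FAKE_PAGES.get(page)


collected = []
page = 1
while True:
    response_json = fetch(page)
    batch = get_batch_data(response_json)
    if not batch:
        break
    collected.extend(batch)
    if not get_should_continue(response_json):
        break
    page += 1
```

Note that `get_should_continue` calls `.get` on the response without a `None` check, so in this sketch it is only reached after `get_batch_data` has returned a non-empty batch.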

    def get_media_type(self, record: dict):
        return constants.IMAGE

    def get_record_data(self, data: dict) -> dict | list[dict] | None:
        # Parse out the necessary info from the record data into a dictionary.
        if (foreign_landing_url := data.get("foreign_landing_url")) is None:
            return None

        if (image_url := data.get("url")) is None:
            return None

        # Hardcoded to CC0, the only license Nappy.co uses
        license_info = get_license_info(
            "https://creativecommons.org/publicdomain/zero/1.0/"
        )
Review comment: If all results are CC0, we should set this as a value on the class or instance and use it there rather than calling this function for every record!
Reply: Oops! Totally, yes.
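
A minimal sketch of what this suggestion could look like, under stated assumptions: `fake_get_license_info` is a hypothetical stand-in for `common.licenses.get_license_info` (its real return type is not shown in this diff), and `LicenseSketchIngester` is an illustrative class, not the code that was eventually merged. The counter exists only to show that the lookup runs once rather than once per record.

```python
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"
lookup_calls = 0


def fake_get_license_info(license_url):
    """Hypothetical stand-in that counts how often the lookup runs."""
    global lookup_calls
    lookup_calls += 1
    return {"license": "cc0", "version": "1.0", "url": license_url}


class LicenseSketchIngester:
    # Resolved once, when the class body is evaluated, and shared by every record.
    license_info = fake_get_license_info(CC0_URL)

    def get_record_data(self, data):
        if (foreign_landing_url := data.get("foreign_landing_url")) is None:
            return None
        if (image_url := data.get("url")) is None:
            return None
        return {
            "foreign_landing_url": foreign_landing_url,
            "image_url": image_url,
            "license_info": self.license_info,
        }


ingester = LicenseSketchIngester()
records = [
    ingester.get_record_data(
        {
            "foreign_landing_url": f"https://example.com/photo/{i}",
            "url": f"https://example.com/photo/{i}.jpg",
        }
    )
    for i in range(3)
]
```

With the value stored on the class, every record reuses the same `license_info` object instead of repeating the lookup.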

        if license_info is None:
            return None
        # OPTIONAL FIELDS
        # Obtain as many optional fields as possible.
        foreign_identifier = data.get("foreign_identifier")
        thumbnail_url = data.get("url") + "?auto=format&w=600&q=75"
        filesize = data.get("filesize")
        filetype = data.get("filetype")
        creator = data.get("creator")
        creator_url = data.get("creator_url")
        title = data.get("title")
        meta_data = data.get("meta_data")
        raw_tags = data.get("tags").split(",")

        return {
            "foreign_landing_url": foreign_landing_url,
            "image_url": image_url,
            "license_info": license_info,
            "foreign_identifier": foreign_identifier,
            "thumbnail_url": thumbnail_url,
            "filesize": filesize,
            "filetype": filetype,
            "creator": creator,
            "creator_url": creator_url,
            "title": title,
            "meta_data": meta_data,
            "raw_tags": raw_tags,
        }
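
One thing worth noting about the optional fields: `data.get("tags").split(",")` raises `AttributeError` when a record has no `tags` key, because `.get` returns `None`. A possible guard is sketched below. `parse_optional_fields` is a hypothetical helper that is not part of this PR, and trimming whitespace and dropping empty tags is an added design choice rather than the behavior in the diff. The thumbnail suffix is the one used in the diff.

```python
THUMBNAIL_QUERY = "?auto=format&w=600&q=75"  # same suffix as in the diff


def parse_optional_fields(data):
    """Hypothetical helper: build thumbnail_url and raw_tags defensively."""
    # The caller is assumed to have already verified that "url" is present.
    image_url = data["url"]
    tags = data.get("tags")
    # Only split when a non-empty tag string exists; otherwise report no tags.
    raw_tags = (
        [tag.strip() for tag in tags.split(",") if tag.strip()] if tags else None
    )
    return {
        "thumbnail_url": image_url + THUMBNAIL_QUERY,
        "raw_tags": raw_tags,
    }


with_tags = parse_optional_fields(
    {"url": "https://example.com/a.jpg", "tags": "sun, beach,,"}
)
without_tags = parse_optional_fields({"url": "https://example.com/b.jpg"})
```

Returning `None` for missing tags keeps the record usable while leaving the required-field checks earlier in `get_record_data` unchanged.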


def main():
    # Allows running ingestion from the CLI without Airflow running for debugging
    # purposes.
    logger.info("Begin: Nappy data ingestion")
    ingester = NappyDataIngester()
    ingester.ingest_records()


if __name__ == "__main__":
    main()
I sorted this list alphabetically, but can revert if desired.