This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add a Nappy provider DAG using ProviderDataIngester #796

Merged
merged 37 commits into from
Jan 16, 2023
6cb6d7f
_-prefix methods that should not be overridden
stacimc Oct 10, 2022
a507cd7
Initial template
stacimc Oct 10, 2022
98fef3c
Add initial docs
stacimc Oct 10, 2022
277ecab
Update template, add test template file
stacimc Oct 11, 2022
6e5fd04
Add script to generate template files
stacimc Oct 11, 2022
a208913
Update docs to reference script
stacimc Oct 11, 2022
b073485
Moving more documentation into the code
stacimc Oct 11, 2022
eeaebfd
Reformat docs
stacimc Oct 13, 2022
8f1ad6d
Small tweaks
stacimc Oct 13, 2022
889d61d
Remove unused 'license_url' from nappy and comment out unused test im…
zackkrida Oct 14, 2022
471c266
Remove unused 'license_url' from nappy and comment out unused test im…
zackkrida Oct 14, 2022
8794a0a
Add width, height, and filesize
zackkrida Oct 14, 2022
ff3713c
write small helper fn for filesizes
zackkrida Oct 14, 2022
35fd929
Add UA string header
zackkrida Oct 14, 2022
9815feb
move thumbnail_url to metadata for now
zackkrida Oct 14, 2022
7e4c21d
rename thumbnail_url metadata field to thumbnail
zackkrida Oct 14, 2022
aea5a20
Merge branch 'main' into nappy-provider-dag
rwidom Dec 28, 2022
6d82d3e
add dag start date
rwidom Dec 29, 2022
5f2677d
no header in next params & add thumbnail_url
rwidom Dec 29, 2022
43810a0
add tests and test resources
rwidom Dec 29, 2022
c38d1fe
remove questionable tag from test image
rwidom Dec 29, 2022
b0685b3
update docs
rwidom Dec 30, 2022
896aac1
add popularity metrics to metadata
rwidom Dec 30, 2022
2b9ae7e
Add url to source docs
rwidom Jan 2, 2023
e4d4138
remove template comment from next query params
rwidom Jan 2, 2023
9b89294
remove template comment on optional fields
rwidom Jan 2, 2023
1f01065
remove template comment on get batch
rwidom Jan 2, 2023
09ae681
remove template comment from main
rwidom Jan 2, 2023
46c037e
remove template comment from get_record_data
rwidom Jan 2, 2023
26bff3b
pass batch_limit to the API
rwidom Jan 2, 2023
f160dda
tests for batch limit API parameter
rwidom Jan 2, 2023
751ad04
point to popularity metrics
rwidom Jan 2, 2023
5e02316
template test directory fix
rwidom Jan 2, 2023
3871691
make license info a class variable
rwidom Jan 2, 2023
c24d729
Remove outdated/duplicated template creation files
AetherUnbound Jan 2, 2023
a81d523
Update DAG documentation
AetherUnbound Jan 2, 2023
53eb8f9
fortify and test convert filesize
rwidom Jan 3, 2023
13 changes: 13 additions & 0 deletions DAGs.md
@@ -67,6 +67,7 @@ The following are DAGs grouped by their primary tag:
| [`jamendo_workflow`](#jamendo_workflow) | `@monthly` | `False` | audio |
| [`metropolitan_museum_workflow`](#metropolitan_museum_workflow) | `@daily` | `True` | image |
| `museum_victoria_workflow` | `@monthly` | `False` | image |
| [`nappy_workflow`](#nappy_workflow) | `@monthly` | `False` | image |
| `nypl_workflow` | `@monthly` | `False` | image |
| [`phylopic_workflow`](#phylopic_workflow) | `@daily` | `True` | image |
| [`rawpixel_workflow`](#rawpixel_workflow) | `@monthly` | `False` | image |
@@ -105,6 +106,7 @@ The following is documentation associated with each DAG (where available):
1. [`jamendo_workflow`](#jamendo_workflow)
1. [`metropolitan_museum_reingestion_workflow`](#metropolitan_museum_reingestion_workflow)
1. [`metropolitan_museum_workflow`](#metropolitan_museum_workflow)
1. [`nappy_workflow`](#nappy_workflow)
1. [`oauth2_authorization`](#oauth2_authorization)
1. [`oauth2_token_refresh`](#oauth2_token_refresh)
1. [`phylopic_reingestion_workflow`](#phylopic_reingestion_workflow)
@@ -379,6 +381,17 @@ blocking during local development testing.
connect with just date and license.
https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07

## `nappy_workflow`

Content Provider: Nappy

ETL Process: Use the API to identify all CC0-licensed images.

Output: TSV file containing the image metadata.

Notes: This API was written specifically for Openverse. There are no known limits
or restrictions. https://nappy.co/

## `oauth2_authorization`

### OAuth Provider Authorization
5 changes: 3 additions & 2 deletions docker/local_postgres/0004_openledger_image_view.sql
@@ -12,9 +12,10 @@ INSERT INTO public.image_popularity_metrics (
provider, metric, percentile
) VALUES
  ('flickr', 'views', 0.85),
- ('wikimedia', 'global_usage_count', 0.85),
+ ('nappy', 'downloads', 0.85),
+ ('rawpixel', 'download_count', 0.85),
  ('stocksnap', 'downloads_raw', 0.85),
- ('rawpixel', 'download_count', 0.85)
+ ('wikimedia', 'global_usage_count', 0.85)
;


24 changes: 13 additions & 11 deletions openverse_catalog/dags/common/loader/provider_details.py
@@ -13,26 +13,27 @@


# Default provider names
Review comment (Member Author): I sorted this list alphabetically, but can revert if desired.

- FLICKR_DEFAULT_PROVIDER = "flickr"
- EUROPEANA_DEFAULT_PROVIDER = "europeana"
- WIKIMEDIA_AUDIO_PROVIDER = "wikimedia_audio"
- WIKIMEDIA_DEFAULT_PROVIDER = "wikimedia"
- SMITHSONIAN_DEFAULT_PROVIDER = "smithsonian"
  BROOKLYN_DEFAULT_PROVIDER = "brooklynmuseum"
  CLEVELAND_DEFAULT_PROVIDER = "clevelandmuseum"
+ EUROPEANA_DEFAULT_PROVIDER = "europeana"
+ FINNISH_DEFAULT_PROVIDER = "finnishmuseums"
+ FLICKR_DEFAULT_PROVIDER = "flickr"
+ FREESOUND_DEFAULT_PROVIDER = "freesound"
+ INATURALIST_DEFAULT_PROVIDER = "inaturalist"
+ JAMENDO_DEFAULT_PROVIDER = "jamendo"
  METROPOLITAN_MUSEUM_DEFAULT_PROVIDER = "met"
- VICTORIA_DEFAULT_PROVIDER = "museumsvictoria"
+ NAPPY_DEFAULT_PROVIDER = "nappy"
  NYPL_DEFAULT_PROVIDER = "nypl"
  RAWPIXEL_DEFAULT_PROVIDER = "rawpixel"
  SCIENCE_DEFAULT_PROVIDER = "sciencemuseum"
+ SMITHSONIAN_DEFAULT_PROVIDER = "smithsonian"
  SMK_DEFAULT_PROVIDER = "smk"
- WALTERS_DEFAULT_PROVIDER = "waltersartmuseum"
- FINNISH_DEFAULT_PROVIDER = "finnishmuseums"
- JAMENDO_DEFAULT_PROVIDER = "jamendo"
  STOCKSNAP_DEFAULT_PROVIDER = "stocksnap"
+ VICTORIA_DEFAULT_PROVIDER = "museumsvictoria"
+ WALTERS_DEFAULT_PROVIDER = "waltersartmuseum"
+ WIKIMEDIA_AUDIO_PROVIDER = "wikimedia_audio"
+ WIKIMEDIA_DEFAULT_PROVIDER = "wikimedia"
  WORDPRESS_DEFAULT_PROVIDER = "wordpress"
- FREESOUND_DEFAULT_PROVIDER = "freesound"
- INATURALIST_DEFAULT_PROVIDER = "inaturalist"
  PHYLOPIC_DEFAULT_PROVIDER = "phylopic"

# Finnish parameters
@@ -138,6 +139,7 @@ class ImageCategory(Enum):
"mccordmuseum": ImageCategory.DIGITIZED_ARTWORK.value,
"met": ImageCategory.DIGITIZED_ARTWORK.value,
"museumsvictoria": ImageCategory.DIGITIZED_ARTWORK.value,
"nappy": ImageCategory.PHOTOGRAPH.value,
"phylopic": ImageCategory.ILLUSTRATION.value,
"rijksmuseum": ImageCategory.DIGITIZED_ARTWORK.value,
"sciencemuseum": ImageCategory.PHOTOGRAPH.value,
5 changes: 3 additions & 2 deletions openverse_catalog/dags/common/popularity/sql.py
@@ -43,9 +43,10 @@

IMAGE_POPULARITY_METRICS = {
  "flickr": {"metric": "views"},
- "wikimedia": {"metric": "global_usage_count"},
- "stocksnap": {"metric": "downloads_raw"},
+ "nappy": {"metric": "downloads"},
  "rawpixel": {"metric": "download_count"},
+ "stocksnap": {"metric": "downloads_raw"},
+ "wikimedia": {"metric": "global_usage_count"},
}

AUDIO_POPULARITY_METRICS = {
121 changes: 121 additions & 0 deletions openverse_catalog/dags/providers/provider_api_scripts/nappy.py
@@ -0,0 +1,121 @@
"""
Content Provider: Nappy

ETL Process: Use the API to identify all CC0-licensed images.

Output: TSV file containing the image metadata.

Notes: This API was written specifically for Openverse.
There are no known limits or restrictions.
https://nappy.co/

"""
import logging

from common import constants
from common.licenses import get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester


logger = logging.getLogger(__name__)


class NappyDataIngester(ProviderDataIngester):
    providers = {constants.IMAGE: prov.NAPPY_DEFAULT_PROVIDER}
    endpoint = "https://api.nappy.co/v1/openverse/images"
    headers = {"User-Agent": prov.UA_STRING, "Accept": "application/json"}

    # Hardcoded to CC0, the only license Nappy.co uses
    license_info = get_license_info(
        "https://creativecommons.org/publicdomain/zero/1.0/"
    )

    def get_next_query_params(self, prev_query_params: dict | None, **kwargs) -> dict:
        if not prev_query_params:
            return {
                "page": 1,
                "per_page": self.batch_limit,
            }
        else:
            return {
                **prev_query_params,
                "page": prev_query_params["page"] + 1,
            }

    def get_batch_data(self, response_json):
        if response_json:
            return response_json.get("images")
        return None

    def get_should_continue(self, response_json):
        return bool(response_json.get("next_page"))

    def get_media_type(self, record: dict):
        return constants.IMAGE

    @staticmethod
    def _convert_filesize(raw_filesize_string: str) -> int | None:
        """
        Convert sizes from strings to byte integers, e.g. "187.8kB" to 187800.
        Returns None when the string cannot be parsed.
        """
        FILETYPE_MULTIPLIERS = {"kB": 1000, "MB": 1_000_000, "GB": 1_000_000_000}
        if isinstance(raw_filesize_string, str) and len(raw_filesize_string) > 2:
            stripped = raw_filesize_string.strip()
            if stripped[-2:] in FILETYPE_MULTIPLIERS:
                try:
                    units = float(stripped[:-2])
                except ValueError:
                    return None
                multiplier = FILETYPE_MULTIPLIERS[stripped[-2:]]
                return round(units * multiplier)

    def get_record_data(self, data: dict) -> dict | list[dict] | None:
        if (foreign_landing_url := data.get("foreign_landing_url")) is None:
            return None

        if (image_url := data.get("url")) is None:
            return None

        foreign_identifier = data.get("foreign_identifier")
        thumbnail_url = data.get("url") + "?auto=format&w=600&q=75"
        filesize = self._convert_filesize(data.get("filesize"))
        filetype = data.get("filetype")
        creator = data.get("creator")
        creator_url = data.get("creator_url")
        title = data.get("title")
        meta_data = {
            "views": data.get("views"),
            "saves": data.get("saves"),
            "downloads": data.get("downloads"),
        }

Review comment (Contributor, on lines +87 to +91): Thanks for adding these! We'll also want to add downloads to the DDL and the image popularity metrics test!

        raw_tags = data.get("tags").split(",")
        width = data.get("width")
        height = data.get("height")

        return {
            "foreign_landing_url": foreign_landing_url,
            "image_url": image_url,
            "thumbnail_url": thumbnail_url,
            "license_info": self.license_info,
            "foreign_identifier": foreign_identifier,
            "filesize": filesize,
            "filetype": filetype,
            "creator": creator,
            "creator_url": creator_url,
            "title": title,
            "meta_data": meta_data,
            "raw_tags": raw_tags,
            "width": width,
            "height": height,
        }


def main():
    logger.info("Begin: Nappy data ingestion")
    ingester = NappyDataIngester()
    ingester.ingest_records()


if __name__ == "__main__":
    main()
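
For reference, the file-size conversion in `_convert_filesize` can be tried outside the ingester. The following is a minimal standalone sketch, assuming the same decimal multipliers (kB, MB, GB) as the diff. The function name `convert_filesize` and the guard against non-finite values are illustrative additions and not part of this PR.

```python
from __future__ import annotations

import math

# Decimal (SI) multipliers, matching the mapping used in the diff.
FILESIZE_MULTIPLIERS = {"kB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}


def convert_filesize(raw: object) -> int | None:
    """Return the size in bytes, or None if the input cannot be parsed."""
    if not isinstance(raw, str):
        return None
    stripped = raw.strip()
    number, unit = stripped[:-2], stripped[-2:]
    if unit not in FILESIZE_MULTIPLIERS:
        return None
    try:
        value = float(number)
    except ValueError:
        return None
    # Hypothetical extra guard: float() accepts "nan" and "inf",
    # which round() cannot turn into an integer.
    if not math.isfinite(value):
        return None
    return round(value * FILESIZE_MULTIPLIERS[unit])


if __name__ == "__main__":
    for sample in ("187.8kB", "1.5MB", "kB", None):
        print(repr(sample), "->", convert_filesize(sample))
```

Under these assumptions, "187.8kB" maps to 187800 bytes, and unparseable input maps to `None` rather than raising.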
5 changes: 5 additions & 0 deletions openverse_catalog/dags/providers/provider_workflows.py
@@ -13,6 +13,7 @@
from providers.provider_api_scripts.jamendo import JamendoDataIngester
from providers.provider_api_scripts.metropolitan_museum import MetMuseumDataIngester
from providers.provider_api_scripts.museum_victoria import VictoriaDataIngester
from providers.provider_api_scripts.nappy import NappyDataIngester
from providers.provider_api_scripts.nypl import NyplDataIngester
from providers.provider_api_scripts.phylopic import PhylopicDataIngester
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester
@@ -160,6 +161,10 @@ def __post_init__(self):
ingester_class=VictoriaDataIngester,
start_date=datetime(2020, 1, 1),
),
ProviderWorkflow(
ingester_class=NappyDataIngester,
start_date=datetime(2022, 12, 1),
),
ProviderWorkflow(
ingester_class=NyplDataIngester,
start_date=datetime(2020, 1, 1),
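
The paging contract implemented by `get_next_query_params`, `get_batch_data`, and `get_should_continue` in nappy.py can be exercised without Airflow. The sketch below is a hypothetical driver loop, not the actual `ProviderDataIngester.ingest_records` implementation. `next_query_params`, `iter_images`, and `fake_fetch` are invented names, and the response shape (`images`, `next_page`) is assumed from the keys the diff reads.

```python
from __future__ import annotations

from typing import Callable, Iterator


def next_query_params(prev: dict | None, batch_limit: int = 100) -> dict:
    """Start at page 1, then advance one page while keeping per_page."""
    if not prev:
        return {"page": 1, "per_page": batch_limit}
    return {**prev, "page": prev["page"] + 1}


def iter_images(
    fetch: Callable[[dict], dict | None], batch_limit: int = 100
) -> Iterator[dict]:
    """Yield records page by page until the response has no next_page."""
    params = None
    while True:
        params = next_query_params(params, batch_limit)
        response = fetch(params)
        if not response:
            break
        yield from response.get("images") or []
        if not response.get("next_page"):
            break


def fake_fetch(params: dict) -> dict:
    """Stand-in for the HTTP call: two pages of made-up records."""
    pages = {1: ["a", "b"], 2: ["c"]}
    page = params["page"]
    return {
        "images": [{"title": title} for title in pages.get(page, [])],
        "next_page": page + 1 if page < 2 else None,
    }


if __name__ == "__main__":
    print([image["title"] for image in iter_images(fake_fetch, batch_limit=2)])
```

Because `per_page` is set only on the first request and carried forward by the dict spread, the batch size stays constant across pages.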