This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add a Nappy provider DAG using ProviderDataIngester (#796)
* _-prefix methods that should not be overridden

* Initial template

* Add initial docs

* Update template, add test template file

* Add script to generate template files

* Update docs to reference script

* Moving more documentation into the code

* Reformat docs

- Breaks out into several files
- Removes documentation that is redundant (copied from code)
- Prefers documentation within the template
- Explicitly documents advanced options as FAQ
- Some small updates to the templating

* Small tweaks

* Remove unused 'license_url' from nappy and comment out unused test imports

* Remove unused 'license_url' from nappy and comment out unused test imports

* write small helper fn for filesizes

* Add UA string header

* move thumbnail_url to metadata for now

* rename thumbnail_url metadata field to thumbnail

* add dag start date

* no header in next params & add thumbnail_url

* add tests and test resources

* remove questionable tag from test image

* update docs

* add popularity metrics to metadata

* Add url to source docs

Co-authored-by: Madison Swain-Bowden <[email protected]>

* remove template comment from next query params

Co-authored-by: Madison Swain-Bowden <[email protected]>

* remove template comment on optional fields

Co-authored-by: Madison Swain-Bowden <[email protected]>

* remove template comment on get batch

Co-authored-by: Madison Swain-Bowden <[email protected]>

* remove template comment from main

Co-authored-by: Madison Swain-Bowden <[email protected]>

* remove template comment from get_record_data

Co-authored-by: Madison Swain-Bowden <[email protected]>

* pass batch_limit to the API

Co-authored-by: Madison Swain-Bowden <[email protected]>

* tests for batch limit API parameter

* point to popularity metrics

* template test directory fix

* make license info a class variable

* Remove outdated/duplicated template creation files

* Update DAG documentation

* fortify and test convert filesize

Co-authored-by: Staci Cooper <[email protected]>
Co-authored-by: rwidom <[email protected]>
Co-authored-by: rwidom <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
5 people authored Jan 16, 2023
1 parent 2a0a1c3 commit 709a466
Showing 9 changed files with 509 additions and 15 deletions.
13 changes: 13 additions & 0 deletions DAGs.md
@@ -67,6 +67,7 @@ The following are DAGs grouped by their primary tag:
| [`jamendo_workflow`](#jamendo_workflow) | `@monthly` | `False` | audio |
| [`metropolitan_museum_workflow`](#metropolitan_museum_workflow) | `@daily` | `True` | image |
| `museum_victoria_workflow` | `@monthly` | `False` | image |
| [`nappy_workflow`](#nappy_workflow) | `@monthly` | `False` | image |
| `nypl_workflow` | `@monthly` | `False` | image |
| [`phylopic_workflow`](#phylopic_workflow) | `@daily` | `True` | image |
| [`rawpixel_workflow`](#rawpixel_workflow) | `@monthly` | `False` | image |
@@ -105,6 +106,7 @@ The following is documentation associated with each DAG (where available):
1. [`jamendo_workflow`](#jamendo_workflow)
1. [`metropolitan_museum_reingestion_workflow`](#metropolitan_museum_reingestion_workflow)
1. [`metropolitan_museum_workflow`](#metropolitan_museum_workflow)
1. [`nappy_workflow`](#nappy_workflow)
1. [`oauth2_authorization`](#oauth2_authorization)
1. [`oauth2_token_refresh`](#oauth2_token_refresh)
1. [`phylopic_reingestion_workflow`](#phylopic_reingestion_workflow)
@@ -376,6 +378,17 @@ blocking during local development testing.
connect with just date and license.
https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07

## `nappy_workflow`

Content Provider: Nappy

ETL Process: Use the API to identify all CC0-licensed images.

Output: TSV file containing the image meta-data.

Notes: This API was written specially for Openverse. There are no known limits
or restrictions. https://nappy.co/
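
The ETL process described above pages through the API until no further page is reported. The following is a minimal sketch of that loop, assuming the `page`/`per_page` parameters and the `images`/`next_page` response fields used by the ingester in this commit. `fake_api` and `collect_all` are hypothetical stand-ins written for illustration; they make no network calls and are not part of the catalog code.

```python
def get_next_query_params(prev_query_params, batch_limit=100):
    """Return params for the first page, or advance the page by one."""
    if not prev_query_params:
        return {"page": 1, "per_page": batch_limit}
    return {**prev_query_params, "page": prev_query_params["page"] + 1}


def fake_api(params, total_pages=3):
    """Hypothetical stand-in for the Nappy endpoint (assumed response shape)."""
    page = params["page"]
    return {
        "images": [{"foreign_identifier": f"img-{page}"}],
        "next_page": page + 1 if page < total_pages else None,
    }


def collect_all(batch_limit=100):
    """Request pages until the response reports no `next_page`."""
    records, params = [], None
    while True:
        params = get_next_query_params(params, batch_limit)
        response = fake_api(params)
        records.extend(response.get("images") or [])
        if not response.get("next_page"):
            return records


print([r["foreign_identifier"] for r in collect_all()])
# prints ['img-1', 'img-2', 'img-3']
```

Carrying the previous parameters forward and changing only `page` keeps `per_page` stable across requests, which is why the batch limit only needs to be set on the first call.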

## `oauth2_authorization`

### OAuth Provider Authorization
5 changes: 3 additions & 2 deletions docker/local_postgres/0004_openledger_image_view.sql
@@ -12,9 +12,10 @@ INSERT INTO public.image_popularity_metrics (
   provider, metric, percentile
 ) VALUES
   ('flickr', 'views', 0.85),
-  ('wikimedia', 'global_usage_count', 0.85),
+  ('nappy', 'downloads', 0.85),
+  ('rawpixel', 'download_count', 0.85),
   ('stocksnap', 'downloads_raw', 0.85),
-  ('rawpixel', 'download_count', 0.85)
+  ('wikimedia', 'global_usage_count', 0.85)
 ;
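
Each row above pairs a provider's raw metric with a percentile of 0.85. This diff does not show how the percentile is consumed, so the following is only a sketch of one plausible normalization. It assumes a constant derived from the metric value observed at that percentile; the formula, both function names, and the sample value of 1700 downloads are illustrative assumptions, not code or data from this commit.

```python
def popularity_constant(percentile, percentile_value):
    """Assumed formula: scale the value found at `percentile` into a constant."""
    if not 0 < percentile < 1:
        raise ValueError("percentile must be strictly between 0 and 1")
    return ((1 - percentile) / percentile) * percentile_value


def standardized_popularity(raw_value, constant):
    """Map a raw count onto [0, 1); assumes constant > 0 and raw_value >= 0."""
    return raw_value / (raw_value + constant)


# Hypothetical numbers: suppose 85% of Nappy images have at most 1700 downloads.
constant = popularity_constant(0.85, 1700.0)
for downloads in (0, 170, 1700, 17000):
    print(downloads, round(standardized_popularity(downloads, constant), 3))
```

Under this assumed formula, an item whose raw metric equals the percentile value scores the percentile itself (about 0.85 here), and larger counts approach 1 without reaching it.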


24 changes: 13 additions & 11 deletions openverse_catalog/dags/common/loader/provider_details.py
@@ -13,26 +13,27 @@


 # Default provider names
-FLICKR_DEFAULT_PROVIDER = "flickr"
-EUROPEANA_DEFAULT_PROVIDER = "europeana"
-WIKIMEDIA_AUDIO_PROVIDER = "wikimedia_audio"
-WIKIMEDIA_DEFAULT_PROVIDER = "wikimedia"
-SMITHSONIAN_DEFAULT_PROVIDER = "smithsonian"
 BROOKLYN_DEFAULT_PROVIDER = "brooklynmuseum"
 CLEVELAND_DEFAULT_PROVIDER = "clevelandmuseum"
+EUROPEANA_DEFAULT_PROVIDER = "europeana"
+FINNISH_DEFAULT_PROVIDER = "finnishmuseums"
+FLICKR_DEFAULT_PROVIDER = "flickr"
+FREESOUND_DEFAULT_PROVIDER = "freesound"
+INATURALIST_DEFAULT_PROVIDER = "inaturalist"
+JAMENDO_DEFAULT_PROVIDER = "jamendo"
 METROPOLITAN_MUSEUM_DEFAULT_PROVIDER = "met"
-VICTORIA_DEFAULT_PROVIDER = "museumsvictoria"
+NAPPY_DEFAULT_PROVIDER = "nappy"
 NYPL_DEFAULT_PROVIDER = "nypl"
 RAWPIXEL_DEFAULT_PROVIDER = "rawpixel"
 SCIENCE_DEFAULT_PROVIDER = "sciencemuseum"
+SMITHSONIAN_DEFAULT_PROVIDER = "smithsonian"
 SMK_DEFAULT_PROVIDER = "smk"
-WALTERS_DEFAULT_PROVIDER = "waltersartmuseum"
-FINNISH_DEFAULT_PROVIDER = "finnishmuseums"
-JAMENDO_DEFAULT_PROVIDER = "jamendo"
 STOCKSNAP_DEFAULT_PROVIDER = "stocksnap"
+VICTORIA_DEFAULT_PROVIDER = "museumsvictoria"
+WALTERS_DEFAULT_PROVIDER = "waltersartmuseum"
+WIKIMEDIA_AUDIO_PROVIDER = "wikimedia_audio"
+WIKIMEDIA_DEFAULT_PROVIDER = "wikimedia"
 WORDPRESS_DEFAULT_PROVIDER = "wordpress"
-FREESOUND_DEFAULT_PROVIDER = "freesound"
-INATURALIST_DEFAULT_PROVIDER = "inaturalist"
 PHYLOPIC_DEFAULT_PROVIDER = "phylopic"

# Finnish parameters
@@ -138,6 +139,7 @@ class ImageCategory(Enum):
     "mccordmuseum": ImageCategory.DIGITIZED_ARTWORK.value,
     "met": ImageCategory.DIGITIZED_ARTWORK.value,
     "museumsvictoria": ImageCategory.DIGITIZED_ARTWORK.value,
+    "nappy": ImageCategory.PHOTOGRAPH.value,
     "phylopic": ImageCategory.ILLUSTRATION.value,
     "rijksmuseum": ImageCategory.DIGITIZED_ARTWORK.value,
     "sciencemuseum": ImageCategory.PHOTOGRAPH.value,
5 changes: 3 additions & 2 deletions openverse_catalog/dags/common/popularity/sql.py
@@ -43,9 +43,10 @@

 IMAGE_POPULARITY_METRICS = {
     "flickr": {"metric": "views"},
-    "wikimedia": {"metric": "global_usage_count"},
-    "stocksnap": {"metric": "downloads_raw"},
+    "nappy": {"metric": "downloads"},
     "rawpixel": {"metric": "download_count"},
+    "stocksnap": {"metric": "downloads_raw"},
+    "wikimedia": {"metric": "global_usage_count"},
 }

AUDIO_POPULARITY_METRICS = {
121 changes: 121 additions & 0 deletions openverse_catalog/dags/providers/provider_api_scripts/nappy.py
@@ -0,0 +1,121 @@
"""
Content Provider: Nappy
ETL Process: Use the API to identify all CC0-licensed images.
Output: TSV file containing the image meta-data.
Notes: This API was written specially for Openverse.
There are no known limits or restrictions.
https://nappy.co/
"""
import logging

from common import constants
from common.licenses import get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester


logger = logging.getLogger(__name__)


class NappyDataIngester(ProviderDataIngester):
    providers = {constants.IMAGE: prov.NAPPY_DEFAULT_PROVIDER}
    endpoint = "https://api.nappy.co/v1/openverse/images"
    headers = {"User-Agent": prov.UA_STRING, "Accept": "application/json"}

    # Hardcoded to CC0, the only license Nappy.co uses
    license_info = get_license_info(
        "https://creativecommons.org/publicdomain/zero/1.0/"
    )

    def get_next_query_params(self, prev_query_params: dict | None, **kwargs) -> dict:
        if not prev_query_params:
            return {
                "page": 1,
                "per_page": self.batch_limit,
            }
        else:
            return {
                **prev_query_params,
                "page": prev_query_params["page"] + 1,
            }

    def get_batch_data(self, response_json):
        if response_json:
            return response_json.get("images")
        return None

    def get_should_continue(self, response_json):
        return bool(response_json.get("next_page"))

    def get_media_type(self, record: dict):
        return constants.IMAGE

    @staticmethod
    def _convert_filesize(raw_filesize_string: str) -> int | None:
        """
        Convert sizes from strings to byte integers, ex. "187.8kB" to 187800.
        """
        FILETYPE_MULTIPLIERS = {"kB": 1000, "MB": 1_000_000, "GB": 1_000_000_000}
        if isinstance(raw_filesize_string, str) and len(raw_filesize_string) > 2:
            stripped = raw_filesize_string.strip()
            if stripped[-2:] in FILETYPE_MULTIPLIERS:
                try:
                    units = float(stripped[:-2])
                except ValueError:
                    return None
                multiplier = FILETYPE_MULTIPLIERS[stripped[-2:]]
                return round(units * multiplier)

    def get_record_data(self, data: dict) -> dict | list[dict] | None:
        if (foreign_landing_url := data.get("foreign_landing_url")) is None:
            return None

        if (image_url := data.get("url")) is None:
            return None

        foreign_identifier = data.get("foreign_identifier")
        thumbnail_url = image_url + "?auto=format&w=600&q=75"
        filesize = self._convert_filesize(data.get("filesize"))
        filetype = data.get("filetype")
        creator = data.get("creator")
        creator_url = data.get("creator_url")
        title = data.get("title")
        meta_data = {
            "views": data.get("views"),
            "saves": data.get("saves"),
            "downloads": data.get("downloads"),
        }
        raw_tags = data.get("tags").split(",")
        width = data.get("width")
        height = data.get("height")

        return {
            "foreign_landing_url": foreign_landing_url,
            "image_url": image_url,
            "thumbnail_url": thumbnail_url,
            "license_info": self.license_info,
            "foreign_identifier": foreign_identifier,
            "filesize": filesize,
            "filetype": filetype,
            "creator": creator,
            "creator_url": creator_url,
            "title": title,
            "meta_data": meta_data,
            "raw_tags": raw_tags,
            "width": width,
            "height": height,
        }


def main():
    logger.info("Begin: Nappy data ingestion")
    ingester = NappyDataIngester()
    ingester.ingest_records()


if __name__ == "__main__":
    main()
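
The filesize helper in the file above converts strings such as "187.8kB" into byte counts. A standalone sketch of the same idea follows, so it can be tried without the catalog's imports. The function name `convert_filesize` and the extra guard for infinite values are assumptions added for illustration; the decimal multipliers match those in the diff.

```python
FILESIZE_MULTIPLIERS = {"kB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}


def convert_filesize(raw):
    """Convert a string like "187.8kB" to bytes; return None if unparseable."""
    if not isinstance(raw, str):
        return None
    stripped = raw.strip()
    number, unit = stripped[:-2], stripped[-2:]
    if unit not in FILESIZE_MULTIPLIERS:
        return None
    try:
        # round() raises ValueError for NaN and OverflowError for infinity.
        return round(float(number) * FILESIZE_MULTIPLIERS[unit])
    except (ValueError, OverflowError):
        return None


for sample in ("187.8kB", " 1.5MB ", "2GB", "kB", "12 apples", None):
    print(repr(sample), "->", convert_filesize(sample))
```

Returning `None` for anything unrecognized, rather than raising, lets a record be kept without a file size when the API value is malformed.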
5 changes: 5 additions & 0 deletions openverse_catalog/dags/providers/provider_workflows.py
@@ -13,6 +13,7 @@
from providers.provider_api_scripts.jamendo import JamendoDataIngester
from providers.provider_api_scripts.metropolitan_museum import MetMuseumDataIngester
from providers.provider_api_scripts.museum_victoria import VictoriaDataIngester
from providers.provider_api_scripts.nappy import NappyDataIngester
from providers.provider_api_scripts.nypl import NyplDataIngester
from providers.provider_api_scripts.phylopic import PhylopicDataIngester
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester
@@ -160,6 +161,10 @@ def __post_init__(self):
        ingester_class=VictoriaDataIngester,
        start_date=datetime(2020, 1, 1),
    ),
    ProviderWorkflow(
        ingester_class=NappyDataIngester,
        start_date=datetime(2022, 12, 1),
    ),
    ProviderWorkflow(
        ingester_class=NyplDataIngester,
        start_date=datetime(2020, 1, 1),