This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Merge branch 'extract_media_storage' into add_audio_storage
obulat authored Jun 17, 2021
2 parents f3785d0 + 273ea6a commit f08299a
Showing 20 changed files with 366 additions and 294 deletions.
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,2 @@
# All files are owned by the Openverse Developers team
* @WordPress/openverse-developers
@@ -2,27 +2,28 @@ name: flake8_lint

on:
push:
branches:
- main
paths:
- '**.py'
pull_request:
paths:
- '**.py'


jobs:
build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v1
- name: Set up Python 3.7
- name: Set up Python 3.9
uses: actions/setup-python@v1
with:
python-version: 3.7
python-version: 3.9
- name: Setup flake8 annotations # To annotate the errors in the pull request
uses: rbialon/flake8-annotations@v1
- name: Lint with flake8
# Lint for only changed files, rather than the entire repository
# Lint only for changed files, rather than the entire repository
run: |
pip install flake8>=3.7.0
git diff HEAD^ | flake8 --diff
@@ -1,8 +1,15 @@
name: Automated testing workflow
on: [push, pull_request]

on:
push:
branches:
- main
pull_request:

jobs:
test:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- name: Build the stack
2 changes: 1 addition & 1 deletion .github/workflows/release-drafter.yml
@@ -3,7 +3,7 @@ name: Release Drafter
on:
push:
branches:
- develop
- main

jobs:
update_release_draft:
48 changes: 18 additions & 30 deletions README.md
@@ -49,40 +49,29 @@ The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

### Common API Workflows
### Daily API Workflows

The Airflow DAGs defined in [`common_api_workflows.py`][api_flows] manage daily
ETL jobs for the following platforms, by running the linked scripts:
Workflows with a `schedule_string='@daily'` parameter run once a day. These DAG
workflows run `provider_api_scripts` to extract media data from the APIs and load
it into the catalog. Some of the daily DAG workflows and their corresponding
`provider_api_scripts`:

- [Met Museum](src/cc_catalog_airflow/dags/provider_api_scripts/metropolitan_museum_of_art.py)
- [PhyloPic](src/cc_catalog_airflow/dags/provider_api_scripts/phylopic.py)
- [Thingiverse](src/cc_catalog_airflow/dags/provider_api_scripts/Thingiverse.py)

[api_flows]: src/cc_catalog_airflow/dags/common_api_workflows.py

### Other Daily API Workflows

Airflow DAGs, defined in their own files, also run the following scripts daily:

- [Flickr](src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py)
- [Wikimedia Commons](src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py)

In the future, we'll migrate to the latter style of Airflow DAGs and
accompanying Provider API Scripts.
- [Met Museum Workflow](src/cc_catalog_airflow/dags/metropolitan_museum_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/metropolitan_museum_of_art.py) )
- [PhyloPic Workflow](src/cc_catalog_airflow/dags/phylopic_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/phylopic.py) )
- [Flickr Workflow](src/cc_catalog_airflow/dags/flickr_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py) )
- [Wikimedia Commons Workflow](src/cc_catalog_airflow/dags/wikimedia_workflow.py) ( [Commons API script](src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py) )

### Monthly Workflow

The Airflow DAG defined in [`monthlyWorkflow.py`][mon_flow] handles the monthly
jobs that are scheduled to run on the 15th day of each month at 16:00 UTC. This
workflow is reserved for long-running jobs or APIs that do not have date
filtering capabilities so the data is reprocessed monthly to keep the catalog
updated. The following tasks are performed:
Some API ingestion workflows are scheduled to run on the 15th day of each
month at 16:00 UTC. These workflows are reserved for long-running jobs or
APIs that do not have date filtering capabilities, so the data is reprocessed
monthly to keep the catalog updated. The following tasks are performed monthly:

- [Cleveland Museum of Art](src/cc_catalog_airflow/dags/provider_api_scripts/ClevelandMuseum.py)
- [RawPixel](src/cc_catalog_airflow/dags/provider_api_scripts/RawPixel.py)
- [Cleveland Museum of Art](src/cc_catalog_airflow/dags/provider_api_scripts/cleveland_museum_of_art.py)
- [RawPixel](src/cc_catalog_airflow/dags/provider_api_scripts/raw_pixel.py)
- [Common Crawl Syncer](src/cc_catalog_airflow/dags/commoncrawl_s3_syncer/SyncImageProviders.py)

[mon_flow]: src/cc_catalog_airflow/dags/monthlyWorkflow.py
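For illustration, the two schedules described above can be modeled with a small helper. This is a hypothetical sketch, not code from the repository; it assumes the monthly workflows use the standard cron expression `0 16 15 * *` for "15th of each month at 16:00 UTC":

```python
from datetime import datetime


def should_run(schedule_string: str, now: datetime) -> bool:
    """Illustrate the two schedule_string values mentioned in this README:
    '@daily' fires every day, while the monthly ingestion workflows fire
    on the 15th of each month at 16:00 UTC ('0 16 15 * *' in cron terms).
    """
    if schedule_string == "@daily":
        return True
    if schedule_string == "0 16 15 * *":
        return now.day == 15 and now.hour == 16 and now.minute == 0
    raise ValueError(f"unsupported schedule: {schedule_string}")
```

In a real Airflow DAG this decision is made by the scheduler itself; the helper only makes the two cadences concrete.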

### DB_Loader

@@ -92,11 +81,10 @@ into the upstream database. It includes some data preprocessing steps.

[db_loader]: src/cc_catalog_airflow/dags/loader_workflow.py

### Other API Jobs (not in the workflow)
### Other API Jobs

- [Brooklyn Museum](src/cc_catalog_airflow/dags/provider_api_scripts/BrooklynMuseum.py)
- [NYPL](src/cc_catalog_airflow/dags/provider_api_scripts/NYPL.py)
- Cleveland Public Library
- [Brooklyn Museum](src/cc_catalog_airflow/dags/provider_api_scripts/brooklyn_museum.py)
- [NYPL](src/cc_catalog_airflow/dags/provider_api_scripts/nypl.py)

## Development setup for Airflow and API puller scripts

5 changes: 4 additions & 1 deletion src/cc_catalog_airflow/dags/common/__init__.py
@@ -1,3 +1,6 @@
# flake8: noqa
from .licenses import constants, licenses
from .storage.image import Image, ImageStore, MockImageStore
from .storage.image import (
Image, ImageStore, MockImageStore
)
from .requester import DelayedRequester
10 changes: 5 additions & 5 deletions src/cc_catalog_airflow/dags/common/requester.py
@@ -33,24 +33,24 @@ def get(self, url, params=None, **kwargs):
params: Dictionary of query string params
**kwargs: Optional arguments that will be passed to `requests.get`
"""
logger.info(f'Processing request for url: {url}')
logger.info(f'Using query parameters {params}')
logger.info(f'Using headers {kwargs.get("headers")}')
self._delay_processing()
self._last_request = time.time()
try:
response = requests.get(url, params=params, **kwargs)
if response.status_code == requests.codes.ok:
logger.info(f'Received response from url {response.url}')
return response
else:
logger.warning(
f'Unable to request URL: {url}. '
f'Unable to request URL: {response.url}. '
f'Status code: {response.status_code}'
)
return response
except Exception as e:
logger.error('There was an error with the request.')
logger.error(f'Error with the request for url: {url}.')
logger.info(f'{type(e).__name__}: {e}')
logger.info(f'Using query parameters {params}')
logger.info(f'Using headers {kwargs.get("headers")}')
return None

def _delay_processing(self):
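The pacing logic visible in this diff (`_delay_processing` followed by recording `_last_request`) can be sketched as follows. The class name and the `delay` constructor argument are assumptions for illustration; the real `DelayedRequester` wraps `requests.get` and adds logging:

```python
import time


class RateLimitedGetter:
    """Hedged sketch of DelayedRequester's pacing: ensure at least `delay`
    seconds elapse between consecutive requests."""

    def __init__(self, delay=1.0):
        self._delay = delay
        self._last_request = 0.0  # monotonic timestamp of the previous request

    def _delay_processing(self):
        # Sleep only for the remainder of the delay window, if any.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self._delay:
            time.sleep(self._delay - elapsed)

    def get(self, fetch):
        # `fetch` stands in for the real requests.get call.
        self._delay_processing()
        self._last_request = time.monotonic()
        return fetch()
```

Recording the timestamp before the network call (as the diff above also does) means the delay counts from when the request was sent, not from when the response arrived.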
131 changes: 58 additions & 73 deletions src/cc_catalog_airflow/dags/common/storage/image.py
@@ -1,6 +1,6 @@
from collections import namedtuple
import logging
from typing import Optional
from typing import Optional, Dict, Union

from common.storage import columns
from common.storage.media import MediaStore
@@ -89,8 +89,11 @@ def __init__(
media_type="image",
tsv_columns=None,
):
super().__init__(provider, output_file, output_dir, buffer_length, media_type)
self.columns = IMAGE_TSV_COLUMNS if tsv_columns is None else tsv_columns
super().__init__(
provider, output_file, output_dir, buffer_length, media_type
)
self.columns = IMAGE_TSV_COLUMNS \
if tsv_columns is None else tsv_columns

def add_item(
self,
@@ -101,14 +104,14 @@ add_item(
license_: Optional[str] = None,
license_version: Optional[str] = None,
foreign_identifier: Optional[str] = None,
width=None,
height=None,
creator=None,
creator_url=None,
title=None,
meta_data=None,
width: Optional[int] = None,
height: Optional[int] = None,
creator: Optional[str] = None,
creator_url: Optional[str] = None,
title: Optional[str] = None,
meta_data: Optional[Union[Dict, str]] = None,
raw_tags=None,
watermarked="f",
watermarked: Optional[str] = "f",
source: Optional[str] = None,
):
"""
@@ -162,75 +165,57 @@ add_item(
ImageStore init function is the specific
provider of the image.
"""
image = self._get_image(
foreign_landing_url=foreign_landing_url,
image_url=image_url,
thumbnail_url=thumbnail_url,
license_url=license_url,
license_=license_,
license_version=license_version,
foreign_identifier=foreign_identifier,
width=width,
height=height,
creator=creator,
creator_url=creator_url,
title=title,
meta_data=meta_data,
raw_tags=raw_tags,
watermarked=watermarked,
source=source,
valid_license, raw_license_url = self.get_valid_license_info(
license_url,
license_,
license_version
)
if valid_license.license is None:
logger.debug(
f"Invalid image license: {license_url}, "
f"{license_}, {license_version}")
return None
image_data = {
'foreign_landing_url': foreign_landing_url,
'image_url': image_url,
'thumbnail_url': thumbnail_url,
'license_url': valid_license.url,
'license_': valid_license.license,
'license_version': valid_license.version,
'raw_license_url': raw_license_url,
'foreign_identifier': foreign_identifier,
'width': width,
'height': height,
'creator': creator,
'creator_url': creator_url,
'title': title,
'meta_data': meta_data,
'raw_tags': raw_tags,
'watermarked': watermarked,
'source': source,
}
image = self._get_image(**image_data)
if image is not None:
self.save_item(image)
return self._total_items
return self.total_items

def _get_image(
self,
foreign_identifier,
foreign_landing_url,
image_url,
thumbnail_url,
width,
height,
license_url,
license_,
license_version,
creator,
creator_url,
title,
meta_data,
raw_tags,
watermarked,
source,
):
valid_license_info, raw_license_url = self.get_valid_license_info(
license_url, license_, license_version
)
if valid_license_info.license is None:
return None
source, meta_data, tags = self.parse_item_metadata(
valid_license_info.url, raw_license_url, source, meta_data, raw_tags
)
def _get_image(self, **kwargs) -> Image:
"""Validates image information and returns Image namedtuple"""

return Image(
foreign_identifier=foreign_identifier,
foreign_landing_url=foreign_landing_url,
image_url=image_url,
thumbnail_url=thumbnail_url,
license_=valid_license_info.license,
license_version=valid_license_info.version,
width=width,
height=height,
filesize=None,
creator=creator,
creator_url=creator_url,
title=title,
meta_data=meta_data,
tags=tags,
watermarked=watermarked,
provider=self._PROVIDER,
source=source,
kwargs['source'] = self.get_source(kwargs['source'])
kwargs['meta_data'], kwargs['tags'] = self.parse_item_metadata(
kwargs['license_url'],
kwargs.get('raw_license_url'),
kwargs.get('meta_data'),
kwargs.get('raw_tags'),
)
kwargs.pop('raw_tags', None)
kwargs.pop('raw_license_url', None)
kwargs.pop('license_url', None)
kwargs['provider'] = self._PROVIDER
kwargs['filesize'] = None

return Image(**kwargs)


class MockImageStore(ImageStore):
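The refactor in this file replaces a long positional call with a keyword dictionary that is normalised and then unpacked into the `Image` namedtuple. A stripped-down, hypothetical version of that pattern (the field list here is a trimmed illustration, not the real schema):

```python
from collections import namedtuple

# Hypothetical, reduced field list; the real Image namedtuple has many more.
Image = namedtuple("Image", ["image_url", "license_", "provider", "filesize", "tags"])


def build_image(provider, **kwargs):
    """Sketch of the _get_image approach: adjust the kwargs dict in place,
    drop keys the namedtuple does not accept, then unpack with **kwargs."""
    kwargs["tags"] = kwargs.pop("raw_tags", None)  # parsed from metadata in real code
    kwargs.pop("license_url", None)                # consumed during license validation
    kwargs["provider"] = provider                  # set by the store, not the caller
    kwargs["filesize"] = None                      # unknown at ingestion time
    return Image(**kwargs)
```

Because `Image(**kwargs)` raises a `TypeError` on any unexpected key, the explicit `pop` calls (mirroring the ones in the diff above) are what keep the dict and the namedtuple's fields in sync.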
