This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Merge branch 'extract_media_storage' into add_audio_storage
obulat authored Jun 17, 2021
2 parents f3785d0 + 273ea6a commit f08299a
Showing 20 changed files with 366 additions and 294 deletions.
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,2 @@
# All files are owned by the Openverse Developers team
* @WordPress/openverse-developers
@@ -2,27 +2,28 @@ name: flake8_lint

on:
push:
branches:
- main
paths:
- '**.py'
pull_request:
paths:
- '**.py'


jobs:
build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v1
- name: Set up Python 3.7
- name: Set up Python 3.9
uses: actions/setup-python@v1
with:
python-version: 3.7
python-version: 3.9
- name: Setup flake8 annotations # To annotate the errors in the pull request
uses: rbialon/flake8-annotations@v1
- name: Lint with flake8
# Lint for only changed files, rather than the entire repository
# Lint only for changed files, rather than the entire repository
run: |
pip install flake8>=3.7.0
git diff HEAD^ | flake8 --diff
@@ -1,8 +1,15 @@
name: Automated testing workflow
on: [push, pull_request]

on:
push:
branches:
- main
pull_request:

jobs:
test:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- name: Build the stack
2 changes: 1 addition & 1 deletion .github/workflows/release-drafter.yml
@@ -3,7 +3,7 @@ name: Release Drafter
on:
push:
branches:
- develop
- main

jobs:
update_release_draft:
48 changes: 18 additions & 30 deletions README.md
@@ -49,40 +49,29 @@ The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

### Common API Workflows
### Daily API Workflows

The Airflow DAGs defined in [`common_api_workflows.py`][api_flows] manage daily
ETL jobs for the following platforms, by running the linked scripts:
Workflows with a `schedule_string='@daily'` parameter run once a day. These DAG
workflows run `provider_api_scripts` to extract media data from the APIs and load
it into the catalog. Some of the daily DAG workflows and their corresponding
`provider_api_scripts`:

- [Met Museum](src/cc_catalog_airflow/dags/provider_api_scripts/metropolitan_museum_of_art.py)
- [PhyloPic](src/cc_catalog_airflow/dags/provider_api_scripts/phylopic.py)
- [Thingiverse](src/cc_catalog_airflow/dags/provider_api_scripts/Thingiverse.py)

[api_flows]: src/cc_catalog_airflow/dags/common_api_workflows.py

### Other Daily API Workflows

Airflow DAGs, defined in their own files, also run the following scripts daily:

- [Flickr](src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py)
- [Wikimedia Commons](src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py)

In the future, we'll migrate to the latter style of Airflow DAGs and
accompanying Provider API Scripts.
- [Met Museum Workflow](src/cc_catalog_airflow/dags/metropolitan_museum_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/metropolitan_museum_of_art.py) )
- [PhyloPic Workflow](src/cc_catalog_airflow/dags/phylopic_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/phylopic.py) )
- [Flickr Workflow](src/cc_catalog_airflow/dags/flickr_workflow.py) ( [API script](src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py) )
- [Wikimedia Commons Workflow](src/cc_catalog_airflow/dags/wikimedia_workflow.py) ( [Commons API script](src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py) )

### Monthly Workflow

The Airflow DAG defined in [`monthlyWorkflow.py`][mon_flow] handles the monthly
jobs that are scheduled to run on the 15th day of each month at 16:00 UTC. This
workflow is reserved for long-running jobs or APIs that do not have date
filtering capabilities so the data is reprocessed monthly to keep the catalog
updated. The following tasks are performed:
Some API ingestion workflows are scheduled to run on the 15th day of each
month at 16:00 UTC. These workflows are reserved for long-running jobs or
APIs that do not have date filtering capabilities, so the data is reprocessed
monthly to keep the catalog updated. The following tasks are performed monthly:

- [Cleveland Museum of Art](src/cc_catalog_airflow/dags/provider_api_scripts/ClevelandMuseum.py)
- [RawPixel](src/cc_catalog_airflow/dags/provider_api_scripts/RawPixel.py)
- [Cleveland Museum of Art](src/cc_catalog_airflow/dags/provider_api_scripts/cleveland_museum_of_art.py)
- [RawPixel](src/cc_catalog_airflow/dags/provider_api_scripts/raw_pixel.py)
- [Common Crawl Syncer](src/cc_catalog_airflow/dags/commoncrawl_s3_syncer/SyncImageProviders.py)

[mon_flow]: src/cc_catalog_airflow/dags/monthlyWorkflow.py
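For illustration, the two schedules described above can be modeled with a small helper. This is a hypothetical sketch, not code from the repository; it assumes the monthly workflows use the standard cron expression `0 16 15 * *` for "15th of each month at 16:00 UTC":

```python
from datetime import datetime


def should_run(schedule_string: str, now: datetime) -> bool:
    """Illustrate the two schedule_string values mentioned in this README:
    '@daily' fires every day, while the monthly ingestion workflows fire
    on the 15th of each month at 16:00 UTC ('0 16 15 * *' in cron terms).
    """
    if schedule_string == "@daily":
        return True
    if schedule_string == "0 16 15 * *":
        return now.day == 15 and now.hour == 16 and now.minute == 0
    raise ValueError(f"unsupported schedule: {schedule_string}")
```

In a real Airflow DAG this decision is made by the scheduler itself; the helper only makes the two cadences concrete.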

### DB_Loader

@@ -92,11 +81,10 @@ into the upstream database. It includes some data preprocessing steps.

[db_loader]: src/cc_catalog_airflow/dags/loader_workflow.py

### Other API Jobs (not in the workflow)
### Other API Jobs

- [Brooklyn Museum](src/cc_catalog_airflow/dags/provider_api_scripts/BrooklynMuseum.py)
- [NYPL](src/cc_catalog_airflow/dags/provider_api_scripts/NYPL.py)
- Cleveland Public Library
- [Brooklyn Museum](src/cc_catalog_airflow/dags/provider_api_scripts/brooklyn_museum.py)
- [NYPL](src/cc_catalog_airflow/dags/provider_api_scripts/nypl.py)

## Development setup for Airflow and API puller scripts

5 changes: 4 additions & 1 deletion src/cc_catalog_airflow/dags/common/__init__.py
@@ -1,3 +1,6 @@
# flake8: noqa
from .licenses import constants, licenses
from .storage.image import Image, ImageStore, MockImageStore
from .storage.image import (
Image, ImageStore, MockImageStore
)
from .requester import DelayedRequester
10 changes: 5 additions & 5 deletions src/cc_catalog_airflow/dags/common/requester.py
@@ -33,24 +33,24 @@ def get(self, url, params=None, **kwargs):
params: Dictionary of query string params
**kwargs: Optional arguments that will be passed to `requests.get`
"""
logger.info(f'Processing request for url: {url}')
logger.info(f'Using query parameters {params}')
logger.info(f'Using headers {kwargs.get("headers")}')
self._delay_processing()
self._last_request = time.time()
try:
response = requests.get(url, params=params, **kwargs)
if response.status_code == requests.codes.ok:
logger.info(f'Received response from url {response.url}')
return response
else:
logger.warning(
f'Unable to request URL: {url}. '
f'Unable to request URL: {response.url}. '
f'Status code: {response.status_code}'
)
return response
except Exception as e:
logger.error('There was an error with the request.')
logger.error(f'Error with the request for url: {url}.')
logger.info(f'{type(e).__name__}: {e}')
logger.info(f'Using query parameters {params}')
logger.info(f'Using headers {kwargs.get("headers")}')
return None

def _delay_processing(self):
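The pacing logic visible in this diff (`_delay_processing` followed by recording `_last_request`) can be sketched as follows. The class name and the `delay` constructor argument are assumptions for illustration; the real `DelayedRequester` wraps `requests.get` and adds logging:

```python
import time


class RateLimitedGetter:
    """Hedged sketch of DelayedRequester's pacing: ensure at least `delay`
    seconds elapse between consecutive requests."""

    def __init__(self, delay=1.0):
        self._delay = delay
        self._last_request = 0.0  # monotonic timestamp of the previous request

    def _delay_processing(self):
        # Sleep only for the remainder of the delay window, if any.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self._delay:
            time.sleep(self._delay - elapsed)

    def get(self, fetch):
        # `fetch` stands in for the real requests.get call.
        self._delay_processing()
        self._last_request = time.monotonic()
        return fetch()
```

Recording the timestamp before the network call (as the diff above also does) means the delay counts from when the request was sent, not from when the response arrived.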
131 changes: 58 additions & 73 deletions src/cc_catalog_airflow/dags/common/storage/image.py
@@ -1,6 +1,6 @@
from collections import namedtuple
import logging
from typing import Optional
from typing import Optional, Dict, Union

from common.storage import columns
from common.storage.media import MediaStore
@@ -89,8 +89,11 @@ def __init__(
media_type="image",
tsv_columns=None,
):
super().__init__(provider, output_file, output_dir, buffer_length, media_type)
self.columns = IMAGE_TSV_COLUMNS if tsv_columns is None else tsv_columns
super().__init__(
provider, output_file, output_dir, buffer_length, media_type
)
self.columns = IMAGE_TSV_COLUMNS \
if tsv_columns is None else tsv_columns

def add_item(
self,
@@ -101,14 +104,14 @@ add_item(
license_: Optional[str] = None,
license_version: Optional[str] = None,
foreign_identifier: Optional[str] = None,
width=None,
height=None,
creator=None,
creator_url=None,
title=None,
meta_data=None,
width: Optional[int] = None,
height: Optional[int] = None,
creator: Optional[str] = None,
creator_url: Optional[str] = None,
title: Optional[str] = None,
meta_data: Optional[Union[Dict, str]] = None,
raw_tags=None,
watermarked="f",
watermarked: Optional[str] = "f",
source: Optional[str] = None,
):
"""
@@ -162,75 +165,57 @@ add_item(
ImageStore init function is the specific
provider of the image.
"""
image = self._get_image(
foreign_landing_url=foreign_landing_url,
image_url=image_url,
thumbnail_url=thumbnail_url,
license_url=license_url,
license_=license_,
license_version=license_version,
foreign_identifier=foreign_identifier,
width=width,
height=height,
creator=creator,
creator_url=creator_url,
title=title,
meta_data=meta_data,
raw_tags=raw_tags,
watermarked=watermarked,
source=source,
valid_license, raw_license_url = self.get_valid_license_info(
license_url,
license_,
license_version
)
if valid_license.license is None:
logger.debug(
f"Invalid image license: {license_url}, "
f"{license_}, {license_version}")
return None
image_data = {
'foreign_landing_url': foreign_landing_url,
'image_url': image_url,
'thumbnail_url': thumbnail_url,
'license_url': valid_license.url,
'license_': valid_license.license,
'license_version': valid_license.version,
'raw_license_url': raw_license_url,
'foreign_identifier': foreign_identifier,
'width': width,
'height': height,
'creator': creator,
'creator_url': creator_url,
'title': title,
'meta_data': meta_data,
'raw_tags': raw_tags,
'watermarked': watermarked,
'source': source,
}
image = self._get_image(**image_data)
if image is not None:
self.save_item(image)
return self._total_items
return self.total_items

def _get_image(
self,
foreign_identifier,
foreign_landing_url,
image_url,
thumbnail_url,
width,
height,
license_url,
license_,
license_version,
creator,
creator_url,
title,
meta_data,
raw_tags,
watermarked,
source,
):
valid_license_info, raw_license_url = self.get_valid_license_info(
license_url, license_, license_version
)
if valid_license_info.license is None:
return None
source, meta_data, tags = self.parse_item_metadata(
valid_license_info.url, raw_license_url, source, meta_data, raw_tags
)
def _get_image(self, **kwargs) -> Image:
"""Validates image information and returns Image namedtuple"""

return Image(
foreign_identifier=foreign_identifier,
foreign_landing_url=foreign_landing_url,
image_url=image_url,
thumbnail_url=thumbnail_url,
license_=valid_license_info.license,
license_version=valid_license_info.version,
width=width,
height=height,
filesize=None,
creator=creator,
creator_url=creator_url,
title=title,
meta_data=meta_data,
tags=tags,
watermarked=watermarked,
provider=self._PROVIDER,
source=source,
kwargs['source'] = self.get_source(kwargs['source'])
kwargs['meta_data'], kwargs['tags'] = self.parse_item_metadata(
kwargs['license_url'],
kwargs.get('raw_license_url'),
kwargs.get('meta_data'),
kwargs.get('raw_tags'),
)
kwargs.pop('raw_tags', None)
kwargs.pop('raw_license_url', None)
kwargs.pop('license_url', None)
kwargs['provider'] = self._PROVIDER
kwargs['filesize'] = None

return Image(**kwargs)


class MockImageStore(ImageStore):
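The refactor in this file replaces a long positional call with a keyword dictionary that is normalised and then unpacked into the `Image` namedtuple. A stripped-down, hypothetical version of that pattern (the field list here is a trimmed illustration, not the real schema):

```python
from collections import namedtuple

# Hypothetical, reduced field list; the real Image namedtuple has many more.
Image = namedtuple("Image", ["image_url", "license_", "provider", "filesize", "tags"])


def build_image(provider, **kwargs):
    """Sketch of the _get_image approach: adjust the kwargs dict in place,
    drop keys the namedtuple does not accept, then unpack with **kwargs."""
    kwargs["tags"] = kwargs.pop("raw_tags", None)  # parsed from metadata in real code
    kwargs.pop("license_url", None)                # consumed during license validation
    kwargs["provider"] = provider                  # set by the store, not the caller
    kwargs["filesize"] = None                      # unknown at ingestion time
    return Image(**kwargs)
```

Because `Image(**kwargs)` raises a `TypeError` on any unexpected key, the explicit `pop` calls (mirroring the ones in the diff above) are what keep the dict and the namedtuple's fields in sync.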
