This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Rawpixel to use ProviderDataIngester #795

Merged Oct 27, 2022 (29 commits)
Changes from 26 commits
Commits
c741826
Initial refactor for Rawpixel
AetherUnbound Oct 11, 2022
1d37200
Format JSON
AetherUnbound Oct 11, 2022
a93a5cf
Move everything into class
AetherUnbound Oct 11, 2022
e396059
Simplify ID get, add signature w/ notes (it is incomplete)
AetherUnbound Oct 11, 2022
733c1d3
Add logic for injecting API signature
AetherUnbound Oct 11, 2022
9976c1f
Add NO_LICENSE_FOUND constant
AetherUnbound Oct 12, 2022
111a97a
Improve license capture
AetherUnbound Oct 12, 2022
f7dae1d
Add function for retrieving direct URL
AetherUnbound Oct 12, 2022
d1defff
Add logic to remove fluff text from title
AetherUnbound Oct 13, 2022
a893dc1
Fill out more fields
AetherUnbound Oct 13, 2022
cd150d9
Add popularity data
AetherUnbound Oct 13, 2022
162b696
Add ingestion callable to workflows list
AetherUnbound Oct 13, 2022
fdad32e
Update data based on new API response
AetherUnbound Oct 13, 2022
44e9e55
Use style_uri to determine image URL
AetherUnbound Oct 13, 2022
7005ca0
Remove get_response_json override, add docs
AetherUnbound Oct 13, 2022
60f35a0
Update tests
AetherUnbound Oct 13, 2022
2b2a82d
Update DAGs.md
AetherUnbound Oct 14, 2022
cf23930
Add Rawpixel key to env.template & sort values
AetherUnbound Oct 14, 2022
2452e60
Add thumbnail capture into meta_data
AetherUnbound Oct 21, 2022
0d782cf
Remove unnecessary logging line
AetherUnbound Oct 21, 2022
129a9f7
Comment out Airflow Variable API key examples
AetherUnbound Oct 21, 2022
a3ef681
Add reference to image popularity metrics calculation
AetherUnbound Oct 21, 2022
7c3fa33
Remove thumbnail extraction for now
AetherUnbound Oct 21, 2022
1d6829a
Fine-tune string cleaning a bit more
AetherUnbound Oct 21, 2022
cc76ae8
More regex fine-tuning
AetherUnbound Oct 21, 2022
b6538c8
Revert "Comment out Airflow Variable API key examples"
AetherUnbound Oct 21, 2022
7b6dc78
Fix tests
AetherUnbound Oct 25, 2022
44c5d9c
Update documentation
AetherUnbound Oct 25, 2022
c803f3a
Rename _get_source to _get_creator
AetherUnbound Oct 26, 2022
20 changes: 19 additions & 1 deletion DAGs.md
@@ -86,7 +86,7 @@ The following are DAGs grouped by their primary tag:
| `museum_victoria_workflow` | `@monthly` | `False` | image |
| `nypl_workflow` | `@monthly` | `False` | image |
| [`phylopic_workflow`](#phylopic_workflow) | `@daily` | `True` | image |
| `rawpixel_workflow` | `@monthly` | `False` | image |
| [`rawpixel_workflow`](#rawpixel_workflow) | `@monthly` | `False` | image |
| `science_museum_workflow` | `@monthly` | `False` | image |
| [`smithsonian_workflow`](#smithsonian_workflow) | `@weekly` | `False` | image |
| `smk_workflow` | `@monthly` | `False` | image |
@@ -125,6 +125,7 @@ The following is documentation associated with each DAG (where available):
1. [`oauth2_token_refresh`](#oauth2_token_refresh)
1. [`phylopic_workflow`](#phylopic_workflow)
1. [`pr_review_reminders`](#pr_review_reminders)
1. [`rawpixel_workflow`](#rawpixel_workflow)
1. [`recreate_audio_popularity_calculation`](#recreate_audio_popularity_calculation)
1. [`recreate_image_popularity_calculation`](#recreate_image_popularity_calculation)
1. [`report_pending_reported_media`](#report_pending_reported_media)
@@ -466,6 +467,23 @@ author of the PR to re-assign review if one of the randomly selected reviewers
is unavailable for the time period during which the PR should be reviewed.


## `rawpixel_workflow`


Content Provider: Rawpixel

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the image meta-data.

Notes: Rawpixel has given us beta access to their API.
This API is undocumented, and we will need to contact Rawpixel
directly if we run into any issues.
The public API max results range is limited to 100,000 results,
although the API key we've been given can circumvent this limit.
https://www.rawpixel.com/api/v1/search?tags=$publicdomain&page=1&pagesize=100


## `recreate_audio_popularity_calculation`


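Editor's note: the example URL in the notes above implies page-based pagination with `page` and `pagesize` parameters. A minimal sketch of advancing through pages might look like the following. The function names `next_query_params` and `build_url` are hypothetical and are not taken from this PR; only the endpoint, the `$publicdomain` tag, and the parameter names come from the URL quoted in the notes.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint and parameter names are copied from the example URL in the DAG notes.
ENDPOINT = "https://www.rawpixel.com/api/v1/search"


def next_query_params(prev_params: Optional[dict], page_size: int = 100) -> dict:
    """Hypothetical helper: start at page 1, then advance one page per call."""
    if prev_params is None:
        return {"tags": "$publicdomain", "page": 1, "pagesize": page_size}
    return {**prev_params, "page": prev_params["page"] + 1}


def build_url(params: dict) -> str:
    """Hypothetical helper: render the query string for a set of parameters."""
    # safe="$" leaves the literal "$" in the tag unescaped, as in the quoted URL.
    return f"{ENDPOINT}?{urlencode(params, safe='$')}"


first = next_query_params(None)
second = next_query_params(first)
print(build_url(first))
print(build_url(second))
```

Keeping the parameters in a plain dict and deriving each page from the previous one means the ingestion loop does not need to track a separate page counter.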
7 changes: 6 additions & 1 deletion docker/local_postgres/0004_openledger_image_view.sql
@@ -5,12 +5,17 @@ CREATE TABLE public.image_popularity_metrics (
);


-- For more information on these values see:
-- https://github.com/cc-archive/cccatalog/issues/405#issuecomment-629233047
-- https://github.com/cc-archive/cccatalog/pull/477
INSERT INTO public.image_popularity_metrics (
provider, metric, percentile
) VALUES
('flickr', 'views', 0.85),
('wikimedia', 'global_usage_count', 0.85),
('stocksnap', 'downloads_raw', 0.85);
('stocksnap', 'downloads_raw', 0.85),
('rawpixel', 'download_count', 0.85)
Review comment (Contributor): Cool! This has me curious whether there are any other easy wins for popularity metrics with our other providers; created #815.

;


CREATE FUNCTION image_popularity_percentile(
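Editor's note: the hunk above registers `download_count` as Rawpixel's popularity metric with a percentile of 0.85. The sketch below shows one way such a percentile could be turned into a normalized score. It assumes the normalization has the form `value / (value + constant)`, with the constant derived from the metric's value at the chosen percentile; that formula is an assumption for illustration and does not appear in this diff, and both function names are hypothetical.

```python
def popularity_constant(percentile: float, percentile_value: float) -> float:
    """Assumed form: pick the constant so that an item sitting exactly at the
    chosen percentile receives a score equal to that percentile."""
    return ((1 - percentile) / percentile) * percentile_value


def standardized_popularity(value: float, constant: float) -> float:
    """Assumed form: map a raw count into the range [0, 1)."""
    if value + constant == 0:
        return 0.0  # avoid dividing by zero when there is no data at all
    return value / (value + constant)


# Hypothetical numbers: suppose the 85th-percentile download_count is 200.
constant = popularity_constant(0.85, 200)
print(standardized_popularity(200, constant))
print(standardized_popularity(0, constant))
```

Under this assumed form, items with a metric above the 85th-percentile value score above 0.85 and approach 1 as the count grows, while items with no downloads score 0.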
4 changes: 3 additions & 1 deletion env.template
@@ -28,10 +28,12 @@ AIRFLOW_VAR_API_KEY_BROOKLYN_MUSEUM=not_set
AIRFLOW_VAR_API_KEY_DATA_GOV=not_set
AIRFLOW_VAR_API_KEY_EUROPEANA=not_set
AIRFLOW_VAR_API_KEY_FLICKR=not_set
AIRFLOW_VAR_API_KEY_FREESOUND=not_set
AIRFLOW_VAR_API_KEY_JAMENDO=not_set
AIRFLOW_VAR_API_KEY_NYPL=not_set
AIRFLOW_VAR_API_KEY_RAWPIXEL=not_set
AIRFLOW_VAR_API_KEY_THINGIVERSE=not_set
AIRFLOW_VAR_API_KEY_FREESOUND=not_set
AIRFLOW_VAR_API_KEY_WALTERS_ART_MUSEUM=not_set


########################################################################################
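Editor's note: Airflow resolves a Variable named `API_KEY_RAWPIXEL` from an environment variable named `AIRFLOW_VAR_API_KEY_RAWPIXEL`, which is why the template uses that prefix. The sketch below imitates that lookup with the standard library only, so it runs without Airflow installed. `get_variable_from_env` is a hypothetical stand-in and is not Airflow's own API.

```python
import os
from typing import Optional

# Airflow's naming convention for Variables supplied through the environment.
AIRFLOW_VAR_PREFIX = "AIRFLOW_VAR_"


def get_variable_from_env(key: str, default: Optional[str] = None) -> Optional[str]:
    """Hypothetical stand-in: look up AIRFLOW_VAR_<KEY> in the environment."""
    return os.environ.get(f"{AIRFLOW_VAR_PREFIX}{key.upper()}", default)


# Mirror the placeholder value from env.template for demonstration purposes.
os.environ["AIRFLOW_VAR_API_KEY_RAWPIXEL"] = "not_set"
print(get_variable_from_env("API_KEY_RAWPIXEL"))
```

In a DAG the equivalent lookup would normally go through Airflow's `Variable.get("API_KEY_RAWPIXEL")`, with the real key replacing the `not_set` placeholder in a local `.env` file.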
3 changes: 3 additions & 0 deletions openverse_catalog/dags/common/licenses/__init__.py
@@ -4,3 +4,6 @@
get_license_info_from_license_pair,
is_valid_license_info,
)


NO_LICENSE_FOUND = LicenseInfo(None, None, None, None)
@@ -31,7 +31,7 @@
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.task_group import TaskGroup
from common.constants import POSTGRES_CONN_ID
from common.licenses import LicenseInfo, get_license_info
from common.licenses import NO_LICENSE_FOUND, get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester

@@ -87,7 +87,7 @@ def get_record_data(self, data):
return None
license_url = data.get("license_url")
license_info = get_license_info(license_url=license_url)
if license_info == LicenseInfo(None, None, None, None):
if license_info == NO_LICENSE_FOUND:
Review comment (Contributor): Great fix! It was really cumbersome before.

return None
record_data = {k: data[k] for k in data.keys() if k != "license_url"}
record_data["license_info"] = license_info
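Editor's note: the change above replaces an inline `LicenseInfo(None, None, None, None)` comparison with the named constant `NO_LICENSE_FOUND`. The sketch below shows the pattern in isolation. The `LicenseInfo` field names and the toy `get_license_info` resolver are assumptions made so the example is self-contained; only the four-field shape, the constant, and the body of `get_record_data` follow the diff.

```python
from typing import NamedTuple, Optional


class LicenseInfo(NamedTuple):
    # Stand-in for common.licenses.LicenseInfo; the field names are assumed.
    license: Optional[str]
    version: Optional[str]
    url: Optional[str]
    raw_url: Optional[str]


# Named sentinel, as added to common/licenses/__init__.py in this PR.
NO_LICENSE_FOUND = LicenseInfo(None, None, None, None)

CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def get_license_info(license_url: Optional[str]) -> LicenseInfo:
    """Toy resolver (hypothetical): recognises one CC0 URL and nothing else."""
    if license_url == CC0_URL:
        return LicenseInfo("cc0", "1.0", CC0_URL, license_url)
    return NO_LICENSE_FOUND


def get_record_data(data: dict) -> Optional[dict]:
    """Skip records with no resolvable license, mirroring the hunk above."""
    license_info = get_license_info(license_url=data.get("license_url"))
    if license_info == NO_LICENSE_FOUND:
        return None
    record_data = {k: data[k] for k in data.keys() if k != "license_url"}
    record_data["license_info"] = license_info
    return record_data


print(get_record_data({"title": "A", "license_url": CC0_URL}))
print(get_record_data({"title": "B", "license_url": "https://example.com/terms"}))
```

Because named tuples compare by value, the constant matches any all-`None` result, so each call site can test against one shared name instead of rebuilding the empty tuple.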