Openverse Catalog

This repository contains the methods used to identify over 1.4 billion Creative Commons licensed works. The challenge is that these works are dispersed throughout the web and identifying them requires a combination of techniques.

Two approaches are currently in use:

Web crawl data
Application Programming Interfaces (API Data)

Web Crawl Data

The Common Crawl Foundation provides an open repository of petabyte-scale web crawl data. A new dataset is published at the end of each month comprising over 200 TiB of uncompressed data.

The data is available in three file formats:

WARC (Web ARChive): the entire raw data, including HTTP response metadata, WARC metadata, etc.
WET: extracted plaintext from each webpage.
WAT: extracted HTML metadata, e.g. HTTP headers and hyperlinks.

For more information about these formats, please see the Common Crawl documentation.

Openverse Catalog uses the AWS Data Pipeline service to automatically create an Amazon EMR cluster of 100 c4.8xlarge instances that parses the WAT archives to identify all domains that link to creativecommons.org. Due to the volume of data, Apache Spark is used to streamline the processing. The output of this methodology is a series of Parquet files that contain:

the domain and its respective content path and query string (i.e. the exact webpage that links to creativecommons.org),
the CC referenced hyperlink (which may indicate a license),
HTML metadata in JSON format, which indicates the number of images on each webpage and the other domains that it references,
the location of the webpage in the WARC file, so that the page contents can be found.

The steps above are performed in ExtractCCLinks.py.
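As a rough illustration of the link-extraction step, the following is a minimal pure-Python sketch rather than the code in ExtractCCLinks.py. It assumes a WAT record has already been decoded into a dictionary with the usual `Envelope` layout; the function name, the output column names, and the sample record are all hypothetical, and the WARC location column is left out.

```python
import json
from typing import Dict, List
from urllib.parse import urlsplit

CC_HOST = "creativecommons.org"


def extract_cc_links(wat_record: Dict) -> List[Dict]:
    """Return one row per hyperlink in the record that points at creativecommons.org."""
    envelope = wat_record.get("Envelope", {})
    # Assumed WAT layout: the page URL lives in the WARC header metadata.
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = html_meta.get("Links", [])
    page = urlsplit(page_url)

    # Count image links so each row can carry a small JSON metadata summary.
    image_count = sum(1 for link in links if link.get("path", "").startswith("IMG@"))

    rows = []
    for link in links:
        target = link.get("url", "")
        host = urlsplit(target).hostname or ""
        # Match creativecommons.org itself and any of its subdomains.
        if host == CC_HOST or host.endswith("." + CC_HOST):
            rows.append(
                {
                    "provider_domain": page.hostname,  # hypothetical column name
                    "content_path": page.path,
                    "content_query_string": page.query,
                    "cc_link": target,
                    "html_meta": json.dumps({"image_count": image_count}),
                }
            )
    return rows


# Hypothetical sample record, trimmed to the fields the sketch reads.
sample_record = {
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "https://example.com/photos/1?size=large"
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "https://creativecommons.org/licenses/by/4.0/"},
                        {"path": "IMG@/src", "url": "/img/1.jpg"},
                        {"path": "A@/href", "url": "https://example.org/about"},
                    ]
                }
            }
        },
    }
}

rows = extract_cc_links(sample_record)
```

In a Spark job, a function of this shape would typically be applied to every record in the WAT archives and the resulting rows written out as Parquet.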

API Data

Apache Airflow is used to manage the workflow for various API ETL jobs which pull and process data from a number of open APIs on the internet.
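The general shape of such a pull-and-process job can be sketched in plain Python. This is an assumption-laden illustration, not one of the project's provider scripts: the function names, the record fields (`id`, `url`, `license`), and the stub API are hypothetical, and the Airflow scheduling around the job is not shown.

```python
from typing import Callable, Dict, Iterator, List, Optional


def pull_records(
    fetch_page: Callable[[int], List[Dict]], max_pages: int = 100
) -> Iterator[Dict]:
    """Yield raw records page by page until the API returns an empty page."""
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        yield from batch


def process_record(raw: Dict) -> Optional[Dict]:
    """Keep only records that carry a license URL and normalise the field names."""
    license_url = raw.get("license")
    if not license_url:
        return None
    return {
        "foreign_identifier": str(raw["id"]),  # hypothetical output field names
        "url": raw["url"],
        "license_url": license_url,
    }


def run_etl(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Pull every page, process each record, and drop the ones that were rejected."""
    processed = (process_record(raw) for raw in pull_records(fetch_page))
    return [row for row in processed if row is not None]


def fake_fetch(page: int) -> List[Dict]:
    """Stand-in for a real HTTP call to a provider API (hypothetical data)."""
    pages = {
        1: [
            {"id": 1, "url": "https://example.com/a.jpg",
             "license": "https://creativecommons.org/licenses/by/4.0/"},
            {"id": 2, "url": "https://example.com/b.jpg"},  # no license: skipped
        ],
        2: [
            {"id": 3, "url": "https://example.com/c.jpg",
             "license": "https://creativecommons.org/publicdomain/zero/1.0/"},
        ],
    }
    return pages.get(page, [])


results = run_etl(fake_fetch)
```

Passing the page fetcher in as a parameter keeps the processing logic testable without network access, which suits a setup where tests run inside a container.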

API Workflows

To view more information about all the available workflows (DAGs) within the project, see DAGs.md.

See each provider API script's notes in their respective handbook entry.

Development setup for Airflow and API puller scripts

The scripts in the directory openverse_catalog/dags/providers/provider_api_scripts pull data from provider APIs; that data is eventually loaded into a database to be indexed for searching in the Openverse API. These scripts run in a different environment than the PySpark portion of the project, and so have their own dependency requirements.

For instructions geared specifically towards production deployments, see DEPLOYMENT.md.

Requirements

You'll need docker and docker-compose installed on your machine, with versions new enough to use version 3 of Docker Compose .yml files.

You will also need the just command runner installed.

Setup

To set up the local python environment along with the pre-commit hook, run:

python3 -m venv venv
source venv/bin/activate
just install

The containers will be built when starting the stack up for the first time. If you'd like to build them prior to that, run:

just build

Environment

To set up environment variables run:

just dotenv

This will generate a .env file which is used by the containers.

The .env file is split into four sections:

Airflow Settings - these can be used to tweak various Airflow properties
API Keys - set these if you intend to test one of the provider APIs referenced
Connection/Variable info - this will not likely need to be modified for local development, though the values will need to be changed in production
Other config - misc. configuration settings, some of which are useful for local dev

The .env file does not need to be modified if you only want to run the tests.
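If you do plan to test a provider API, it can help to check which keys are still empty. The sketch below parses simple KEY=VALUE lines and lists the unset API keys. It assumes the generated .env uses that plain format; the key names and the `_API_KEY` suffix convention are hypothetical, so adjust them to whatever your .env actually contains.

```python
from typing import Dict, List


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_api_keys(values: Dict[str, str], suffix: str = "_API_KEY") -> List[str]:
    """Return the names of keys ending in `suffix` whose value is empty."""
    return [key for key, value in values.items() if key.endswith(suffix) and not value]


# Hypothetical file contents; the real keys come from `just dotenv`.
SAMPLE_ENV = """
# Airflow Settings
EXAMPLE_AIRFLOW_SETTING=1

# API Keys
EXAMPLE_PROVIDER_API_KEY=
OTHER_PROVIDER_API_KEY="abc123"
"""

parsed = parse_dotenv(SAMPLE_ENV)
missing = unset_api_keys(parsed)
```

To run it against a real file, read the file's text first (for example with `pathlib.Path(".env").read_text()`) and pass it to `parse_dotenv`.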

Running & Testing

There is a docker-compose.yml provided in the openverse_catalog directory, so from that directory, run:

just up

This results, among other things, in the following running containers:

openverse_catalog_webserver_1
openverse_catalog_postgres_1
openverse_catalog_s3_1

and some networking setup so that they can communicate. Note:

openverse_catalog_webserver_1 is running the Apache Airflow daemon, and also has a few development tools (e.g., pytest) installed.
openverse_catalog_postgres_1 is running PostgreSQL, and is set up with some databases and tables to emulate the production environment. It also provides a database for Airflow to store its running state.
The directory containing all module files (including DAGs, dependencies, and other tooling) will be mounted to the directory /usr/local/airflow/openverse_catalog in the container openverse_catalog_webserver_1. In production, only the DAGs folder will be mounted, e.g. /usr/local/airflow/openverse_catalog/dags.

The various services can be accessed using these links:

Airflow: localhost:9090 (The default username and password are both airflow.)
Minio Console: localhost:5011 (The default username and password are test_key and test_secret.)
Postgres: localhost:5434 (using a database connector)
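To confirm that the stack came up, one option is to check that each port accepts TCP connections. The sketch below uses only the standard library and the ports listed above; the function names are hypothetical, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket
from typing import Dict

# Ports taken from the service list above.
SERVICES = {"Airflow": 9090, "Minio Console": 5011, "Postgres": 5434}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(host: str = "localhost") -> Dict[str, bool]:
    """Map each service name to whether its port is currently reachable."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

Run it after `just up`; any service reported as not reachable is a good candidate for a closer look with `just logs`.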

At this stage, you can run the tests via:

just test

# Alternatively, run all tests including longer-running ones
just test --extended

Edits to the source files or tests can be made on your local machine, then tests can be run in the container via the above command to see the effects.

If you'd like, you can log in to the webserver container via:

just shell

If you just need to run an airflow command, you can use the airflow recipe. Arguments passed to airflow must be quoted:

just airflow "config list"

To follow the logs of the running container:

just logs

To begin an interactive pgcli shell on the database container, run:

just db-shell

If you'd like to bring down the containers, run:

just down

To reset the test DB (wiping out all databases, schemata, and tables), run:

just down -v

docker volume prune can also be useful if you've already stopped the running containers, but be warned that it will remove all volumes associated with stopped containers, not just openverse-catalog ones.

To fully recreate everything from the ground up, you can use:

just recreate

Directory Structure

openverse-catalog
├── .github/                                # Templates for GitHub
├── archive/                                # Files related to the previous CommonCrawl parsing implementation
├── docker/                                 # Dockerfiles and supporting files
│   ├── airflow/                            #   - Docker image for Airflow server and workers
│   └── local_postgres/                     #   - Docker image for development Postgres database
├── openverse_catalog/                      # Primary code directory
│   ├── dags/                               # DAGs & DAG support code
│   │   ├── common/                         #   - Shared modules used across DAGs
│   │   ├── commoncrawl/                    #   - DAGs & scripts for commoncrawl parsing
│   │   ├── database/                       #   - DAGs related to database actions (matview refresh, cleaning, etc.)
│   │   ├── maintenance/                    #   - DAGs related to airflow/infrastructure maintenance
│   │   ├── oauth2/                         #   - DAGs & code for Oauth2 key management
│   │   ├── providers/                      #   - DAGs & code for provider ingestion
│   │   │   ├── provider_api_scripts/       #       - API access code specific to providers
│   │   │   └── *.py                        #       - DAG definition files for providers
│   │   └── retired/                        #   - DAGs & code that is no longer needed but might be a useful guide for the future
│   └── templates/                          # Templates for generating new provider code
└── *                                       # Documentation, configuration files, and project requirements

Publishing

The docker image for the catalog (Airflow) is published to ghcr.io/WordPress/openverse-catalog.

Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers and community members on #openverse.

Acknowledgments

Openverse, previously known as CC Search, was conceived and built at Creative Commons. We thank them for their commitment to open source and openly licensed content, with particular thanks to previous team members @ryanmerkley, @janetpkr, @lizadaly, @sebworks, @pa-w, @kgodey, @annatuma, @mathemancer, @aldenstpage, @brenoferreira, and @sclachar, along with their community of volunteers.

License

LICENSE (Expat/MIT License)
