Orchestration Project - Astronomer/Airflow tutorials
This repository contains examples of Apache Airflow DAGs for automating recurring queries. All DAGs run on Astronomer infrastructure installed on Ubuntu 20.04.3 LTS.
Before running the examples, make sure the following prerequisites are set up (version-check commands follow the list):
- Python 3
- Docker 18.09 or higher
- Astronomer CLI
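You can quickly confirm the Python and Docker prerequisites by checking the installed versions:
python3 --version
docker --version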
Astronomer is a managed platform for running and monitoring Apache Airflow environments. The easiest way to initialize and run projects on Astronomer is the Astronomer CLI. To install its latest version on Ubuntu, run:
curl -sSL https://install.astronomer.io | sudo bash
To verify that the Astronomer CLI is installed, run:
astro version
To install the Astronomer CLI on another operating system, please refer to the official documentation.
The project directory has the following file structure:
├── dags # directory containing all DAGs
├── include # additional files which are used in DAGs
├── .astro # project settings
├── Dockerfile # runtime overrides for the Astronomer Docker image (sketch below the tree)
├── packages.txt # specification of OS-level packages
├── plugins # custom or community Airflow plugins
├── setup # additional setup-related scripts/database schemas
└── requirements.txt # specification of Python packages
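The Dockerfile listed above typically contains only a single FROM line that pins the Astronomer-provided Airflow image, optionally followed by runtime overrides. As a rough sketch (the image reference is illustrative; keep whatever your project was initialized with):
FROM quay.io/astronomer/ap-airflow:<airflow-version>-onbuild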
The dags directory contains all of the example DAGs. Most of them are accompanied by a tutorial, and a minimal sketch of the common DAG pattern follows the list:
- table_export_dag.py (tutorial): performs a daily export of table data to a remote filesystem (in our case S3)
- data_retention_delete_dag.py (tutorial): implements a retention policy algorithm that drops expired partitions
- data_retention_reallocate_dag.py (tutorial): implements a retention policy algorithm that reallocates expired partitions from hot nodes to cold nodes
- data_retention_snapshot_dag.py: implements a retention policy algorithm that snapshots expired partitions to a repository
- nyc_taxi_dag.py (tutorial): imports NYC Taxi data from AWS S3 into CrateDB
- financial_data_dag.py (tutorial): downloads financial data of S&P 500 companies and stores it in CrateDB
- data_quality_checks_dag.py: loads incoming data to S3 and then into CrateDB, and checks several data quality properties. In case of failure, it sends a Slack message.
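All of these DAGs follow the same basic pattern: a schedule, a start date, and one or more tasks that run SQL against CrateDB. Below is a minimal sketch of that pattern, not one of the DAGs above; it assumes Airflow 2.x with the Postgres provider installed (CrateDB is reachable over the PostgreSQL wire protocol), a connection named cratedb_connection, and a made-up table name:

```python
# Illustrative sketch only - not part of this repository.
import pendulum
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="example_recurring_query",
    start_date=pendulum.datetime(2021, 11, 11, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A recurring SQL statement executed against CrateDB through the
    # Postgres connection configured in Airflow (here: cratedb_connection).
    refresh_metrics = PostgresOperator(
        task_id="refresh_metrics",
        postgres_conn_id="cratedb_connection",
        sql="REFRESH TABLE doc.metrics;",
    )
```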
To start the project on your local machine run:
astro dev start
To access the Apache Airflow UI, go to http://localhost:8081.
From the Airflow UI you can manage running DAGs, check their status, see the time of the next and last runs, and view other metadata.
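Besides the UI, Airflow CLI commands can be run against the local environment through the Astronomer CLI, for example to list or trigger DAGs (assuming a recent Astronomer CLI and Airflow 2.x; replace <dag_id> with one of the DAG ids defined in the files above):
astro dev run dags list
astro dev run dags trigger <dag_id>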
If your Docker environment has the BuildKit feature enabled, you may run into the following error when starting the Astronomer project:
$ astro dev start
Env file ".env" found. Loading...
buildkit not supported by daemon
Error: command 'docker build -t astronomer-project_dccf4f/airflow:latest failed: failed to execute cmd: exit status 1
To overcome this issue, start Astronomer without the BuildKit feature:
DOCKER_BUILDKIT=0 astro dev start
(see the Astronomer Forum).
Before opening a pull request, please run pylint and black. To install all dependencies, run:
python -m pip install --upgrade -e ".[develop]"
python -m pip install --upgrade -r requirements.txt
Then run pylint and black using:
python -m pylint dags
python -m black .
Pytest is used for automated testing of DAGs. To set up test infrastructure locally, run:
python -m pip install --upgrade -e ".[testing]"
Tests can be run via:
python -m pytest -vvv
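A common pattern for these tests is a DAG integrity check that loads every DAG file and fails on import errors. The sketch below illustrates the idea; the file path tests/test_dag_integrity.py and the assumption that DAGs live in dags/ are illustrative, not necessarily how this repository lays out its tests:

```python
# tests/test_dag_integrity.py - illustrative sketch of a DAG import check
from airflow.models import DagBag


def test_no_import_errors():
    # Parse every DAG file in the project's dags/ directory.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    # Fail if any DAG file raised an exception while being parsed.
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_dags_have_tasks():
    # Every successfully parsed DAG should define at least one task.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"DAG {dag_id} has no tasks"
```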