S.O.N.I.A's ETL

This is S.O.N.I.A's ETL engine used to orchestrate our machine learning jobs with Apache Airflow.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

$ git clone https://github.com/sonia-auv/docker-ros-airflow

Prerequisites

You must create a Docker Hub account.

Then you must have been granted collaborator access on the [email protected] account.

First of all, you must have Docker and Docker Compose installed on your system, using the provided links: Docker installation, Docker-Compose installation.

Once you have completed the Docker and Docker Compose installation, you must log in from your terminal using the following command:

docker login

Installation

Development environment

Environment file

After you have installed Docker and Docker Compose, you must create an environment file. Simply copy .env.template to a new file named .env:

cp .env.template .env

Google Cloud SDK

When you have successfully launched the containers, you must set up your Google Cloud credentials. To complete this step, ask either the captain or the software rep for the required access.

You must execute the following command to initialize your gcloud config:

docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud beta auth application-default login

You will then be asked to select your Google account using a link displayed in the terminal.

Afterward, you will need to enter the verification code in the terminal.

Once that is done, you should be prompted to enter the project name, which should be deep-learning-detection.

You must also set your default region to us-central1.
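If the login flow does not prompt you for the project, it should also be possible to set it explicitly with gcloud config (project name taken from above):

docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set project deep-learning-detection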

Then run the following commands:


docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set compute/region us-central1
docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set compute/zone us-central1

Airflow UI User

To create a user to access the Airflow UI through a web browser, you must run the following command:

docker exec -it sonia-auv-airflow_airflow-webserver_1 airflow create_user --role Admin --username USERNAME --email EMAIL --firstname FIRSTNAME --lastname LASTNAME --password PASSWORD

Start Airflow

Once you have your configuration file, run this command in your shell:

./start.sh dev

This will pull the docker-ros-airflow image from the Docker repository and start the containers locally. It will start both the Airflow container and the Postgres container used to store Airflow metadata.

The output of the script should look like this:

#########################################################################

 Generating 'soniaauvets/airflow-ros-tensorflow' image using tag '1.1.3'
1.1.3: Pulling from soniaauvets/airflow-ros-tensorflow
Digest: sha256:778224fdeb5b89a790376084913d272b87a8f24d6352af527e1b472839e7b0dd
Status: Image is up to date for soniaauvets/airflow-ros-tensorflow:1.1.3
#########################################################################

Launching sonia-auv airflow docker containers
Starting sonia-auv-airflow_airflow-postgres_1 ... done
sonia-auv-airflow_airflow-webserver_1 is ... done
#########################################################################

Airflow containers have STARTED
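To confirm that both containers are up, you can list them with docker; the Airflow UI should then be reachable from your browser (typically at http://localhost:8080, depending on the port mapping in the compose file):

docker ps --filter "name=sonia-auv-airflow"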

Production environment

Environment file

After you have installed Docker and Docker Compose, you must create an environment file. Simply copy .env.template to a new file named .env:

cp .env.template .env

Airflow Fernet Key (Database data encryption)

First of all, you must generate a Fernet key to encrypt connection data in the Airflow database.

Here are the steps to generate a Fernet key. First, install the cryptography package:

pip install cryptography

Then execute the following command to generate the Fernet key:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

Replace the AIRFLOW_FERNET_KEY field value in the .env file with the newly generated key.
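As a convenience, and assuming the .env file uses plain KEY=value lines, generation and replacement can be combined into something like:

FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
sed -i "s|^AIRFLOW_FERNET_KEY=.*|AIRFLOW_FERNET_KEY=${FERNET_KEY}|" .env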

Launch Airflow services

./start.sh prod

This will pull the docker-ros-airflow image from the Docker repository and start the containers locally. It will start both the Airflow container and the Postgres container used to store Airflow metadata.

The output of the script should look like this:

#########################################################################

 Generating 'soniaauvets/airflow-ros-tensorflow' image using tag '1.1.3'
1.1.3: Pulling from soniaauvets/airflow-ros-tensorflow
Digest: sha256:778224fdeb5b89a790376084913d272b87a8f24d6352af527e1b472839e7b0dd
Status: Image is up to date for soniaauvets/airflow-ros-tensorflow:1.1.3
#########################################################################

Launching sonia-auv airflow docker containers
Starting sonia-auv-airflow_airflow-postgres_1 ... done
sonia-auv-airflow_airflow-webserver_1 is ... done
#########################################################################

Airflow containers have STARTED

Google Cloud SDK

NOTE: Make sure that the file gcloud_service_account.json exists in docker-ros-airflow/config on the VM.
NOTE: Make sure the webserver Docker container is running.

Then, from the VM, run the following commands:

docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud auth activate-service-account airflow-etl-sonia@deep-learning-detection.iam.gserviceaccount.com --key-file=gcloud_service_account.json
docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set project deep-learning-detection
docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set compute/region us-east1
docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud config set compute/zone us-east1-c
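To check that the service account is now the active account inside the container, list the authenticated accounts:

docker exec -it sonia-auv-airflow_airflow-webserver_1 gcloud auth list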

Airflow UI User

To create a user to access the Airflow UI through a web browser, you must run the following command:

docker exec -it sonia-auv-airflow_airflow-webserver_1 airflow create_user --role Admin --username USERNAME --email EMAIL --firstname FIRSTNAME --lastname LASTNAME --password PASSWORD

Airflow Variables

To import Airflow variables from a saved JSON file, you must run the following command:

docker exec -it sonia-auv-airflow_airflow-webserver_1 airflow variables --import variables.json

The variables file is added to the Docker image during build, and it is located in the config directory. It can be modified if new variables are added to the Airflow instance. There is an Airflow command to extract it; see airflow variables.
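Assuming the same Airflow 1.10-style CLI as the import command above, the current variables can be exported back to a JSON file with:

docker exec -it sonia-auv-airflow_airflow-webserver_1 airflow variables --export variables.json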

Airflow connections

You must create connections in the Airflow UI to be able to launch all of our pipelines (a CLI sketch follows the list below).

  • Slack connection: Medium article (see the Using Slack Webhook section)

  • Labelbox: delete the existing Labelbox key in the Labelbox settings menu and create a new one using the [email protected] account
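If you prefer the CLI over the UI, a connection can normally also be added with the Airflow 1.10-style connections command; the sketch below uses placeholder values (connection id, host, and token) rather than the project's actual ones:

docker exec -it sonia-auv-airflow_airflow-webserver_1 airflow connections --add --conn_id slack_webhook --conn_type http --conn_host https://hooks.slack.com/services --conn_password YOUR_WEBHOOK_TOKEN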

Usage

Airflow Variables

Our project defines admin variables to change the behavior and configuration of DAGs. In the Admin -> Variables section of Airflow, you can import them from a JSON file. For an example of a variable set, see the variables.json file at the root of the repository.
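The import expects a flat JSON object mapping variable names to their values; the keys below are placeholders, not the project's actual variables:

{
  "example_variable": "example_value",
  "another_variable": "42"
}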

Troubleshooting

Permission denied on the logs directory

This error is caused by your logs directory being owned by root. It produces the following logs (https://pastebin.com/HWpXu83w). To fix it, change the owner of the logs directory to the current user:

chown -R [user]:[user] logs
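For example, to give ownership to the user you are currently logged in as (sudo is likely needed since the directory is owned by root):

sudo chown -R "$(id -un)":"$(id -gn)" logs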

Built With

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details
