
Multi-cloud ETL Pipeline


Objective

  • To run the same ETL code in multiple cloud services based on your preference, thus saving time.
  • To avoid developing and maintaining separate ETL scripts for each environment and cloud.

Note

  • This repository currently supports Azure Databricks + AWS Glue.
  • Azure Databricks can't be configured locally; we can only connect our local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks that points to the URL of the repo & file.
  • For AWS Glue, we set up a local environment using the Glue Docker image or the shell script, then deploy the jobs to AWS Glue using GitHub Actions.
  • The "tasks.txt" file contains the details of the transformations done in the main file.

Prerequisites

  1. Python 3.7 with pip
  2. AWS CLI configured locally
  3. Install Java 8 (a quick way to verify these prerequisites is shown after this list).
    # Make sure to export JAVA_HOME like this:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
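
Optional sanity checks to confirm the above, assuming a macOS/Linux shell:

    # Verify Python, pip, the AWS CLI and Java 8
    python3 --version        # should report 3.7.x
    pip3 --version
    aws configure list       # confirms the AWS CLI has credentials configured
    java -version            # should report 1.8.x
    echo $JAVA_HOME          # should point to your JDK 8 installation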

Quick Start

  1. Clone this repo (for Windows use WSL).

  2. For setting up required libraries and packages locally, run:

    # If default SHELL is zsh use
    make setup-glue-local SOURCE_FILE_PATH=~/.zshrc

    # If default SHELL is bash use
    make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
  3. Source your shell profile using:
    # For zsh
    source ~/.zshrc

    # For bash
    source ~/.bashrc
  4. Install dependencies:
    make install

Change Your Paths

  1. Enter your S3 & ADLS paths in the app/.custom_env file; this file will be used by Databricks.

  2. Similarly, we'll create a .env file in the root folder for local Glue. To create the required file, run:

    make glue-demo-env

This command copies your paths from app/.custom_env into the .env file.

  3. (Optional) If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the .env file only. Note: don't put any sensitive keys in the app/.custom_env file. (Example files are shown after this list.)
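
For illustration, the two files might look like the snippet below. The variable names here are placeholders, not names required by this repo; use whatever names your jobs actually read:

    # app/.custom_env -- non-sensitive paths only (example values)
    S3_SOURCE_PATH=s3://<your-bucket>/raw/
    ADLS_TARGET_PATH=abfss://<container>@<storage-account>.dfs.core.windows.net/processed/

    # .env -- used for local Glue runs; may also hold Kaggle credentials (never commit this file)
    S3_SOURCE_PATH=s3://<your-bucket>/raw/
    ADLS_TARGET_PATH=abfss://<container>@<storage-account>.dfs.core.windows.net/processed/
    KAGGLE_USERNAME=<your-kaggle-username>
    KAGGLE_KEY=<your-kaggle-key>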

Setup Check

Finally, check if everything is working correctly by running:

    gluesparksubmit jobs/demo.py

Ensure "Execution Complete" is printed.

Make New Jobs

Write your jobs in the jobs folder. Refer to the jobs/demo.py file; one complete example is the jobs/main.py file.
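
For example, a new job can start as a copy of the demo and be run locally the same way; my_job.py below is just a placeholder name:

    # Start a new job from the demo and run it with the local Glue environment
    cp jobs/demo.py jobs/my_job.py
    gluesparksubmit jobs/my_job.py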

Deployment

  1. Set up a GitHub Action for AWS Glue. Make sure to add the following secrets to your repository:
    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    S3_BUCKET_NAME
    S3_SCRIPTS_PATH
    AWS_REGION
    AWS_GLUE_ROLE

All the remaining key-value pairs you entered in the .env file must also be passed; make sure to pass them via the automation/deploy_glue_jobs.sh file. (One way to add the repository secrets is shown after this list.)

  2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values:
    kaggle_username
    kaggle_token
    storage_account_name
    datalake_access_key
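
For step 1, one way to add the repository secrets is with the GitHub CLI (assuming gh is installed and authenticated); each command prompts for the secret's value:

    # Add the secrets required by the AWS Glue deployment workflow
    gh secret set AWS_ACCESS_KEY_ID
    gh secret set AWS_SECRET_ACCESS_KEY
    gh secret set S3_BUCKET_NAME
    gh secret set S3_SCRIPTS_PATH
    gh secret set AWS_REGION
    gh secret set AWS_GLUE_ROLE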

Run Tests & Coverage Report

To run the tests and generate a coverage report, run the following commands in the root folder of the project:

    make test

    # To see the coverage report
    make coverage-report

References

Glue Programming libraries

Common Errors

'sparkDriver' failed after 16 retries
