moonshot-engineering

Overview

"Engineering Your Moonshot" is a one-day, hands-on technical event where attendees will learn how to deploy a modern data engineering and analytics solution on AWS using Snowflake, Tableau, and AWS native services. Technical experts will explain key features and use cases, share best practices, and provide technical demos. The Engineering Your Moonshot application is based on the snowflake-on-ecs analytics pipeline framework from Slalom Build.

The target audience is data engineers, cloud architects, and product developers with 3-5 years of experience (300 level). It is best suited for existing cloud builders who want to dive deeper into highly technical, hands-on training and content.

Attendees will learn how to:

  • Leverage AWS infrastructure to build and run an automated Snowflake data pipeline
  • Automate the deployment of your system using AWS CloudFormation and SSM Parameter Store
  • Clean and transform raw data into analytic tables for consumption
  • Implement data warehousing best practices with a focus on least-privilege security
  • Create rich visualizations using Tableau, turning data into valuable insights

Architecture

(Architecture diagram)

Activities

The day's activities are divided into the following sections:

  • Getting Started (30 minutes)
  • Setting up Snowflake (60 minutes)
  • Building the Pipeline in AWS (60 minutes)
  • Running the Pipeline (60 minutes)
  • Visual Analytics with Tableau Desktop (60 minutes)

Getting Started

Download the Code

Download the Engineering Your Moonshot code here and save it to a location on your disk that you can refer to later.

Log into AWS

Log into the AWS account provided for the Engineering Your Moonshot event. You can reach the sign-in console here. Use the account alias provided in the email to log in. Once logged in, you'll retrieve the password to be used with Snowflake in a later step. Be sure to select the Oregon region in the AWS console.

  1. Log into AWS using your username/password.
  2. Change your password to something you can remember.
  3. Navigate to AWS Systems Manager and click 'Parameter Store'. SSM Parameter Store is used to store passwords in an encrypted format.
  4. Retrieve the snowflake_user password you will use during the Snowflake steps. It is stored under /airflow-ecs/SnowflakeCntl. (If you prefer to fetch it from code, see the sketch after this list.)
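
If you'd rather retrieve the parameter from code than click through the console, the sketch below does the same lookup with boto3. It assumes your AWS credentials for the event account are already configured locally and that the parameter is stored as an encrypted SecureString.

    # Minimal sketch: read the snowflake_user password from SSM Parameter Store.
    # Assumes boto3 is installed and credentials for the event account are configured.
    import boto3

    ssm = boto3.client("ssm", region_name="us-west-2")  # Oregon region used for the event

    response = ssm.get_parameter(
        Name="/airflow-ecs/SnowflakeCntl",  # parameter path from step 4 above
        WithDecryption=True,                # decrypt the SecureString value
    )

    print(response["Parameter"]["Value"])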

Setting Up Snowflake

Log into Your Snowflake Account

Log into your Snowflake account with the username and password you created when you set it up.

Set Up the Snowflake Framework

Open the setup_framework.sql script in the Snowflake UI. Let's review the script before we run it and discuss why we're doing this, why it needs elevated privileges to run, and what types of objects it creates in Snowflake.

  1. Copy the contents of the setup_framework.sql script (located in the snowflake directory) and paste it in the Snowflake query editor window.
  2. Highlight the query text and click the blue Run button at the top left of the query editor window to run the script. This will create several objects in the account, including a few new databases and a new user account that we'll use next.
  3. Log out of your account by selecting the drop-down menu at the top right of the window.
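
The workshop runs this script through the web UI, but for reference the same thing can be done from code. The sketch below uses the snowflake-connector-python package; the account identifier, admin user, and password are placeholders, and the role is an assumption based on the script needing elevated privileges.

    # Minimal sketch: run setup_framework.sql outside the web UI with snowflake-connector-python.
    # Account, user, and password are placeholders for your own Snowflake account.
    import snowflake.connector

    with open("snowflake/setup_framework.sql") as f:
        script = f.read()

    conn = snowflake.connector.connect(
        account="ms72259.us-east-1",     # example account ID format (see the AWS section below)
        user="<your admin user>",
        password="<your password>",
        role="ACCOUNTADMIN",             # assumed: the script needs elevated privileges to run
    )

    try:
        # execute_string runs each statement in the script in order
        conn.execute_string(script)
    finally:
        conn.close()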

Deploy the Snowflake Database Objects

Log into Snowflake with the snowflake_user account and the default password specified in the script. Open the deploy_objects.sql script in the Snowflake UI. We'll walk through the script and describe what it is doing and why, then run it in one shot.

  1. Log back in to Snowflake with the following credentials:
    • User Name: snowflake_user
    • Password: __CHANGE__
  2. You will immediately be prompted to change your password. Enter the password you retrieved in Step 4 of the AWS Setup section and click Finish. You are now logged in to the same Snowflake account as a different user.
  3. Copy the contents of the deploy_objects.sql script (located in the snowflake directory) and paste it in the Snowflake query editor window.
  4. Highlight the query text and click the blue Run button at the top left of the query editor window to run the script. This will create the stages, a file format, and the two tables that we'll load data into from S3. (The sketch after this list shows one way to confirm these objects from code.)
  5. We're now done with the initial Snowflake framework setup. Next, we'll provision infrastructure in AWS that will allow us to run Airflow jobs to load these Snowflake tables with data.
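
If you want to double-check the deployment without clicking around the UI, a quick listing like the one below works. It is a sketch using snowflake-connector-python; the database name raw is taken from the Airflow DAG example later in this document, and the account ID is a placeholder.

    # Minimal sketch: list the tables deployed by deploy_objects.sql, connecting as snowflake_user.
    # The database name 'raw' matches the database referenced by the Airflow DAG example below.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="ms72259.us-east-1",                       # example account ID format
        user="snowflake_user",
        password="<password retrieved from SSM Parameter Store>",
    )

    cur = conn.cursor()
    try:
        for row in cur.execute("show tables in database raw"):
            print(row[1])                                  # 'name' is the second column of SHOW TABLES output
    finally:
        cur.close()
        conn.close()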

Building the Pipeline in AWS

Deploy Airflow Running on ECS Fargate

The framework uses Apache Airflow for the workflow engine. ECS Fargate allows us to run any application packaged as a Docker image.

  1. Log into AWS and navigate to CloudFormation.
  2. On the main CloudFormation page, you'll see that a stack called ecs-fargate-network has already been created. This stack contains the core networking infrastructure, including the VPC, subnets, security groups, and ECS cluster, that we'll use as a foundation for our next deployment.
  3. Click the Create Stack button at the top right of the page. In the "Specify template" section, select "Upload a template file". Click "Choose file" and navigate to the location where the project repository is cloned. Select the cloudformation/private-subnet-public-service.yml file. Click Next.
  4. On the next page, you'll see a list of parameters that we need to set. These will be injected into the CloudFormation stack creation and will enable you to connect to Airflow from your computer once the ECS task is up and running. Set the following parameters then click Next:
    • Stack name: ecs-airflow-<your name>
    • AllowWebCidrIp: <your IP address> (Note: this will be the same value for everyone in the room)
    • Snowflake account: this can be derived from the URL of your Snowflake account. For example, if the URL for your account is https://ms72259.us-east-1.snowflakecomputing.com/, then the account ID is ms72259.us-east-1
    • SnowflakeRegion: the region of your Snowflake account (corresponds to an AWS region)
  5. On the next page, leave all of the default options and click Next. Scroll to the bottom of the next page and click Create Stack.
  6. You'll be routed to a page that contains details on your stack. Click on the Events tab to see the progress of the stack creation. This process will take between 10 and 20 minutes, and once it's finished you'll see an event indicating that the creation is complete. (A sketch of creating the same stack from code follows this list.)
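
For reference, the same stack can also be created from code instead of the console. The sketch below uses boto3; the parameter keys are assumptions based on the console walkthrough above, so verify them against the template before running, and add Capabilities=["CAPABILITY_IAM"] if the template creates IAM resources.

    # Minimal sketch: create the Airflow ECS service stack with boto3 instead of the console.
    # Parameter keys are assumptions based on the console steps; check them against the template.
    import boto3

    cfn = boto3.client("cloudformation", region_name="us-west-2")  # Oregon region used for the event

    with open("cloudformation/private-subnet-public-service.yml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="ecs-airflow-yourname",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "AllowWebCidrIp", "ParameterValue": "<your IP address>/32"},
            {"ParameterKey": "SnowflakeAccount", "ParameterValue": "ms72259.us-east-1"},  # example account ID
            {"ParameterKey": "SnowflakeRegion", "ParameterValue": "us-east-1"},
        ],
    )

    # Block until the stack finishes creating (typically 10-20 minutes for this stack)
    cfn.get_waiter("stack_create_complete").wait(StackName="ecs-airflow-yourname")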

Review Foundational Components - Instructor Led

While we are waiting for the Airflow ECS Service stack to launch, let's review some foundational concepts. Now that the Snowflake components are installed, how can we leverage AWS to automatically load data? We'll first run through the foundational AWS components that were deployed prior to the Engineering Your Moonshot event. This includes building a Docker image with our pipeline application, setting up a VPC network in AWS, and pushing our Docker image to Amazon ECR (Elastic Container Registry). We will also walk through the Airflow ECS Service CloudFormation template you just deployed to AWS.

Running the Analytics Pipeline

Run the Pipeline in Airflow Web UI

In this section, we'll open the Airflow web UI served by the ECS task on its public IP at port 8080. We'll run the Raw pipeline to load the Raw tables, then run the Analytics pipeline to load the Analytics tables. The loads will take some time; go back to Snowflake and open the History tab to watch Snowflake running the jobs and loading data. When everything is done, the Airflow UI will report success, and we'll run a quick SQL query (query_analytics.sql) to see the data in the tables.

  1. Now that the stack has been created, navigate to ECS and select the single cluster that is running.
  2. On the cluster details page, select the Tasks tab and find the task with a task definition name that corresponds to the name you gave your CloudFormation stack.
  3. Select the task to view the task details. In the Network section, copy the Public IP address and paste it in your browser address bar. Append :8080 to the IP address and navigate to the page. You should see the Airflow user interface.
  4. In the Airflow UI, you should see two DAGs, snowflake_analytics and snowflake_raw. Toggle the snowflake_raw switch to On and select the Trigger Dag button in the Links section of the DAG row. When prompted to run the DAG, click OK.
  5. The snowflake_raw DAG is now running and loading data into Snowflake. Navigate back to Snowflake and click on the History button. You should see the progress of the queries that are being executed by Airflow.
  6. After a few minutes, the DAG should be complete. Back in Snowflake, run a quick query on the public.airline_raw table to confirm that data was loaded successfully. (A sketch of the same check from code follows this list.)
  7. In Airflow, toggle the snowflake_analytics switch to On and select the Trigger Dag button in the Links section of the DAG row. When prompted to run the DAG, click OK. This DAG will take data loaded into the stage tables and load it into the final destination tables that can be used for analytical queries.
  8. Once the DAG execution is complete, navigate back to Snowflake. Copy the contents of the query_analytics.sql script (located in the snowflake directory) and paste it in the Snowflake query window. Run the query.
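
As noted in step 6, the row-count check can also be done without the Snowflake web UI. Below is a minimal sketch using snowflake-connector-python; the fully qualified table name raw.public.airline_raw is an assumption based on the database used by the DAG and the table named in step 6, and the account ID is a placeholder.

    # Minimal sketch: confirm the snowflake_raw DAG loaded data, without the Snowflake web UI.
    # raw.public.airline_raw is assumed from the DAG's database and the table named in step 6.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="ms72259.us-east-1",                       # example account ID format
        user="snowflake_user",
        password="<password retrieved from SSM Parameter Store>",
        warehouse="load_wh",                               # warehouse used by the DAG tasks
    )

    cur = conn.cursor()
    try:
        cur.execute("select count(*) from raw.public.airline_raw")
        print("rows loaded:", cur.fetchone()[0])
    finally:
        cur.close()
        conn.close()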

Code Walkthrough - SQL and DAGs

The Airflow pipeline will take about 15 minutes to run. We'll use this time to walk through the Snowflake SQL and the Airflow DAGs used to automate it.

  1. Snowflake SQL Example

    • Open the file located at airflow/dags/sql/copy_raw_nyc_taxi.sql
    • This COPY command loads data from S3 into a Snowflake table.
    • Metadata attributes such as filename and file_row_number are captured automatically.
    • We also store the create process name and timestamp.
    copy into nyc_taxi_raw(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance,
        pickup_longitude, pickup_latitude, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid, payment_type,
        fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, src_filename,
        src_file_row_num, create_process, create_ts
    )
    from (
        select t.$1,t.$2,t.$3,t.$4,t.$5,t.$6,t.$7,t.$8,t.$9,t.$10,t.$11,t.$12,t.$13,t.$14,t.$15,t.$16,t.$17,t.$18,t.$19,
            metadata$filename,
            metadata$file_row_number,
            'Airflow snowflake_raw Dag',
            convert_timezone('UTC', current_timestamp)::timestamp_ntz
        from @quickstart/nyc-taxi-data/ t
    )
    file_format = (format_name = csv);
  2. Airflow DAG Example

    • Open the file airflow/dags/snowflake_raw.py
    • Airflow DAGs are written in Python
    • This DAG file constructs a pipeline for loading datasets from S3 into Snowflake
    • A sequential workflow is shown here for demonstration purposes. We can also run tasks in parallel (a sketch of a parallel variant follows the example below).
    from datetime import timedelta
    
    import airflow
    from airflow import DAG
    from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
    
    # These args will get passed on to each operator
    # You can override them on a per-task basis during operator initialization
    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': airflow.utils.dates.days_ago(1),
        'email': ['[email protected]'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }
    
    dag = DAG(
        'snowflake_raw',
        default_args=default_args,
        description='Snowflake raw pipeline',
        schedule_interval='0 */6 * * *',
    )
    
    t1 = SnowflakeOperator(
        task_id='copy_raw_airline',
        sql='sql/copy_raw_airline.sql',
        snowflake_conn_id='snowflake_default',
        warehouse='load_wh',
        database='raw',
        autocommit=True,
        dag=dag)
    
    t2 = SnowflakeOperator(
        task_id='copy_raw_nyc_taxi',
        sql='sql/copy_raw_nyc_taxi.sql',
        snowflake_conn_id='snowflake_default',
        warehouse='load_wh',
        database='raw',
        autocommit=True,
        dag=dag)
    
    t1 >> t2
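
As noted above, the DAG wires t1 and t2 sequentially for demonstration purposes. The sketch below shows one way the same two COPY tasks could run in parallel instead; it reuses the dag, t1, and t2 objects defined in the example above and is an illustration, not code from the repository.

    # Minimal sketch: run the two COPY tasks in parallel instead of sequentially.
    # Reuses dag, t1, and t2 from the snowflake_raw.py example above.
    from airflow.operators.dummy_operator import DummyOperator

    # A no-op starting point; both COPY tasks fan out from it and run concurrently,
    # subject to the executor's available task slots.
    start = DummyOperator(task_id='start', dag=dag)

    # Replaces the sequential dependency `t1 >> t2`
    start >> [t1, t2]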

Visual Analytics with Tableau Desktop

Click here to download the Tableau Desktop instructions, then follow the steps outlined in the document.
