"Engineering Your Moonshot" is a one-day, hands-on technical event where attendees will learn how to deploy a modern data engineering and analytics solution on AWS using Snowflake, Tableau, and AWS native services. Technical experts will explain key features and use cases, share best practices, and provide technical demos. The Engineering Your Moonshot application is based on the snowflake-on-ecs analytics pipeline framework from Slalom Build.
The target audience is data engineers, cloud architects, and product developers with 3-5 years of experience (300 level). It is best suited for existing cloud builders who want to dive deeper into highly technical, hands-on training and content.
Attendees will learn how to:
- Leverage AWS infrastructure to build and run an automated Snowflake data pipeline
- Automate the deployment of your system using AWS CloudFormation and SSM Parameter Store
- Clean and transform raw data into analytic tables for consumption
- Implement data warehousing best practices with a focus on least-privilege security
- Create rich visualizations using Tableau, turning data into valuable insights
The day's activities are divided into the following sections:
- Getting Started (30 minutes)
- Setting up Snowflake (60 minutes)
- Building the Pipeline in AWS (60 minutes)
- Running the Pipeline (60 minutes)
- Visual Analytics with Tableau Desktop (60 minutes)
Download the Engineering Your Moonshot code here and save it to a location on your disk that you can refer to later.
Log into the AWS account provided for the Engineering Your Moonshot event. You can reach the sign-in console here. Use the account alias provided in the email to log in. Once logged in, you'll retrieve the password to be used with Snowflake in a later step. Be sure to select the Oregon (us-west-2) region in the AWS console.
- Log into AWS using your username/password
- Change your password to something you can remember
- Navigate to AWS Systems Manager and click 'Parameter Store'. SSM Parameter Store is used to store passwords in an encrypted format.
- Retrieve the `snowflake_user` password you will use during the Snowflake steps. It will be stored under `/airflow-ecs/SnowflakeCntl`.
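If you prefer to fetch the value from a terminal rather than the console, the minimal boto3 sketch below retrieves the same parameter. It assumes your AWS credentials for the event account are configured locally and that the parameter lives in the Oregon (us-west-2) region.

```python
import boto3

# Connect to SSM Parameter Store in the Oregon region used for this event.
ssm = boto3.client("ssm", region_name="us-west-2")

# Fetch the snowflake_user password; WithDecryption decrypts the SecureString value.
response = ssm.get_parameter(
    Name="/airflow-ecs/SnowflakeCntl",
    WithDecryption=True,
)
print(response["Parameter"]["Value"])
```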
Log into your Snowflake account with the username and password you created when you set it up.
Open the `setup_framework.sql` script in the Snowflake UI. Let's review the script before we run it, and discuss topics such as why we're doing this, why it needs elevated privileges to run, and what types of objects it creates in Snowflake.
- Copy the contents of the `setup_framework.sql` script (located in the `snowflake` directory) and paste it into the Snowflake query editor window.
- Highlight the query text and click the blue Run button at the top left of the query editor window to run the script. This will create several objects, including a few new databases and a new user account that we'll use next. (If you prefer to run the script from a terminal, see the sketch after this list.)
- Log out of your account by selecting the drop down menu on the top right of the window.
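As referenced above, the following minimal sketch runs the same script outside the web UI using the snowflake-connector-python package. It assumes the package is installed; the account identifier, user name, and password are placeholders you replace with your own values.

```python
import snowflake.connector

# Read the full setup script from the cloned repository.
with open("snowflake/setup_framework.sql") as f:
    setup_sql = f.read()

# Connect with the administrative credentials you created for your Snowflake account.
# The account identifier, user, and password below are placeholders.
conn = snowflake.connector.connect(
    account="<your_account_id>",  # e.g. ms72259.us-east-1
    user="<your_admin_user>",
    password="<your_admin_password>",
)

try:
    # execute_string runs each statement in the script, in order.
    for cursor in conn.execute_string(setup_sql):
        print(cursor.sfqid, cursor.rowcount)
finally:
    conn.close()
```

The same pattern also works for `deploy_objects.sql` in the next section.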
Log into Snowflake with the `snowflake_user` account and the default password specified in the script.
Open the `deploy_objects.sql` script in the Snowflake UI. We'll walk through the script and describe what it is doing and why, then run it in one shot.
- Log back in to Snowflake with the following credentials:
  - User Name: `snowflake_user`
  - Password: `__CHANGE__`
- You will immediately be prompted to change your password. Enter the password you retrieved in Step 4 of the AWS Setup section and click Finish. You are now logged in to the same Snowflake account as a different user.
- Copy the contents of the `deploy_objects.sql` script (located in the `snowflake` directory) and paste it into the Snowflake query editor window.
- Highlight the query text and click the blue Run button at the top left of the query editor window to run the script. This will create two stage tables, a file format, and two tables that we'll be loading data from S3 into (you can confirm these with the sketch after this list).
- We're now done with the initial Snowflake framework setup. Next, we'll be provisioning infrastructure in AWS that will allow us to run Airflow jobs to load these Snowflake tables with data.
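To confirm what `deploy_objects.sql` created, you can run SHOW commands in the Snowflake UI or from a script, as in the minimal sketch below. The `raw` database and `public` schema names are assumptions based on this framework's DAG settings, so adjust them if your script uses different names.

```python
import snowflake.connector

# Connect as snowflake_user with the new password you just set.
conn = snowflake.connector.connect(
    account="<your_account_id>",
    user="snowflake_user",
    password="<your_new_password>",
)

cur = conn.cursor()
try:
    # The raw database and public schema are assumptions for this framework.
    for stmt in (
        "show tables in schema raw.public",
        "show stages in schema raw.public",
        "show file formats in schema raw.public",
    ):
        cur.execute(stmt)
        print(stmt)
        for row in cur.fetchall():
            print("  ", row[1])  # the object name is the second column of SHOW output
finally:
    cur.close()
    conn.close()
```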
The framework uses Apache Airflow as the workflow engine. ECS Fargate allows us to run any application packaged as a Docker image without managing servers.
- Log into AWS and navigate to CloudFormation.
- On the main CloudFormation page, you'll see that a stack called `ecs-fargate-network` has already been created. This stack contains the core networking infrastructure, including the VPC, subnets, security groups, and ECS cluster, that we'll use as a foundation for our next deployment.
- Click the Create Stack button at the top right of the page. In the "Specify template" section, select "Upload a template file". Click "Choose file" and navigate to the location where the project repository is cloned. Select the `cloudformation/private-subnet-public-service.yml` file. Click Next.
- On the next page, you'll see a list of parameters that we need to set. These will be injected into the CloudFormation stack creation and will enable you to connect to Airflow from your computer once the ECS task is up and running (a scripted equivalent is sketched after this list). Set the following parameters, then click Next:
  - Stack name: `ecs-airflow-<your name>`
  - AllowWebCidrIp: `<your IP address>` (Note: this will be the same value for everyone in the room)
  - Snowflake account: this can be derived from the URL of your Snowflake account. For example, if the URL for your account is `https://ms72259.us-east-1.snowflakecomputing.com/`, then the account ID is `ms72259.us-east-1`.
  - SnowflakeRegion: the region of your Snowflake account (corresponds to an AWS region)
- On the next page, leave all of the default options and click Next. Scroll to the bottom of the next page and click Create Stack.
- You'll be routed to a page that contains details on your stack. Click on the Events tab to see the progress of the stack creation. This process will take between 10 and 20 minutes. Once it's complete, you'll see an event indicating that the creation is complete.
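The console steps above can also be scripted, roughly as sketched below with boto3. The parameter keys (`AllowWebCidrIp`, `SnowflakeAccount`, `SnowflakeRegion`) and the CIDR format are assumptions based on the parameter names shown in the console, so check them against the template's Parameters section before relying on this.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

# Read the service template from the cloned repository.
with open("cloudformation/private-subnet-public-service.yml") as f:
    template_body = f.read()

# Parameter keys below are assumptions; confirm them against the template.
# If the template creates IAM resources, add Capabilities=["CAPABILITY_IAM"].
cfn.create_stack(
    StackName="ecs-airflow-<your-name>",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "AllowWebCidrIp", "ParameterValue": "<your IP address>/32"},
        {"ParameterKey": "SnowflakeAccount", "ParameterValue": "<your account id>"},
        {"ParameterKey": "SnowflakeRegion", "ParameterValue": "<your Snowflake region>"},
    ],
)

# Block until the stack finishes creating (typically 10-20 minutes).
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="ecs-airflow-<your-name>")
```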
While we are waiting for the Airflow ECS Service stack to launch, let's review some foundational concepts. Now that the Snowflake components are installed, how can we leverage AWS to automatically load data? We'll first run through the foundational AWS components that were deployed prior to the Engineering Your Moonshot event. This includes building a Docker image with our pipeline application, setting up a VPC network in AWS, and pushing our Docker image to Amazon ECR (Elastic Container Registry). We will also walk through the Airflow ECS Service CloudFormation template you just deployed to AWS.
In this section, we'll connect to Airflow running as an ECS task over its public IP on port 8080. We'll run the Raw pipeline to load the raw tables, then run the Analytics pipeline to load the analytics tables. The loads will take some time; go back to Snowflake and open the History tab to watch Snowflake running the jobs and loading data. When it's done, the Airflow UI will report success. Finally, we'll run a quick SQL query (`query_analytics.sql`) to see the data in the tables.
- Now that the stack has been created, navigate to ECS and select the single cluster that is running.
- On the cluster details page, select the Tasks tab and find the task with a task definition name that corresponds to the name you gave your CloudFormation stack.
- Select the task to view the task details. In the Network section, copy the Public IP address and paste it in your browser address bar. Append `:8080` to the IP address and navigate to the page. You should see the Airflow user interface.
- In the Airflow UI, you should see two DAGs, `snowflake_analytics` and `snowflake_raw`. Toggle the `snowflake_raw` switch to On and select the Trigger Dag button in the Links section of the DAG row. When prompted to run the DAG, click OK.
- The `snowflake_raw` DAG is now running and loading data into Snowflake. Navigate back to Snowflake and click on the History button. You should see the progress of the queries that are being executed by Airflow.
- After a few minutes, the DAG should be complete. Back in Snowflake, run a quick query on the `public.airline_raw` table to confirm that data was loaded successfully (a scripted version of this check is sketched after this list).
- In Airflow, toggle the `snowflake_analytics` switch to On and select the Trigger Dag button in the Links section of the DAG row. When prompted to run the DAG, click OK. This DAG will take the data loaded into the stage tables and load it into the final destination tables that can be used for analytical queries.
- Once the DAG execution is complete, navigate back to Snowflake. Copy the contents of the `query_analytics.sql` script (located in the `snowflake` directory) and paste it into the Snowflake query window. Run the query.
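If you'd rather verify the raw load from a script than from the Snowflake UI, here is a minimal sketch using snowflake-connector-python. The warehouse and database names mirror the DAG settings shown later in this walkthrough and are assumptions for this query; `query_analytics.sql` can be run the same way with `execute_string`, as in the earlier sketch.

```python
import snowflake.connector

# Connect as snowflake_user; warehouse and database names mirror the DAG settings.
conn = snowflake.connector.connect(
    account="<your_account_id>",
    user="snowflake_user",
    password="<your_new_password>",
    warehouse="load_wh",
    database="raw",
)

cur = conn.cursor()
try:
    # Quick check that the raw load succeeded.
    cur.execute("select count(*) from public.airline_raw")
    print("airline_raw rows:", cur.fetchone()[0])
finally:
    cur.close()
    conn.close()
```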
The Airflow pipeline will take about 15 minutes to run. We'll use this time to walk through the Snowflake SQL and the Airflow DAGs used to automate it.
- Snowflake SQL Example
  - Open the file located at `airflow/dags/sql/copy_raw_nyc_taxi.sql`
  - This COPY command loads data from S3 into a Snowflake table.
  - Metadata attributes such as `filename` and `file_row_number` are captured automatically.
  - We also store the create process name and timestamp.
```sql
copy into nyc_taxi_raw (
    vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
    trip_distance, pickup_longitude, pickup_latitude, ratecodeid,
    store_and_fwd_flag, pulocationid, dolocationid, payment_type, fare_amount,
    extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge,
    total_amount, src_filename, src_file_row_num, create_process, create_ts
)
from (
    select
        t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10,
        t.$11, t.$12, t.$13, t.$14, t.$15, t.$16, t.$17, t.$18, t.$19,
        metadata$filename,
        metadata$file_row_number,
        'Airflow snowflake_raw Dag',
        convert_timezone('UTC', current_timestamp)::timestamp_ntz
    from @quickstart/nyc-taxi-data/ t
)
file_format = (format_name = csv);
```
- Airflow DAG Example
  - Open the file `airflow/dags/snowflake_raw.py`
  - Airflow DAGs are written in Python
  - This DAG file constructs a pipeline for loading datasets from S3 into Snowflake
  - A sequential workflow is shown here for demonstration purposes. We can also run tasks in parallel (see the sketch after this code block).
```python
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'snowflake_raw',
    default_args=default_args,
    description='Snowflake raw pipeline',
    schedule_interval='0 */6 * * *',
)

t1 = SnowflakeOperator(
    task_id='copy_raw_airline',
    sql='sql/copy_raw_airline.sql',
    snowflake_conn_id='snowflake_default',
    warehouse='load_wh',
    database='raw',
    autocommit=True,
    dag=dag)

t2 = SnowflakeOperator(
    task_id='copy_raw_nyc_taxi',
    sql='sql/copy_raw_nyc_taxi.sql',
    snowflake_conn_id='snowflake_default',
    warehouse='load_wh',
    database='raw',
    autocommit=True,
    dag=dag)

t1 >> t2
```
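As noted above, the tasks here run sequentially only for demonstration. If the loads were independent, they could fan out in parallel. The hypothetical sketch below uses `DummyOperator` placeholders just to illustrate the dependency syntax; it is not part of the framework's code.

```python
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical DAG illustrating parallel branches; task names are placeholders.
dag = DAG(
    'parallel_example',
    default_args={'owner': 'airflow', 'start_date': airflow.utils.dates.days_ago(1)},
    schedule_interval=None,
)

start = DummyOperator(task_id='start', dag=dag)
load_airline = DummyOperator(task_id='load_airline', dag=dag)
load_nyc_taxi = DummyOperator(task_id='load_nyc_taxi', dag=dag)
finish = DummyOperator(task_id='finish', dag=dag)

# Both loads run concurrently after start; finish waits for both branches.
start >> [load_airline, load_nyc_taxi]
[load_airline, load_nyc_taxi] >> finish
```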
Please click here to download the Tableau Desktop instructions, then follow the steps outlined in the document.