Welcome to the Capstone project! Over the next 1.5 days you will apply everything you've learned so far in an integrated exercise. You will:
- Read, transform and load weather data from S3 to Snowflake through PySpark
- Take a stab at running your application on AWS through Docker, AWS Batch and Airflow
- Run your pipeline on our in-house tool Conveyor to experience a more complete solution
We set up a Gitpod environment containing all the tools required to complete this exercise (awscli, python, vscode, ...). You can access this environment by clicking the button below:
NOTE: When you push to your own remote make sure to change the Gitpod URL to reflect your account in this README!
This is an Ubuntu-based environment pre-installed with:
- VSCode
- A Python 3 virtual environment; we recommend you always work inside it
- The AWS cli
Before you commence data crunching, set up your `.aws/credentials` file. This is required to access AWS services through the API. Run `aws configure` in the terminal and enter the correct information:

```
AWS Access Key ID [None]: [YOUR_ACCESS_KEY]
AWS Secret Access Key [None]: [YOUR_SECRET_KEY]
Default region name [None]: eu-west-1
Default output format [None]: json
```
IMPORTANT: Create a new branch and periodically push your work to the remote. After 30 minutes of inactivity this environment shuts down and you will lose unsaved progress. As stated before, change the Gitpod URL to reflect your remote.
Our team recently ingested Belgian weather data from the openaq API and stored it on AWS S3 under `s3://dataminded-academy-capstone-resources/raw/open_aq/`.
You are tasked with building a PySpark application that reads this weather data, transforms it and stores it in a Snowflake table for further analysis.
Required transformations:
- Flatten all nested columns
- Some timestamp columns are stored as strings; cast them to timestamps
You can access the weather data through your AWS credentials. While working locally, you'll need to pass these to your Spark application. In addition, your Spark application requires the following JAR to interact with AWS S3: `org.apache.hadoop:hadoop-aws:3.1.2`.
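A minimal sketch of what this extract-and-transform step could look like. The S3 path and the `hadoop-aws` package come from this README; the timestamp column names (`date_utc`, `date_local`) and the flattening approach (selecting the leaf fields of struct columns under underscore-joined aliases) are assumptions you should adapt to the actual data:

```python
def flatten_fields(schema, prefix=""):
    """Walk a Spark schema in its dict form (df.schema.jsonValue()) and
    yield (column_path, flat_alias) pairs for every leaf field.
    Only struct nesting is flattened; arrays are kept as-is."""
    for field in schema["fields"]:
        path = f"{prefix}{field['name']}"
        dtype = field["type"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            yield from flatten_fields(dtype, prefix=f"{path}.")
        else:
            yield path, path.replace(".", "_")


def extract_and_transform():
    # Local imports: only needed when actually running the Spark job.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.1.2")
        .getOrCreate()
    )
    # hadoop-aws picks up credentials from the usual places (environment
    # variables, ~/.aws) when its default provider chain is configured.
    df = spark.read.json("s3a://dataminded-academy-capstone-resources/raw/open_aq/")

    # Flatten all nested (struct) columns.
    cols = [F.col(path).alias(alias)
            for path, alias in flatten_fields(df.schema.jsonValue())]
    flat = df.select(cols)

    # Cast string timestamp columns; the exact names are assumptions.
    for ts_col in ("date_utc", "date_local"):
        if ts_col in flat.columns:
            flat = flat.withColumn(ts_col, F.col(ts_col).cast("timestamp"))
    return flat
```

The helper works on the plain-dict form of the schema, so you can unit-test the flattening logic without a Spark session.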
After you successfully extracted and transformed the data, you can move on to loading said data to Snowflake.
To this end, you'll require Snowflake credentials stored in AWS Secrets Manager under the secret `snowflake/capstone/login`.
Write Python code that retrieves this secret to leverage in the next step.
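A sketch of the retrieval, assuming boto3; the structure of the parsed payload is whatever is actually stored under `snowflake/capstone/login`:

```python
import json


def parse_secret(secret_string: str) -> dict:
    """Secrets Manager returns the payload as a JSON string; parse it."""
    return json.loads(secret_string)


def get_snowflake_secret(secret_name: str = "snowflake/capstone/login",
                         region: str = "eu-west-1") -> dict:
    # Local import so parse_secret stays usable without boto3 installed.
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_secret(response["SecretString"])
```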
The final step requires you to load the transformed data to a Snowflake table using the login details obtained in the previous step. Your Spark application will also require additional JARs:
- `net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1`
- `net.snowflake:snowflake-jdbc:3.13.3`
Scan the Snowflake docs to figure out how to load data through Spark. All required configuration is stored in the AWS secret except the following:
- Snowflake schema: On Snowflake, we created an individual schema for every participant. Write to your schema. The schema name is based on your username, so log in to Snowflake to retrieve it.
- Table name: Feel free to use any table name you want
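Once you have the secret, the write itself can be sketched as below. The option names (`sfURL`, `sfUser`, ...) follow the Snowflake Spark connector documentation, but the key names inside the secret (`URL`, `USER_NAME`, ...) are assumptions; adapt them to what `snowflake/capstone/login` actually contains:

```python
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"


def build_sf_options(secret: dict, schema: str, table: str) -> dict:
    """Map the secret payload onto the options the Snowflake Spark
    connector expects. The secret's key names are assumptions."""
    return {
        "sfURL": secret["URL"],
        "sfUser": secret["USER_NAME"],
        "sfPassword": secret["PASSWORD"],
        "sfDatabase": secret["DATABASE"],
        "sfWarehouse": secret["WAREHOUSE"],
        "sfSchema": schema,
        "dbtable": table,
    }


def write_to_snowflake(df, secret: dict, schema: str, table: str) -> None:
    # Requires the spark-snowflake and snowflake-jdbc JARs on the classpath.
    (
        df.write
        .format(SNOWFLAKE_SOURCE)
        .options(**build_sf_options(secret, schema, table))
        .mode("overwrite")
        .save()
    )
```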
When everything's tied together and functional, you can proceed to the next task.
As a data engineer, you might run your application locally during development (as you did during the first task), but you should always run your applications on a stable, scalable environment, with scheduling in place and ideally deployed through CI/CD.
During the final step of this Capstone you will run your application on AWS Batch scheduled through Airflow (MWAA). This requires you to execute the following steps:
- Containerize your application through Docker
- Create an ECR repository and push your image
- Create a batch job definition which runs your application (and test it for good measure)
- Create and upload a DAG that triggers a Batch job using your job definition to a pre-created MWAA environment
- ??
- Profit
IMPORTANT NOTES:
- While using the AWS console, make sure you are in the `eu-west-1` region
- Most resources you create will require a name that starts with your AWS username
- Where applicable, you will need to tag every resource you create with `environment: academy-capstone-winter-2023`
Create a `Dockerfile` that packages your application. You can start from one of our Data Minded images, which pre-install Spark and its dependencies: put `FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1` at the top of your Dockerfile.
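A minimal `Dockerfile` sketch; the file layout (`requirements.txt`, `src/main.py`) and the `CMD` are assumptions about how you structured your application:

```dockerfile
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1

WORKDIR /app

# Install your Python dependencies first to make better use of layer caching.
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY src/ ./src/

# Entry point is hypothetical: launch however your application expects,
# e.g. plain python if you configure spark.jars.packages in code.
CMD ["python", "src/main.py"]
```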
Through the AWS console, create a private ECR repository. Its name should start with your AWS username. After creation, push your Docker image to this repo.
With your image pushed to the repository, navigate to AWS Batch and create a new Job Definition. Apply the following configuration:
- Name: Make sure it starts with your AWS username
- Platform type: EC2
- Execution role: `academy-capstone-winter-2023-batch-job-role`
- Job configuration:
  - Image: the image you pushed during the previous step
  - Command: depends on your image :)
- Job role: `academy-capstone-winter-2023-batch-job-role`
- Tags: `environment: academy-capstone-winter-2023`
After creating the job definition you can run it by submitting a new job. Again, apply the correct naming convention and tags.
You can submit the job to the following queue: `academy-capstone-winter-2023-job-queue`
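Submitting through the console works fine; if you prefer to script it, a boto3 sketch could look like this (the job and job-definition names below are placeholders following the username-prefix convention):

```python
def build_submit_kwargs(username: str,
                        queue: str = "academy-capstone-winter-2023-job-queue") -> dict:
    """Assemble the arguments for batch.submit_job. The job-definition
    name is a placeholder for whatever you created in the previous step."""
    return {
        "jobName": f"{username}-capstone-job",
        "jobQueue": queue,
        "jobDefinition": f"{username}-capstone-job-definition",
        "tags": {"environment": "academy-capstone-winter-2023"},
    }


def submit_job(username: str) -> str:
    import boto3  # local import: only needed when actually submitting

    batch = boto3.client("batch", region_name="eu-west-1")
    response = batch.submit_job(**build_submit_kwargs(username))
    return response["jobId"]
```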
To conclude this part, create a DAG that triggers your AWS Batch job and upload it to an MWAA environment created for you.
You will find your environment by navigating to MWAA in the AWS console under the name `<YOUR_USER_NAME>-mwaa-env`. Upload your DAG to the DAG folder specified in your MWAA environment.
You can access the Airflow Web UI through the link in the console.
After a successful upload, your DAG should be visible in the Airflow UI and can be triggered.
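A minimal DAG sketch, assuming the amazon provider's `BatchOperator` (older provider versions call it `AwsBatchOperator`); the job name and job definition below are placeholders for the ones you created:

```python
JOB_QUEUE = "academy-capstone-winter-2023-job-queue"
JOB_DEFINITION = "jdoe-capstone-job-definition"  # placeholder: use your own
JOB_NAME = "jdoe-capstone-job"                   # placeholder: use your own


def create_dag():
    """Build the DAG. Airflow imports live inside the function so the
    constants above stay importable without Airflow installed locally."""
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.batch import BatchOperator

    with DAG(
        dag_id="capstone_weather_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BatchOperator(
            task_id="run_spark_job",
            job_name=JOB_NAME,
            job_queue=JOB_QUEUE,
            job_definition=JOB_DEFINITION,
            overrides={},
        )
    return dag


# MWAA discovers DAG objects at module level.
try:
    dag = create_dag()
except ImportError:  # Airflow not installed, e.g. outside MWAA
    dag = None
```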
On Friday afternoon you will receive a brief introduction to Conveyor and will be tasked with deploying your code on the Conveyor platform. Keep a couple of changes in mind, as Conveyor runs in a different AWS account:
- The account id is `130966031144`
- The bucket containing the raw files is `dataminded-academy-capstone-resources2`
In case you finished the capstone but want to expand your pipeline, feel free to create an ingest job which fetches air quality data and stores it in S3.
To this end, have a look at the openaq project. They expose a public API which can be called to retrieve air quality metrics filtered on a set of parameters. (Note: We used this API to gather the raw files you transformed and loaded into Snowflake)
You can ingest your data to the following S3 location: `s3://dataminded-academy-capstone-resources/{YOUR_USER_NAME}/ingest/`
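If you attempt this extension, a starting point might look like the sketch below. The openaq v2 `measurements` endpoint and its query parameters are assumptions based on the public openaq API; verify them against the current docs, and note that the API is paginated:

```python
from urllib.parse import urlencode

# Endpoint and parameter names are assumptions; check the openaq docs.
OPENAQ_BASE = "https://api.openaq.org/v2/measurements"


def build_openaq_url(country: str = "BE", limit: int = 1000, **extra) -> str:
    """Build a measurements URL filtered on a set of parameters."""
    params = {"country": country, "limit": limit, **extra}
    return f"{OPENAQ_BASE}?{urlencode(sorted(params.items()))}"


def ingest_to_s3(username: str) -> None:
    """Fetch one page of measurements and store the raw JSON on S3."""
    import json
    import urllib.request

    import boto3  # local imports: only needed when actually ingesting

    with urllib.request.urlopen(build_openaq_url()) as resp:
        payload = json.load(resp)

    boto3.client("s3").put_object(
        Bucket="dataminded-academy-capstone-resources",
        Key=f"{username}/ingest/measurements.json",
        Body=json.dumps(payload).encode("utf-8"),
    )
```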
With the transformed weather data available in Snowflake, log in to Snowflake and navigate to Snowsight. The team wants you to perform a couple of analyses through SQL and build an insightful graph for each of them:
- Show the monthly maximum values of each air quality metric
- Compare the daily max of each metric to its 7-day historic maximum (you will need window functions)
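A possible shape for these queries, assuming a table `WEATHER` with flattened columns `PARAMETER`, `VALUE` and a `DATE_UTC` timestamp; adapt the names to your own schema:

```sql
-- Monthly maximum per air quality metric
SELECT
    parameter,
    DATE_TRUNC('month', date_utc) AS month,
    MAX(value)                    AS monthly_max
FROM weather
GROUP BY parameter, month
ORDER BY parameter, month;

-- Daily max vs. its 7-day historic maximum (window function)
WITH daily AS (
    SELECT
        parameter,
        DATE_TRUNC('day', date_utc) AS day,
        MAX(value)                  AS daily_max
    FROM weather
    GROUP BY parameter, day
)
SELECT
    parameter,
    day,
    daily_max,
    MAX(daily_max) OVER (
        PARTITION BY parameter
        ORDER BY day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_max
FROM daily
ORDER BY parameter, day;
```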
Feel free to tackle this however you want and we are here to help!