# Demonstrating and Building ETL Pipelines in Airflow

This repo demonstrates a use case for an e-commerce business whose platform generates transaction data each time a purchase is made. The functions in the pipeline use this transaction data to answer three business questions:
- Who are our platinum customers? Any user with a total purchase value of 10,000 or more (the threshold applied in the filter below)
- What is the purchase history like for each user? This builds a dataset that can be used for a recommendation engine downstream
- What items are commonly purchased together? This builds a dataset that can be used for Basket Analysis downstream
The code can be found in the `etl_utils.py` file.
Question 1 is implemented with `pd.merge()` to build the combined dataset and `df.groupby().sum()` to compute total purchases per user. Platinum customers are then selected with a filter:

`final_df = df.loc[df['total_purchase_value'] >= 10000]`
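For reference, here is a minimal sketch of that logic. The file and column names (`users.csv`, `transactions.csv`, `user_id`, `purchase_value`) are illustrative assumptions and may not match `etl_utils.py` exactly:

```python
import pandas as pd

# Assumed inputs: one row per user, one row per purchase.
users = pd.read_csv("users.csv")
transactions = pd.read_csv("transactions.csv")

# Combine the two datasets on the shared user key.
df = pd.merge(users, transactions, on="user_id")

# Sum purchase values per user to get lifetime spend.
totals = (
    df.groupby("user_id")["purchase_value"]
      .sum()
      .reset_index(name="total_purchase_value")
)

# Platinum customers: total spend at or above the threshold.
final_df = totals.loc[totals["total_purchase_value"] >= 10000]
```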
Questions 2 and 3 are both implemented with Pandas pivot tables via `pd.pivot_table()`.
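As a rough illustration, the sketch below builds both datasets with `pd.pivot_table()`; the `transaction_id`, `user_id`, `item`, and `quantity` column names are assumptions, not necessarily the ones used in `etl_utils.py`:

```python
import pandas as pd

# Assumed input: one row per (transaction, item) purchase line.
df = pd.read_csv("transactions.csv")

# Question 2: purchase history per user -- a user x item matrix of
# quantities, usable as input to a recommendation engine downstream.
purchase_history = pd.pivot_table(
    df, index="user_id", columns="item",
    values="quantity", aggfunc="sum", fill_value=0,
)

# Question 3: items purchased together -- a transaction x item matrix,
# usable for basket analysis downstream.
basket = pd.pivot_table(
    df, index="transaction_id", columns="item",
    values="quantity", aggfunc="sum", fill_value=0,
)
```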
The sample CSVs generated by the above functions can be found in the `samples/` folder.
I have used the PostgresOperator, which requires a Postgres connection, and a SimpleHttpOperator, which requires an HTTP connection. Both connections are configured in the Airflow Admin UI under the Connections tab.
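For orientation, here is a minimal sketch of how these two operators might be wired into a DAG. The connection IDs (`postgres_default`, `http_default`), the endpoint, and the SQL are placeholders rather than this repo's actual configuration, and the sketch assumes Airflow 2.4+ with the `postgres` and `http` provider packages installed:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="ecommerce_etl",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # trigger manually; the real DAG may use a cron schedule
    catchup=False,
) as dag:
    # Pulls raw transactions over HTTP; uses the HTTP connection
    # configured under Admin -> Connections.
    fetch_transactions = SimpleHttpOperator(
        task_id="fetch_transactions",
        http_conn_id="http_default",
        endpoint="transactions",  # placeholder endpoint
        method="GET",
    )

    # Creates the target table; uses the Postgres connection.
    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="postgres_default",
        sql="""
            CREATE TABLE IF NOT EXISTS platinum_customers (
                user_id TEXT PRIMARY KEY,
                total_purchase_value NUMERIC
            );
        """,
    )

    fetch_transactions >> create_table
```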