The goal of this repo is to quickly set up an ETL process for orders data in JSON format and to enable data analysts to start querying the data once it is loaded into relational tables. Visit data_migration for additional details on the problem statement.
Each file to be processed contains orders data in nested JSON format. process_json_files.py performs the following steps to process the JSON data files and load the results into a PostgreSQL database (a rough sketch of these steps follows the list):
- Extracts the line_items from each order into a separate CSV file, together with the user details
- Creates a separate CSV file for the orders without their nested line_items
- Writes the CSV files into a separate output directory
- Loads the orders CSV file into myapp.orders and the line_items CSV file into myapp.line_items
- Analyzes both the orders and line_items tables after loading all CSV files
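
For orientation, here is a minimal sketch of the flatten-and-load flow described above. It is not the actual implementation in process_json_files.py; the field names (order_id, user, line_items, sku, quantity, price), the one-order-per-line input layout, and the psycopg2 connection details are illustrative assumptions:

```python
# A minimal sketch of the flatten-and-load flow, assuming one JSON order
# per line and illustrative field names; see process_json_files.py for
# the real logic.
import csv
import json
from pathlib import Path

import psycopg2  # assumed driver; the repo may use a different one


def flatten_order_file(json_path: Path, out_dir: Path) -> tuple[Path, Path]:
    """Split one nested-JSON orders file into an orders CSV and a line_items CSV."""
    orders_csv = out_dir / f"{json_path.stem}_orders.csv"
    items_csv = out_dir / f"{json_path.stem}_line_items.csv"
    with open(json_path) as src, \
            open(orders_csv, "w", newline="") as of, \
            open(items_csv, "w", newline="") as lf:
        order_writer = csv.writer(of)
        item_writer = csv.writer(lf)
        for line in src:  # assumption: one JSON order per line
            order = json.loads(line)
            # The order row excludes the nested line_items array.
            order_writer.writerow(
                [order["order_id"], order["user"]["user_id"], order["created_at"]]
            )
            # Each line item is written together with the user details it belongs to.
            for item in order.get("line_items", []):
                item_writer.writerow(
                    [order["order_id"], order["user"]["user_id"],
                     item["sku"], item["quantity"], item["price"]]
                )
    return orders_csv, items_csv


def load_and_analyze(orders_csv: Path, items_csv: Path) -> None:
    """COPY the CSVs into the myapp tables, then refresh planner statistics."""
    with psycopg2.connect("dbname=myapp") as conn, conn.cursor() as cur:
        with open(orders_csv) as f:
            cur.copy_expert("COPY myapp.orders FROM STDIN WITH CSV", f)
        with open(items_csv) as f:
            cur.copy_expert("COPY myapp.line_items FROM STDIN WITH CSV", f)
        # In the real pipeline ANALYZE runs once, after all CSV files are loaded.
        cur.execute("ANALYZE myapp.orders")
        cur.execute("ANALYZE myapp.line_items")
```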
I populated myapp.user_summary using user_summary.sql after the orders and line_items tables were loaded.
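
A minimal sketch of that step, assuming user_summary.sql contains a self-contained statement (e.g. an INSERT ... SELECT aggregation) and that psycopg2 is the driver in use:

```python
# A sketch of applying user_summary.sql after the base tables are loaded;
# the actual aggregation lives in user_summary.sql in this repo.
import psycopg2

with psycopg2.connect("dbname=myapp") as conn, conn.cursor() as cur:
    with open("user_summary.sql") as f:
        cur.execute(f.read())  # e.g. INSERT INTO myapp.user_summary SELECT ...
```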
Make sure that the PostgreSQL database is accessible and the tables have been created before invoking run.sh. If required, run.sh can be scheduled to run daily via a scheduler.
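For example, a crontab entry along the lines of `0 2 * * * /path/to/run.sh` would run the pipeline daily at 02:00; the path and schedule here are placeholders to adapt to your environment.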