About | Extract | Transform | Load | Data Analysis | Limitations | Programming Language / Applications Used | Contributors
This project extracts, transforms, and loads video game sales data from https://www.vgchartz.com. The resulting dataset can later be used to answer questions such as:
- Most popular video game console in each region
- Most popular video game by console, region, and genre
Wrote a Python script in a Jupyter notebook to scrape video game data from www.vgchartz.com using BeautifulSoup and Splinter. The script does the following:
- Obtains the list of genres
- Dynamically builds the URL for each page of game data in each genre
- Uses Splinter, BeautifulSoup, and Pandas web scraping to scrape multiple pages of game data for each genre
- Writes the extracted raw video game data to a CSV file for use in the Transform process
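The URL-building step above could be sketched roughly as follows. This is a minimal illustration only: the `/games/games.php` path, the `genre`, `page`, and `results` query parameters, and both function names are assumptions made for the example, not values taken from the project's notebook. Fetching and parsing the pages with Splinter and BeautifulSoup is not shown.

```python
from urllib.parse import urlencode

# Assumed listing path; the real path used by the notebook is not given in this README.
BASE_URL = "https://www.vgchartz.com/games/games.php"


def build_genre_page_url(genre, page, results_per_page=200):
    """Build the URL for one page of results in one genre.

    The query parameter names here are hypothetical.
    """
    query = urlencode({"genre": genre, "page": page, "results": results_per_page})
    return f"{BASE_URL}?{query}"


def genre_page_urls(genres, pages_per_genre):
    """Return one URL per (genre, page) pair, in genre order."""
    return [
        build_genre_page_url(genre, page)
        for genre in genres
        for page in range(1, pages_per_genre + 1)
    ]


# Example genre names are placeholders; the real list is scraped from the site.
urls = genre_page_urls(["Action", "Sports"], pages_per_genre=3)
```

Each URL in `urls` would then be visited in turn and the page contents handed to the parser.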
Wrote a Python script in a Jupyter notebook to clean up and transform the raw video game data produced by the Extract process. The script does the following:
- Deletes games without sales data
- Removes unwanted columns and renames the remaining columns
- Adds a release year column
- Converts release date values to date format and sales values to float format
- Writes the transformed video game data to a CSV file for use in the Load process
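A rough sketch of these clean-up steps with Pandas is shown below. The raw column names, the `m` suffix on sales values, the date format, and the `transform` function name are all assumptions for illustration, since the README does not list the scraped columns.

```python
import pandas as pd

# Hypothetical raw rows standing in for the scraped CSV.
raw = pd.DataFrame({
    "Game": ["Game A", "Game B", "Game C"],
    "Console": ["PS4", "NS", "PC"],
    "Total Sales": ["1.25m", None, "0.40m"],
    "Release Date": ["2017-11-17", "2017-03-03", "2019-06-21"],
    "Unused": [1, 2, 3],
})


def transform(raw_df):
    """Apply the clean-up steps to a raw games DataFrame (assumed column names)."""
    # Delete games without sales data.
    df = raw_df.dropna(subset=["Total Sales"]).copy()
    # Remove unwanted columns and rename the rest.
    df = df.drop(columns=["Unused"]).rename(columns={
        "Game": "name",
        "Console": "console",
        "Total Sales": "total_sales",
        "Release Date": "release_date",
    })
    # Convert sales strings such as "1.25m" to floats (assumed format).
    df["total_sales"] = df["total_sales"].str.rstrip("m").astype(float)
    # Convert release dates to datetime, then derive the release year.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df["release_year"] = df["release_date"].dt.year
    return df.reset_index(drop=True)


clean = transform(raw)
```

Calling `clean.to_csv(path, index=False)` with a path of your choice would then produce the input file for the Load step.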
Performed the following steps to load the transformed video game data into a PostgreSQL database:
- Designed the database tables in normalized form and generated an ERD
- Created a SQL script to create the tables in PostgreSQL
- Wrote a Python script in a Jupyter notebook to split the transformed video game data into multiple dataframes and load them into the PostgreSQL database tables
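The split-and-load step might look something like the sketch below. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table names, column names, and `make_lookup` helper are hypothetical rather than taken from the project's ERD.

```python
import sqlite3

import pandas as pd

# Hypothetical transformed data; the real columns come from the Transform step.
games = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "genre": ["Action", "Sports", "Action"],
    "console": ["PS4", "PS4", "NS"],
    "global_sales": [1.25, 0.40, 2.10],
})


def make_lookup(df, column, id_name):
    """Build a lookup table with one row per distinct value and a 1-based id."""
    values = df[column].drop_duplicates().tolist()
    return pd.DataFrame({id_name: list(range(1, len(values) + 1)), column: values})


# Split the flat data into normalized dataframes.
genre = make_lookup(games, "genre", "genre_id")
console = make_lookup(games, "console", "console_id")
game = (
    games.merge(genre, on="genre")
    .merge(console, on="console")[["name", "genre_id", "console_id", "global_sales"]]
)

# Load each dataframe into its own table (SQLite here, not PostgreSQL).
conn = sqlite3.connect(":memory:")
for table_name, frame in {"genre": genre, "console": console, "game": game}.items():
    frame.to_sql(table_name, conn, index=False)
```

`DataFrame.to_sql` also accepts a SQLAlchemy engine, so pointing the same loop at a PostgreSQL engine should follow the same pattern, with the tables created beforehand by the SQL script.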
Performed the following analysis after building the complete database. Wrote SQL queries to determine:
- The year in which North America had the highest video game sales
- The year in which global sales were highest
- The year in which total shipped was highest
- The most popular video game console in North America
- The most popular video game console in PAL
- The most popular video game console in Japan
- The most popular video game console in Other regions
- The most popular video game console globally
- The most shipped video game console
- The most popular video game on each console
- The most popular video game in each genre
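As an illustration of the kind of query involved, the sketch below runs two aggregate queries from Python. It assumes a single flat `game_sales` table with made-up rows and uses an in-memory SQLite database in place of PostgreSQL; the project's actual normalized schema and queries are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table and sample rows, for illustration only.
conn.executescript("""
CREATE TABLE game_sales (
    name TEXT,
    console TEXT,
    release_year INTEGER,
    na_sales REAL,
    global_sales REAL
);
INSERT INTO game_sales VALUES
    ('Game A', 'PS4', 2017, 1.0, 2.5),
    ('Game B', 'NS',  2017, 2.0, 3.0),
    ('Game C', 'PS4', 2018, 0.5, 1.0);
""")

# Year with the highest total North America sales.
TOP_NA_YEAR = """
SELECT release_year, SUM(na_sales) AS total
FROM game_sales
GROUP BY release_year
ORDER BY total DESC
LIMIT 1
"""

# Console with the highest total North America sales.
TOP_NA_CONSOLE = """
SELECT console, SUM(na_sales) AS total
FROM game_sales
GROUP BY console
ORDER BY total DESC
LIMIT 1
"""

top_na_year = conn.execute(TOP_NA_YEAR).fetchone()
top_na_console = conn.execute(TOP_NA_CONSOLE).fetchone()
```

The other regional and global questions follow the same group, sum, and sort pattern with a different sales column or grouping column.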
- Only overall sales information for each game, at the region and genre level, was available. The lack of sales history for each game limits analysis of how its sales fluctuated over the years.
- Python
- BeautifulSoup
- Splinter
- Pandas Web Scraping
- Pandas
- SQLAlchemy
- Jupyter Notebook
- PostgreSQL
- Usha Saravanakumar (ushaakumaar)
- Crystal Dalsania (cdalsania)
- Jame'l Brown (JayyMaurice2020)