[1] https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart can try to understand the properties of products and stores which play a key role in increasing sales.
Here are the features and their descriptions.
Several of the features had missing values or values that needed to be corrected.
We impute missing Item_Weight values with the average Item_Weight of the same Item_Identifier. The imputed values look reasonable: the boxplot of the affected outlets now follows the same pattern as the other outlets.
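A minimal sketch of this imputation, assuming the data sits in a pandas DataFrame `df` with the dataset's column names. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the dataset, values are illustrative only.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.92, np.nan, 8.93],
})

# Mean weight per item, broadcast back to every row of that item.
# NaN entries are skipped when the mean is computed.
item_mean_weight = df.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill only the missing entries; known weights are left untouched.
df["Item_Weight"] = df["Item_Weight"].fillna(item_mean_weight)
```

An item whose weight is missing in every row would stay NaN under this approach and would need a separate fallback.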
Missing Outlet_Size values occur in Grocery Stores and in Supermarkets of Type 1, as shown in the image below.
All the non-missing Outlet_Size values in Grocery Stores are 'small', so all the missing Outlet_Size values in Grocery Stores are replaced with 'small'.
The remaining missing Outlet_Size values are replaced with the mode for each Outlet_Type, taken from the pivot table below.
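The mode-based fill can be sketched as follows. This is an illustration under assumptions, not the original notebook code: the toy rows are invented, and a grouped mode stands in for the pivot table.

```python
import numpy as np
import pandas as pd

# Toy data: values are illustrative only.
df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Supermarket Type1",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", np.nan, "Small", "Small", "High", np.nan],
})

# Most frequent size per outlet type; Series.mode() ignores NaN by default.
size_mode = df.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Look up the mode for each row's outlet type and use it where the size is missing.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Type"].map(size_mode))
```

The `iloc[0]` lookup assumes every outlet type has at least one known size; a type with no known sizes would raise an error here.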
The minimum value of Item_Visibility is 0, which cannot be right, as every item must have some visibility.
Since 879 of the 14204 rows are affected, which is a lot, we replace the 0 values with NaN so that they do not distort the mean.
We then impute the missing Item_Visibility values with the mean for each Item_Type in each Outlet_Type, taken from the pivot table below.
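A short sketch of the two visibility steps, assuming a pandas DataFrame `df` with the dataset's column names and using a grouped mean in place of the pivot table. The toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy data: values are illustrative only.
df = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Snack Foods", "Snack Foods"],
    "Outlet_Type": ["Grocery Store"] * 3 + ["Supermarket Type1"] * 2,
    "Item_Visibility": [0.10, 0.0, 0.20, 0.0, 0.04],
})

# Treat 0 as missing so it does not drag the group means down.
df["Item_Visibility"] = df["Item_Visibility"].replace(0, np.nan)

# Mean visibility for each (Item_Type, Outlet_Type) pair, aligned to the rows.
group_mean = (
    df.groupby(["Item_Type", "Outlet_Type"])["Item_Visibility"].transform("mean")
)

# Fill the gaps with the mean of the matching group.
df["Item_Visibility"] = df["Item_Visibility"].fillna(group_mean)
```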
Some Item_Fat_Content categories can be combined: Low Fat, low fat and LF are all Low Fat; reg and Regular are both Regular.
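One way to merge these labels, sketched with a plain replacement mapping; the dictionary name `fat_content_map` is ours.

```python
import pandas as pd

# One row per spelling variant mentioned above.
df = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "low fat", "LF", "reg", "Regular"]})

# Map every spelling variant onto one canonical label.
fat_content_map = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(fat_content_map)
```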
We did the following feature engineering:
- Converted Outlet_Establishment_Year into the age of each establishment (feature Outlet_Age).
- Created broader categories for type of item: Food, Drink and Non-Consumable.
- Changed the Item_Fat_Content of non-consumable items to Non-Edible.
- Made a new category for items that reflects their price: the Item_MRP distribution in the image below clearly shows 4 different price categories, which we define as 'Low', 'Medium', 'High' and 'Very High'.
- The Item_MRP does not change significantly across the stores.
The Item_Outlet_Sales is the number of items sold times the Item_MRP. So we made a new variable with the number of items sold (by dividing the Item_Outlet_Sales by Item_MRP).
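The feature-engineering steps above can be sketched as follows. Several details are assumptions rather than the original code: the reference year 2013 (the year of the sales data), deriving the broad category from the first two characters of Item_Identifier, the column name `Item_MRP_Category`, and the price bin edges, which are placeholders.

```python
import pandas as pd

# Toy data: values are illustrative only.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Item_MRP": [249.81, 48.27, 53.86, 182.10],
    "Outlet_Establishment_Year": [1999, 2009, 1987, 1998],
    "Item_Outlet_Sales": [3735.14, 443.42, 994.71, 732.38],
})

# Outlet age relative to 2013 (assumed reference year).
df["Outlet_Age"] = 2013 - df["Outlet_Establishment_Year"]

# Broad item category from the identifier prefix (assumed convention: FD, DR, NC).
prefix_to_category = {"FD": "Food", "DR": "Drink", "NC": "Non-Consumable"}
df["Item_Type_Category"] = df["Item_Identifier"].str[:2].map(prefix_to_category)

# Fat content does not apply to non-consumables.
is_non_consumable = df["Item_Type_Category"] == "Non-Consumable"
df.loc[is_non_consumable, "Item_Fat_Content"] = "Non-Edible"

# Four price bands; these bin edges are placeholders, not the author's values.
mrp_edges = [0, 70, 135, 200, float("inf")]
mrp_labels = ["Low", "Medium", "High", "Very High"]
df["Item_MRP_Category"] = pd.cut(df["Item_MRP"], bins=mrp_edges, labels=mrp_labels)

# Number of items sold, recovered from revenue and price.
df["Item_Number_Sales"] = df["Item_Outlet_Sales"] / df["Item_MRP"]
```

In practice the bin edges would be read off the gaps visible in the Item_MRP distribution.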
There is a positive correlation between Item_MRP and Item_Outlet_Sales and a weak negative correlation between Item_Outlet_Sales and Item_Visibility.
There is essentially no correlation between Item_MRP and Item_Number_Sales, and a weak negative correlation between Item_Number_Sales and Item_Visibility.
Correlation with | Item_Outlet_Sales | Item_Number_Sales |
---|---|---|
Item_MRP | 0.5675744466569193 | 0.01114352701232483 |
Item_Visibility | -0.14076174687662235 | -0.17440844918045084 |
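Correlations like these can be computed with `Series.corr`, which defaults to the Pearson coefficient. Whether the original analysis used Pearson is an assumption. The toy values are invented, so the printed numbers will not match the figures reported here.

```python
import pandas as pd

# Toy data: values are illustrative only.
df = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 107.8],
    "Item_Visibility": [0.016, 0.019, 0.017, 0.066, 0.130, 0.012],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7, 1076.6],
})
df["Item_Number_Sales"] = df["Item_Outlet_Sales"] / df["Item_MRP"]

# Pearson correlation of each feature with each sales measure.
correlations = {
    (feature, target): df[feature].corr(df[target])
    for feature in ["Item_MRP", "Item_Visibility"]
    for target in ["Item_Outlet_Sales", "Item_Number_Sales"]
}

for (feature, target), r in correlations.items():
    print(f"Correlation between {feature} and {target}: {r:.3f}")
```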
- Numerical and One-Hot Coding of Categorical Variables
- Standardisation of numerical data - More on this later
- Separate train and test datasets
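A sketch of these three preprocessing steps, under stated assumptions: the `source` flag column is hypothetical, the ordering used for the numerical coding of Outlet_Size is ours, and fitting the scaler on the training rows only is a design choice rather than something the text confirms.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Combined frame; "source" is a hypothetical flag marking where each row came from.
df = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 107.8],
    "Outlet_Size": ["Small", "Medium", "High", "Small", "Medium", "Small"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type2",
                    "Grocery Store", "Supermarket Type1", "Supermarket Type1"],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, np.nan, np.nan],
    "source": ["train", "train", "train", "train", "test", "test"],
})

# Numerical coding for an ordered category (the ordering is an assumption).
df["Outlet_Size"] = df["Outlet_Size"].map({"Small": 0, "Medium": 1, "High": 2})

# One-hot coding for an unordered category.
df = pd.get_dummies(df, columns=["Outlet_Type"], dtype=int)

# Split back into train and test using the flag, then drop it.
train = df[df["source"] == "train"].drop(columns="source").copy()
test = df[df["source"] == "test"].drop(columns=["source", "Item_Outlet_Sales"]).copy()

# Standardise numeric columns with statistics learned from the training rows only.
numeric_cols = ["Item_MRP"]
scaler = StandardScaler()
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])
```

Fitting the scaler on the training rows alone keeps test-set statistics out of the model inputs.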
- Average sales - Replace missing values by the average sales for all items. This is how the resulting data looks:
- Average Sales by Item_Type_Category - Replace missing values by the average sales per Item_Type_Category from this pivot table:
This is how the resulting data looks:
- Average Sales by Product_Type_Category in Particular Outlet_Type - Replace missing values by the average sales per Item_Type_Category in each Outlet_Type from this pivot table:
This is how the resulting data looks:
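The three baseline predictors above can be sketched as follows, scored by RMSE on a held-out validation frame. The `train` and `valid` frames and the `rmse` helper are hypothetical, and a grouped mean stands in for the pivot tables.

```python
import numpy as np
import pandas as pd

# Toy frames: values are illustrative only.
train = pd.DataFrame({
    "Item_Type_Category": ["Food", "Food", "Drink", "Drink"] * 2,
    "Outlet_Type": ["Grocery Store", "Supermarket Type1"] * 4,
    "Item_Outlet_Sales": [300.0, 2500.0, 200.0, 1800.0, 500.0, 2700.0, 400.0, 2000.0],
})
valid = pd.DataFrame({
    "Item_Type_Category": ["Food", "Drink"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1"],
    "Item_Outlet_Sales": [450.0, 1850.0],
})

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Baseline 1: the overall average sales for every row.
overall_mean = train["Item_Outlet_Sales"].mean()
pred_overall = np.full(len(valid), overall_mean)

# Baseline 2: the average sales of the row's Item_Type_Category.
mean_by_category = train.groupby("Item_Type_Category")["Item_Outlet_Sales"].mean()
pred_by_category = valid["Item_Type_Category"].map(mean_by_category)

# Baseline 3: the average sales of the row's category within its Outlet_Type.
keys = ["Item_Type_Category", "Outlet_Type"]
mean_by_both = (
    train.groupby(keys)["Item_Outlet_Sales"].mean().rename("prediction").reset_index()
)
pred_by_both = valid.merge(mean_by_both, on=keys, how="left")["prediction"]

rmse_overall = rmse(valid["Item_Outlet_Sales"], pred_overall)
rmse_by_category = rmse(valid["Item_Outlet_Sales"], pred_by_category)
rmse_by_both = rmse(valid["Item_Outlet_Sales"], pred_by_both)
```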
One-hot coding of the categorical variables leaves 56 features in total (numerical and categorical). Using Recursive Feature Elimination (RFE) from the sklearn package, we choose the top 16 predictive features on which to build the rest of the predictive models, which helps to avoid over-fitting. These are the features chosen:
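A minimal sketch of Recursive Feature Elimination on synthetic data. The base estimator (`LinearRegression`) is an assumption, as the text does not say which one was used, and the sketch keeps 3 of 10 features instead of 16 of 56 to stay small.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first three columns influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and drops the weakest feature
# until the requested number of features remains.
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

# Column indices of the features that survived the elimination.
selected = np.flatnonzero(selector.support_)
print("Selected feature indices:", selected.tolist())
```

With a linear estimator, RFE ranks features by coefficient magnitude, so the inputs should be on comparable scales before selection.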
Model | Parameter Values | Validation dataset RMSE | CV score |
---|---|---|---|
Average Sales | - | 1652 | - |
Average Sales by Item_Type_Category | - | 1651 | - |
Average Sales by Product_Type_Category in Particular Outlet_Type | - | 1417 | - |
Regression | - | 1143 | Mean - 1222 (+/- 142.71), Std - 71.35, Min - 1028, Max - 1312 |
Regression Ridge | alpha = 0.001 | 1143 | Mean - 1222 (+/- 143.25), Std - 71.63, Min - 1026, Max - 1313 |
Decision Tree Regressor | max_depth = 10.6, min_samples_leaf = 0.01 | 1103 | Mean - 1180 (+/- 151.02), Std - 75.51, Min - 975.5, Max - 1282 |
Neural Network | layers = 3, nodes/layer = 100 | 1101 | - |
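CV summaries of the kind shown in the table can be produced along these lines. In every row of the table the "+/-" figure is twice the standard deviation, and the sketch follows that convention. The data is synthetic, and the fold count `n_folds` and the scoring choice are assumptions; only `alpha = 0.001` comes from the table.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with unit-variance noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(scale=1.0, size=300)

n_folds = 5  # hypothetical; the number of folds used in the table is not stated
model = Ridge(alpha=0.001)

# scikit-learn reports negated MSE so that higher is better; flip the sign
# and take the square root to get an RMSE per fold.
neg_mse = cross_val_score(model, X, y, cv=n_folds, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

cv_mean = rmse_per_fold.mean()
cv_std = rmse_per_fold.std()
print(
    f"Mean - {cv_mean:.4g} (+/- {2 * cv_std:.4g}), Std - {cv_std:.4g}, "
    f"Min - {rmse_per_fold.min():.4g}, Max - {rmse_per_fold.max():.4g}"
)
```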