
House Prices: Advanced Regression Techniques in Kaggle

author: loveSnowBest

1. A brief introduction to this competition

This competition is a getting-started one. As the title shows, what we need for this competition is a regression model. Here is the official description of the competition:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

2. My solution

import what we need

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

load the data

rawData=pd.read_csv('train.csv')
testData=pd.read_csv('test.csv')

And let's have a look at our data using the head method:

rawData.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

split original data into X,Y

First, we use the drop method to split rawData into X and Y. Since the submission needs an Id column, we save testId before dropping it and put it back at the end.

Y_train=rawData['SalePrice']
X_train=rawData.drop(['SalePrice','Id'],axis=1)

testId=testData['Id']
X_test=testData.drop(['Id'],axis=1)

deal with categorical data

In scikit, we can use DictVectorizer, and in pandas we can just use get_dummies. Here I choose the latter. To use get_dummies we should put X_train and X_test together so both get the same dummy columns (a DictVectorizer sketch follows at the end of this step).

# concat with ordered keys so we can split train and test apart again later
X=pd.concat([X_train,X_test],axis=0,keys=['train','test'],
            ignore_index=False)
X_d=pd.get_dummies(X)

Note that get_dummies already drops the original categorical columns and replaces them with the dummy ones, so the frame should be all numeric at this point. Keeping only the numeric columns is just a safety check:

keep_cols=X_d.select_dtypes(include=['number']).columns
X_d=X_d[keep_cols]

Finally, we get our X_train and X_test back using the keys:

# with list keys the order is deterministic, so no length check is needed
X_train=X_d.loc['train']
X_test=X_d.loc['test']
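For reference, the DictVectorizer route mentioned above would look roughly like this (a sketch only, applied to the raw frames; it expects missing values to be filled first, which we only do in the next step):

from sklearn.feature_extraction import DictVectorizer

# each row becomes a {column: value} dict; string values get one-hot
# encoded as "column=value" features, numeric values pass through
dv=DictVectorizer(sparse=False)
X_train_dv=dv.fit_transform(X_train.to_dict('records'))
X_test_dv=dv.transform(X_test.to_dict('records'))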

deal with missing data

pandas provides us with a convenient way to fill missing data with the average or median. Here we fill the NAs with the average; sometimes the median is used instead to reduce the influence of outliers. Note that the test set is filled with the training-set means so both splits are treated consistently.

X_train=X_train.fillna(X_train.mean())
X_test=X_test.fillna(X_train.mean())  # reuse the training means on the test set
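The same filling can be done inside scikit-learn, which makes reusing the training statistics on the test set automatic. A sketch, assuming a scikit-learn version (0.20+) that ships SimpleImputer:

from sklearn.impute import SimpleImputer

# strategy='median' reduces the influence of outliers; 'mean' matches above
imp=SimpleImputer(strategy='median')
X_train_imp=imp.fit_transform(X_train)
X_test_imp=imp.transform(X_test)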

Use StandardScaler to make the data better suited to your model

There are several ways to scale data in scikit, such as StandardScaler and RobustScaler. Here we choose StandardScaler (a RobustScaler sketch follows after the code).

ss=StandardScaler()
X_scale=ss.fit_transform(X_train)
X_test_scale=ss.transform(X_test)
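If the features contained heavy outliers, RobustScaler would be a drop-in replacement; a sketch, not what is used below:

from sklearn.preprocessing import RobustScaler

rs=RobustScaler()  # centers on the median and scales by the IQR
X_scale_alt=rs.fit_transform(X_train)
X_test_scale_alt=rs.transform(X_test)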

Choose your model

In scikit, we have, emmm, let's see:

  • LinearRegression
  • SVR
  • RandomForestRegressor
  • LassoCV
  • RidgeCV
  • ElasticNetCV
  • GradientBoostingRegressor

Also, you can use XGBoost for this competition. After several attempts with these models, I found that GradientBoostingRegressor has the best performance.
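Those attempts can be reproduced with a quick cross-validation loop; the following is a sketch, where the candidate list and the neg_mean_squared_error scoring are my choices rather than the original runs:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeCV

# compare a few candidates with 5-fold CV; lower RMSE is better
for name,model in [('ridge',RidgeCV()),
                   ('gbr',GradientBoostingRegressor())]:
    scores=cross_val_score(model,X_scale,Y_train,cv=5,
                           scoring='neg_mean_squared_error')
    print(name,(-scores.mean())**0.5)

With the winner settled, fit it on the full training set: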

gbr=GradientBoostingRegressor(n_estimators=3000,learning_rate=0.05, 
                              max_features='sqrt')
gbr.fit(X_scale,Y_train)
predict=np.array(gbr.predict(X_test_scale))

Save our prediction

For lack of Python knowledge, I didn't know how to add the column names when saving to csv, so I add 'Id' and 'SalePrice' to the file manually afterwards (see the pandas alternative after the code).

final=np.hstack((testId.values.reshape(-1,1),predict.reshape(-1,1)))
np.savetxt('new.csv',final,delimiter=',',fmt='%d')
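For reference, pandas can write the header row for you, which avoids editing the file by hand (an equivalent sketch):

submission=pd.DataFrame({'Id':testId,'SalePrice':predict})
submission.to_csv('submission.csv',index=False)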

3. Summary

This is just a simple sample solution for this competition. To get a better score, we need to go deeper into feature engineering and feature selection rather than simply selecting a model and training it. I think this is the most important part and deserves the most focus, since it determines whether you can reach the top of the leaderboard.
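As a concrete starting point for that feature work, the fitted model already ranks the dummified features by importance; a sketch, assuming X_train is still the DataFrame from the dummies step:

# map the model's importance scores back onto the column names
importances=pd.Series(gbr.feature_importances_,index=X_train.columns)
print(importances.sort_values(ascending=False).head(20))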