author: loveSnowBest
This competition is a getting-started one. As the title tells us, what we need for this competition is a regression model. Here is the official description of the competition:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
rawData=pd.read_csv('train.csv')
testData=pd.read_csv('test.csv')
Let's have a look at our data using the head method:
rawData.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
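Before any preprocessing, it can also help to see how much data is missing. This quick check is my addition, not part of the original kernel:

# count missing values per column and show the worst offenders
missing=rawData.isnull().sum().sort_values(ascending=False)
print(missing.head(10))
print(rawData.shape, testData.shape)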
First, we use the drop method to split rawData into X and Y. Since we need the Id column for the submission, we save testId before dropping it and put it back at the end.
Y_train=rawData['SalePrice']
X_train=rawData.drop(['SalePrice','Id'],axis=1)
testId=testData['Id']
X_test=testData.drop(['Id'],axis=1)
In scikit-learn we could use DictVectorizer; in pandas we can simply use get_dummies. Here I choose the latter. To keep the dummy columns consistent between the two sets, we should concatenate X_train and X_test before encoding.
# add keys so we can split the combined frame back into train and test later;
# use a list, not a set, so the key order is deterministic
X=pd.concat([X_train,X_test],axis=0,keys=['first','second'],
            ignore_index=False)
X_d=pd.get_dummies(X)
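For reference, the DictVectorizer route would look roughly like this (a sketch, not the path this kernel takes; missing values would still need to be filled first):

from sklearn.feature_extraction import DictVectorizer
dv=DictVectorizer(sparse=False)
# DictVectorizer one-hot encodes string values and passes numbers through
X_dv=dv.fit_transform(X.to_dict(orient='records'))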
Note that get_dummies replaces the object (categorical) columns with their dummy columns automatically, so everything left should already be numeric. Keeping only the numeric columns is a cheap sanity check:
keep_cols=X_d.select_dtypes(include=['number']).columns
X_d=X_d[keep_cols]
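If you want to verify that nothing non-numeric survived, a quick check (my addition, not in the original):

# should print an empty Index: get_dummies has encoded every object column
print(X_d.select_dtypes(include=['object']).columns)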
Finally, we split X_d back into X_train and X_test. Because we passed the keys as an ordered list, 'first' is guaranteed to be the training part:
# with an ordered key list, 'first' is always the training part
X_train=X_d.loc['first']
X_test=X_d.loc['second']
assert len(X_train)==1460   # sanity check: train.csv has 1460 rows
pandas provides a convenient way to fill missing data with the average or the median. Here we fill the NAs with the column averages. Note that the median is sometimes preferred because it is less influenced by outliers.
X_train=X_train.fillna(X_train.mean())
X_test=X_test.fillna(X_test.mean())
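One caveat: filling each set with its own means imputes train and test inconsistently. A common alternative (a sketch of that option; the kernel itself fills each set separately) is to compute the statistics on the training set only, here with the median:

# fill both sets with training-set medians to reduce outlier influence
train_medians=X_train.median()
X_train=X_train.fillna(train_medians)
X_test=X_test.fillna(train_medians)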
scikit-learn offers several ways to scale data, such as StandardScaler and RobustScaler. Here we choose StandardScaler.
ss=StandardScaler()
X_scale=ss.fit_transform(X_train)
X_test_scale=ss.transform(X_test)
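If outliers are a concern, RobustScaler is a drop-in alternative (a sketch of the option mentioned above, not what this kernel actually uses):

from sklearn.preprocessing import RobustScaler
rs=RobustScaler()   # centers on the median and scales by the IQR
X_scale_robust=rs.fit_transform(X_train)
X_test_scale_robust=rs.transform(X_test)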
In scikit-learn we have, emmm, let's see:
- LinearRegression
- SVR
- RandomForestRegressor
- LassoCV
- RidgeCV
- ElasticNetCV
- GradientBoostingRegressor
You can also use XGBoost for this competition. After several attempts with these models, I found GradientBoostingRegressor has the best performance.
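Those "several attempts" can be made systematic with cross-validation. A minimal sketch of how one might compare candidates (my addition; the choice of models, folds, and scoring is an assumption, not the author's exact procedure):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeCV

# compare candidates with 5-fold CV on RMSE (sklearn negates MSE by convention)
for name,model in [('ridge',RidgeCV()),
                   ('gbr',GradientBoostingRegressor())]:
    scores=cross_val_score(model,X_scale,Y_train,cv=5,
                           scoring='neg_mean_squared_error')
    print(name,np.sqrt(-scores.mean()))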
gbr=GradientBoostingRegressor(n_estimators=3000,learning_rate=0.05,
max_features='sqrt')
gbr.fit(X_scale,Y_train)
predict=np.array(gbr.predict(X_test_scale))
Owing to my limited Python knowledge, I didn't know how to add the column names when saving to CSV, so I add the 'Id' and 'SalePrice' header manually afterwards.
final=np.hstack((testId.values.reshape(-1,1),predict.reshape(-1,1)))
np.savetxt('new.csv',final,delimiter=',',fmt='%d')
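For what it's worth, pandas makes the header problem go away entirely; a sketch of the simpler route (the filename is arbitrary), which also keeps the predicted prices as floats instead of the '%d' integer formatting above:

# pandas writes the column names for us
submission=pd.DataFrame({'Id':testId,'SalePrice':predict})
submission.to_csv('submission.csv',index=False)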
This is just a simple baseline for this competition. To get a better score, we need to go deeper into feature engineering and feature selection rather than simply picking a model and training it. I think this is the most important part and deserves the most focus, since it determines whether you can reach the top of the leaderboard.