#A binary classifier for Stats202
This is for the course taught by Professor Rajan Patel during the 2016 summer session. The following is my report.
[TOC]
##Introduction
The objective here is to train a binary classifier that predicts the boolean property called relevance for all (query, web page) pairs.
We know that none of the other signals is a function of the query or the web page alone, because for any fixed value of query_id or url_id the other signals still vary.
We can see that unlike what it might seem at first
Note that I have converted the integer variables is_homepage and relevance into factors, since treating their values as ordered numbers would be incorrect.
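A minimal sketch of this conversion (assuming the training data has been read into a data frame called `train`; the file name is illustrative):

```r
# Read the training data
train <- read.csv("training.csv")

# is_homepage and relevance are coded as integers but are really
# categorical, so convert them to factors
train$is_homepage <- as.factor(train$is_homepage)
train$relevance   <- as.factor(train$relevance)
```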
Coming to the output variable, I saw that
When I plotted the distributions of the signals, I saw that all of them were highly skewed. The values were non-negative but included zeros, so I used the $\log(1 + x)$ transform.
So for
After log transforming it, I get a beautiful distribution.
Similarly, the other skewed signals become much more symmetric after the log transform.
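As a sketch of the before and after for one such signal (here `sig2` is a stand-in name for any of the skewed signals; the `1 +` offset handles the zeros):

```r
# Compare the raw and log-transformed distributions of a skewed,
# non-negative signal; adding 1 before taking the log handles the zeros
par(mfrow = c(1, 2))
hist(train$sig2, main = "Raw signal", xlab = "sig2")
hist(log(1 + train$sig2), main = "Log-transformed", xlab = "log(1 + sig2)")
```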
Both query_id and url_id are just unique identifiers for queries and web pages respectively, so I remove the first two columns from the training dataset to get the inputs used in the model:
```r
input = train[,3:13]
```
We usually perform variable selection when we have a large number of predictors, because it is hard to draw inference from that many. Because we have only
When we have collinearity (highly correlated predictors), some models fare worse because of the redundant variables. So I used `cor` and found that
In all the models I trained, removing
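A minimal sketch of this check, using `cor` on the numeric signals plus caret's `findCorrelation` (the 0.75 cutoff is illustrative):

```r
library(caret)

# Correlation matrix of the numeric predictors only
# (factor columns such as is_homepage and relevance are excluded)
numeric_input <- input[, sapply(input, is.numeric)]
cor_matrix <- cor(numeric_input)

# Flag predictors whose pairwise correlation exceeds the cutoff
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
names(numeric_input)[high_cor]
```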
I used functions from the [caret][4] library to check for some common problems that datasets can have.
I used `nearZeroVar` to check whether any predictors have only a handful of unique values that occur with very low frequencies. Such predictors can cause problems when I split the training data into sub-samples (for cross-validation). The function detects them neatly using two different metrics (the frequency ratio and the percentage of unique values). Our data does not have this problem.
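A sketch of the check:

```r
library(caret)

# Flags predictors with a very high frequency ratio (most common value
# vs. second most common) or a very low percentage of unique values
nzv <- nearZeroVar(input, saveMetrics = TRUE)
nzv
```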
I checked whether any of the signals are linear combinations of the others using the function `findLinearCombos`, which uses the QR decomposition of the data matrix to enumerate sets of linear combinations (if they exist). For each such combination, one predictor can be removed.
I found that no signals needed to be removed for this reason. This is expected, as this sort of exact linear dependence usually shows up in data such as binary chemical fingerprints.
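A sketch of the check:

```r
library(caret)

# findLinearCombos needs a numeric matrix, so drop the factor columns
numeric_input <- input[, sapply(input, is.numeric)]

# Enumerates linear dependencies via the QR decomposition;
# $remove lists the columns that could be dropped (empty for our data)
combos <- findLinearCombos(as.matrix(numeric_input))
combos$remove
```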
We don't have missing values (`NA`s), so we don't have to do any imputation.
I feel that outliers are not something that should simply be removed. They indicate something about the underlying relationship between the web page and the query, so removing them would oversimplify the model. That might be acceptable if I needed to perform inference, but since my goal is a good test error, all I care about is building a robust binary classifier with good performance.
Different models make different assumptions (at different resolutions) about outliers. Outliers can also arise from incorrect measurements and similar issues.
Outliers are rare (unless the method of populating the dataset is flawed). I used [Tukey's method][6] to identify outliers as the points lying more than 1.5*[IQR][5] above the third quartile or below the first quartile.
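A minimal sketch of the rule, with `tukey_outliers` being a hypothetical helper defined here:

```r
# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  x < q[1] - fence | x > q[2] + fence
}

# Count the flagged points in each numeric signal
numeric_input <- input[, sapply(input, is.numeric)]
sapply(numeric_input, function(x) sum(tukey_outliers(x)))
```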
Some models also require specific preprocessing, which I discuss along with each model in the Data Mining section of this report.
I am using the `train` function from the [caret][4] library to train most models (except kNN, where I use the `class` library). It also allows me to use cross-validation to choose the tuning parameters for each model.
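A minimal sketch of the setup (the 10-fold choice is illustrative; the `method` string is swapped out for each model discussed below):

```r
library(caret)

# 10-fold cross-validation; train() picks each model's tuning
# parameters by resampled performance
ctrl <- trainControl(method = "cv", number = 10)

# Example: a random forest fit under this control
fit <- train(relevance ~ ., data = input,
             method = "rf", trControl = ctrl)
```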
I use
I used
The following is a table of the errors I obtained:
Model | Error from validation set
- | -
$Naive \ Bayes$ | $37.01899%$
$Bagging \ and \ Random \ Forest$ | $34.12%$
$KNN$ | $34.01624%$
$SVM$ | $33.74563%$
$Boosting$ | $33.25422%$
So I started with Naive Bayes, as it's a popular baseline classifier for [similar classification problems][1]. It runs fast (unlike the iterative algorithms we will work with later) and is simple to understand.
I obtained an error rate of
I tried scaling the predictors, but the error did not change. This makes sense: algorithms such as Linear Discriminant Analysis and Naive Bayes are insensitive to feature scaling by design, so scaling manually has no effect.
I attribute the significant classification error to the independence assumption on the predictors, which is highly unlikely to hold.
After I log transformed the signals
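A minimal sketch of the fit using the e1071 implementation (the 80/20 validation split and seed are illustrative):

```r
library(e1071)

# Hold out a validation set
set.seed(1)
idx   <- sample(nrow(input), floor(0.8 * nrow(input)))
valid <- input[-idx, ]

# Fit Naive Bayes on the training portion and compute validation error
nb_fit  <- naiveBayes(relevance ~ ., data = input[idx, ])
nb_pred <- predict(nb_fit, newdata = valid)
mean(nb_pred != valid$relevance)
```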
I used the `randomForest` library with `mtry` equal to all of the predictors (which gives bagging) and `ntree` equal to
When talking about preprocessing, one of the benefits of decision trees is that ordinal input data does not require any significant preprocessing. In fact, the results should be consistent regardless of any scaling or translational normalization, since the trees can choose equivalent splitting points.
That's why, even after the log transform, the error drops only slightly from
With random forests we get similar results, as the number of predictors is so small that choosing a random subset at each split doesn't change things much.
I get the OOB error to be
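A sketch of the fit (the `ntree` value here is illustrative):

```r
library(randomForest)

# mtry = all predictors turns random forest into bagging
p <- ncol(input) - 1  # number of predictors, excluding relevance
bag_fit <- randomForest(relevance ~ ., data = input,
                        mtry = p, ntree = 500)

# Printing the fit reports the out-of-bag (OOB) error directly
bag_fit
```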
The k-Nearest Neighbors algorithm gave me an error of
After scaling, I saw that I was getting more than a
The only tuning parameter for this model is $k$, the number of neighbors.
I used `knn.cv` to perform LOOCV on the training set, which showed me that the optimal value of
When I used the optimal
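A minimal sketch of the LOOCV search (the range of k values is illustrative):

```r
library(class)

# knn works on a numeric matrix, so convert factors to numeric codes
# and scale everything
X <- scale(data.matrix(input[, names(input) != "relevance"]))
y <- input$relevance

# LOOCV error for a range of k; pick the k with the lowest error
cv_error <- sapply(1:20, function(k) mean(knn.cv(X, y, k = k) != y))
which.min(cv_error)
```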
As far as preprocessing is concerned, SVM needs numeric data, so I converted the categorical attributes into numeric form. Manually scaling the data didn't affect my results because the [caret][4] package scales the data internally.
I couldn't conduct extensive tuning by trying various kernels and doing a proper
The best I got was with the Radial Basis Function kernel, where I obtained a validation error rate of
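A sketch of the RBF fit via caret (the fold count and `tuneLength` are illustrative):

```r
library(caret)

# RBF-kernel SVM; kernlab centers and scales the data internally,
# and cross-validation picks the cost/sigma pair
svm_fit <- train(relevance ~ ., data = input,
                 method = "svmRadial",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 3)
svm_fit
```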
So I used boosting with the following tuning parameters, chosen by cross-validation:
$n.trees$ | $interaction.depth$ | $shrinkage$ | $n.minobsinnode$
- | - | - | -
$150$ | $3$ | $0.1$ | $10$
We find that
So I attribute the success of this model to the fact that it reduces bias without increasing the variance of the model.
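Assuming the parameters in the table map to gbm's `n.trees`, `interaction.depth`, `shrinkage` and `n.minobsinnode` (which is how caret names its boosting grid), a sketch of the final fit:

```r
library(caret)

# Boosted trees with the tuning values from the table above
grid <- expand.grid(n.trees = 150, interaction.depth = 3,
                    shrinkage = 0.1, n.minobsinnode = 10)
gbm_fit <- train(relevance ~ ., data = input,
                 method = "gbm",
                 trControl = trainControl(method = "cv"),
                 tuneGrid = grid, verbose = FALSE)
gbm_fit
```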
###Code
I am maintaining the code in my [Stats202 github repository][10].
Written with StackEdit.

[1]: https://mran.microsoft.com/package/e1071/
[2]: https://en.wikipedia.org/wiki/Text_categorization
[3]: https://raw.githubusercontent.com/Aditya8795/Stats202/master/training.csv
[4]: http://topepo.github.io/caret/
[5]: https://en.wikipedia.org/wiki/Interquartile_range
[6]: https://en.wikipedia.org/wiki/Tukey%27s_range_test
[7]: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
[8]: https://en.wikipedia.org/wiki/Pareto_principle
[9]: https://en.wikipedia.org/wiki/Similarity_measure
[10]: https://github.com/Aditya8795/Stats202