The task is to compare the ordinary least squares (OLS) method of regression with two regularization methods: Lasso Regression and Ridge Regression. Principal Component Regression and an L1-penalized gradient boosted tree model (XGBoost) are also used to perform regression.
The Communities and Crime data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The variables describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The data contains 1994 instances, each described by 128 attributes.
The data set has missing values. Of the various techniques for handling missing values, mean imputation is used here: all missing values in the dataset are replaced by the attribute mean. All non-predictive features (county, state, community and communityname) are ignored.
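The preprocessing step can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the project's actual code: the column names follow the UCI attribute list, the values are invented, and the use of "?" as the missing-value marker is an assumption about how the raw file encodes missing entries.

```python
import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
df = pd.DataFrame({
    "state": [1, 2, 3, 4],
    "communityname": ["a", "b", "c", "d"],
    "medIncome": [0.2, "?", 0.6, 0.4],
    "PolicPerPop": ["?", 0.1, 0.3, "?"],
    "ViolentCrimesPerPop": [0.1, 0.5, 0.3, 0.7],
})

# Drop the non-predictive identifier columns that are present.
non_predictive = ["state", "county", "community", "communityname"]
df = df.drop(columns=[c for c in non_predictive if c in df.columns])

# Turn the "?" markers (assumed encoding) into NaN, then into floats.
df = df.apply(pd.to_numeric, errors="coerce")

# Replace each missing value with the mean of its column.
df = df.fillna(df.mean())
```

Mean imputation keeps every row, at the cost of shrinking the variance of heavily imputed columns.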
A correlation matrix is used to measure the pairwise correlation between variables in the dataset. The coefficient of variation (CV), the ratio of the standard deviation to the mean, is used to measure the relative variability of each variable. The higher the coefficient of variation, the greater the dispersion around the mean. Variables with a CV less than 1 are considered low-variance, whereas those with a CV greater than 1 are considered high-variance.
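Both diagnostics take a line or two with pandas. The sketch below uses synthetic columns (the names `a`, `b`, `c` are hypothetical) purely to show the computation and the CV threshold of 1 described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns, for illustration only.
df = pd.DataFrame({
    "a": rng.uniform(0.4, 0.6, 200),        # tightly clustered around its mean
    "b": rng.exponential(0.1, 200) ** 2,    # strongly right-skewed
})
df["c"] = 2 * df["a"] + rng.normal(0, 0.01, 200)  # nearly a multiple of "a"

# Pairwise Pearson correlation between all columns.
corr = df.corr()

# Coefficient of variation: standard deviation divided by mean.
cv = df.std() / df.mean()

# Split variables at the CV = 1 threshold.
high_variance = cv[cv > 1].index.tolist()
low_variance = cv[cv <= 1].index.tolist()
```

The CV is only meaningful for variables with a positive mean, which holds for attributes scaled to a non-negative range.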
The ordinary least squares model is fit on the training set and its error is measured on the test set. This model is then compared with a Ridge Regression model and a Lasso Regression model, where the regularization penalty is chosen by cross-validation. The results show that regularization reduced the Mean Squared Error substantially for this data.
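A minimal sketch of this comparison with scikit-learn is given below. It runs on synthetic data with many features relative to the number of samples, so the numbers it produces say nothing about the crime data; the alpha grid, the split ratio, and the fold count are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 60 features, only the first 5 carry signal.
X = rng.normal(size=(150, 60))
beta = np.zeros(60)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.normal(scale=1.0, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "ols": LinearRegression(),
    # Ridge and Lasso pick their penalty by 5-fold cross-validation.
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5),
    "lasso": LassoCV(cv=5, random_state=0),
}

# Fit each model on the training set and record its test-set MSE.
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
```

After fitting, `models["ridge"].alpha_` and `models["lasso"].alpha_` hold the penalties selected by cross-validation.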
A Principal Component Regression (PCR) model is also fit on the training set, with M (the number of principal components) chosen by cross-validation. PCR finds a low-dimensional representation of the dataset that retains as much of the variation as possible, and then regresses the response on those components. Before PCR is performed, the variables should be centered to have mean zero. Furthermore, the results of PCR also depend on whether the variables have been individually scaled.
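PCR can be sketched as a scikit-learn pipeline that standardizes the variables, projects them onto M principal components, and fits a linear regression on the scores. The data is synthetic, and the candidate range for M and the fold count are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data, for illustration only.
X = rng.normal(size=(100, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

pcr = Pipeline([
    ("scale", StandardScaler()),   # center to mean zero and scale to unit variance
    ("pca", PCA()),                # project onto the leading principal components
    ("ols", LinearRegression()),   # regress the response on the component scores
])

# Choose M, the number of components, by 5-fold cross-validated MSE.
search = GridSearchCV(
    pcr,
    {"pca__n_components": list(range(1, 11))},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_m = search.best_params_["pca__n_components"]
```

Putting the scaler inside the pipeline means the centering and scaling are re-estimated on each training fold, so the held-out fold does not leak into the preprocessing.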
XGBoost is a decision-tree-based ensemble machine learning algorithm that applies the principle of boosting weak learners within a gradient boosting framework. In this task, Lasso (L1) regularization is used to keep the gradient boosting algorithm from overfitting. The regularization penalty is chosen by cross-validation.