Regularization-and-Gradient-Boosting

The task is to compare the ordinary least squares (OLS) method of regression with two regularization methods, Lasso Regression and Ridge Regression. Principal Component Regression and an L1-penalized gradient boosting tree (XGBoost) are also used to perform the regression.

Dataset

The Communities and Crime dataset combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The variables describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The data contains 1,994 instances, each described by 128 attributes.

Data Imputation Techniques

The dataset has missing values. Several techniques are commonly used to handle them:

  • Substitute missing values with a constant (distinct from all other values) that has a meaning in the domain.
  • Substitute missing values with the value of a randomly selected observation.
  • Use statistics such as the mean, median, or mode to fill in missing values.
  • Use predictive models, where each missing value is treated as the output of a model trained on the data points that are not missing.
  • Use iterative methods based on Expectation Maximization to handle missing values.

Here, all missing values in the dataset are replaced by the attribute mean, and all non-predictive features (county, state, community, and communityname) are dropped.
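A minimal sketch of this preprocessing, assuming the raw UCI file `communities.data` with missing values encoded as `?` and the standard UCI column layout (the leading columns are the non-predictive identifiers, the last column is the target):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Missing values appear as "?" in the UCI data file.
df = pd.read_csv("communities.data", header=None, na_values="?")

# Drop the non-predictive identifier columns (state, county, community,
# communityname, and the fold index occupy the first five columns in the
# UCI layout -- an assumption about the file's column order).
X = df.iloc[:, 5:-1]   # predictive attributes
y = df.iloc[:, -1]     # target: per capita violent crimes

# Replace every remaining missing value with its attribute (column) mean.
imputer = SimpleImputer(strategy="mean")
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```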

Exploratory Data Analysis

A correlation matrix is used to measure the association between each pair of variables in the dataset. The Coefficient of Variation (CV), the ratio of a variable's standard deviation to its mean, is used to measure relative variability: the higher the CV, the greater the dispersion around the mean. Variables with a CV below 1 are considered low-variance, whereas those with a CV above 1 are considered high-variance.
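A quick sketch of both diagnostics, assuming `X` is the imputed feature DataFrame from the previous sketch:

```python
# Pairwise Pearson correlations between all attributes.
corr = X.corr()

# Coefficient of Variation: standard deviation relative to the mean.
cv = X.std() / X.mean()

high_variance = cv[cv.abs() > 1].index    # CV above 1: high-variance
low_variance = cv[cv.abs() <= 1].index    # CV at or below 1: low-variance
```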

Linear Model and Regularization

The ordinary least squares model is trained on the training set and its error is measured on the test set. This model is then compared with Ridge Regression and Lasso Regression models, whose regularization penalties are chosen by cross-validation. For this dataset, regularization reduces the test Mean Squared Error substantially.
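A minimal sketch of this comparison with scikit-learn, reusing `X` and `y` from the imputation sketch; the penalty grid, split size, and random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

alphas = np.logspace(-4, 2, 50)  # candidate regularization penalties (assumption)
models = {
    "OLS": LinearRegression(),
    "Ridge": RidgeCV(alphas=alphas, cv=5),             # penalty chosen by 5-fold CV
    "Lasso": LassoCV(alphas=alphas, cv=5, max_iter=10000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.4f}")
```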

Principal Component Regression

A Principal Component Regression (PCR) model is also trained on the training set, with M, the number of principal components, chosen by cross-validation. PCR regresses the response on a low-dimensional representation of the predictors that retains as much of their variation as possible. Before PCR is performed, the variables should be centered to have mean zero, and the results also depend on whether the variables are individually scaled.
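A sketch of PCR as a scikit-learn pipeline, reusing `X_train` and `y_train` from the previous sketch; standardization handles the centering and scaling noted above, and the grid for M is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

pcr = Pipeline([
    ("scale", StandardScaler()),   # center (and scale) each variable
    ("pca", PCA()),                # project onto M principal components
    ("ols", LinearRegression()),   # regress on the components
])

# Choose M, the number of components, by 5-fold cross-validation.
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": range(1, 51)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print("Best M:", search.best_params_["pca__n_components"])
```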

L1-Penalized Gradient Boosting Tree

XGBoost is a decision-tree-based ensemble machine learning algorithm that boosts weak learners within a gradient boosting framework. In this task, Lasso (L1) regularization is used to keep the gradient boosting model from overfitting, with the regularization penalty chosen by cross-validation.
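A sketch using the xgboost scikit-learn wrapper, again reusing `X_train` and `y_train`; the `reg_alpha` grid and the other hyperparameters are assumptions:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# reg_alpha is XGBoost's L1 (Lasso) penalty on the leaf weights.
search = GridSearchCV(
    XGBRegressor(n_estimators=200, learning_rate=0.1,
                 objective="reg:squarederror"),
    param_grid={"reg_alpha": [0, 0.001, 0.01, 0.1, 1, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print("Best L1 penalty:", search.best_params_["reg_alpha"])
```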