A data analysis project for DSCI 522 Data Science Workflows
Authors: Saurav Chowdhury, Evhen Dytyniak, and Reiko Okamoto
Date: 2020-08-24
This analysis attempted to determine the most important features when predicting a yearly salary of more than 50,000 USD. A logistic regression model and an AdaBoost model were trained in an effort to extract feature importance. The models did not perform exceedingly well, with scores in the low 80s, but performed similarly to random forest and support vector machine (SVM) classifiers. The logistic regression model's most important features in predicting a yearly salary of greater than 50,000 USD were `marital_status_Married-AF-spouse` and `marital_status_Married-civ-spouse`, while the most important features in predicting a yearly salary of less than 50,000 USD were `occupation_Priv-house-serv` and `workclass_Without-pay`. The AdaBoost model identified `education_num` and `age` as the most important features in classification.
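As a rough illustration of how feature importance can be read from these two model types, the sketch below fits scikit-learn's `LogisticRegression` and `AdaBoostClassifier` on synthetic data. The data, the three feature names, and the hyperparameters are assumptions made for this example; it is not the project's actual training script.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the census data (an assumption for illustration only).
rng = np.random.RandomState(0)
feature_names = ["age", "education_num", "hours_per_week"]
X = rng.normal(size=(500, 3))
# The label depends strongly on education_num, weakly on age, and not on hours_per_week.
noise = rng.normal(scale=0.5, size=500)
y = (2.0 * X[:, 1] + 0.5 * X[:, 0] + noise > 0).astype(int)

log_reg = LogisticRegression(solver="lbfgs").fit(X, y)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Logistic regression: signed coefficients. A positive value pushes the
# prediction towards class 1, a negative value towards class 0.
coef_by_feature = dict(zip(feature_names, log_reg.coef_[0]))

# AdaBoost: non-negative importances with no direction attached.
importance_by_feature = dict(zip(feature_names, ada.feature_importances_))

top_lr = max(coef_by_feature, key=lambda name: abs(coef_by_feature[name]))
top_ada = max(importance_by_feature, key=importance_by_feature.get)
print("largest |coefficient|:", top_lr)
print("largest AdaBoost importance:", top_ada)
```

The sign of a logistic regression coefficient is what allows features to be split into those associated with the higher income class and those associated with the lower one; AdaBoost importances only rank features.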
The data used in this project was created by Ronny Kohavi and Barry Becker of the Data Mining and Visualization division at Silicon Graphics. It was extracted from the 1994 US Census data and sourced from the UCI Machine Learning Repository (Dua and Graff 2019), and can be found here. Each row in the data represents the attributes of an individual, such as sex, race, age, educational attainment, and working hours. The target variable is whether an individual's income is above or below 50,000 USD.
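Feature names such as `marital_status_Married-civ-spouse` come from one-hot encoding a categorical column. The sketch below shows one way such columns and a binary target could be derived with pandas. The sample rows, the `income` column name, and the `>50K` / `<=50K` labels are assumptions for illustration, and the project's own preprocessing may differ.

```python
import pandas as pd

# Tiny hand-written sample in the spirit of the census data
# (illustrative values, not real records).
sample = pd.DataFrame({
    "age": [39, 50, 28],
    "marital_status": ["Never-married", "Married-civ-spouse", "Married-civ-spouse"],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "income": ["<=50K", "<=50K", ">50K"],  # assumed label strings
})

# Binary target: 1 when income is above 50,000 USD, otherwise 0.
target = (sample["income"] == ">50K").astype(int)

# One-hot encode the categorical columns; each new column is named
# "<column>_<category>". Numeric columns such as age pass through unchanged.
features = pd.get_dummies(
    sample.drop(columns="income"),
    columns=["marital_status", "workclass"],
)
print(sorted(features.columns))
```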
The report can be found here.
To replicate this analysis, clone this repository, install the dependencies, and run the following commands at the command line from the root directory.
Note
- Running `make clean` followed by `make all` will take up to an hour and consume all available processors
- For the purpose of this milestone submission, `make clean_light` will only remove files that do not trigger the time-consuming scripts, but will demonstrate correct use of the Makefile
- If using a Windows OS, it appears that downloading a local copy of the repo and running `make clean_light` followed by `make all` runs all of the scripts in the pipeline (this is not the case for Linux or MacOS)
1. Using Docker
These instructions assume the commands are run in a Unix shell (e.g. Terminal or Git Bash).
To replicate the analysis, install Docker. Then clone this GitHub repository and run the following command at the command line/terminal from the root directory of this project:
Note 1: if running a Linux OS, prepend `sudo` to each command
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ clean
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ all
Alternatively, to reduce runtime for this submission:
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ clean_light
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ all
Note 2: if running Windows and using Git Bash, the commands above may fail. In that case, try the following instead (this workaround is not guaranteed to work):
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ clean"
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ all"
Alternatively, to save time:
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ clean_light"
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ all"
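The Docker commands above share one shape: mount the repository into the container, then run a Makefile target there, either directly or wrapped in `bash -c`. The sketch below assembles that argument list in Python. The helper name `docker_make_command` and its parameters are hypothetical, and the host path is passed as-is rather than with the leading slash used in the shell commands.

```python
from pathlib import Path

IMAGE = "evhend/dsci522incomelevelpredictor"
CONTAINER_DIR = "/home/incomelevelpredictor/"


def docker_make_command(target, host_dir=None, wrap_in_bash=False):
    """Return the argument list that runs one Makefile target in the container.

    Hypothetical helper: `host_dir` defaults to the current working directory,
    and `wrap_in_bash=True` mirrors the `bash -c "..."` variant.
    """
    host_dir = str(host_dir if host_dir is not None else Path.cwd())
    base = ["docker", "run", "--rm", "-v", f"{host_dir}:{CONTAINER_DIR}", IMAGE]
    make_args = ["make", "-C", CONTAINER_DIR, target]
    if wrap_in_bash:
        # The whole make invocation becomes a single string argument to bash.
        return base + ["bash", "-c", " ".join(make_args)]
    return base + make_args


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for target in ("clean_light", "all"):
        print(" ".join(docker_make_command(target)))
```

To execute a command rather than print it, the returned list could be passed to `subprocess.run(cmd, check=True)`.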
2. Without Using Docker
make clean
make all
OR, to reduce runtime for this submission:
make clean_light
make all
- Python 3.7.3 and Python packages:
  - docopt == 0.6.2
  - requests == 2.22.0
  - pandas == 0.25.3
  - numpy == 1.18.1
  - scikit-learn == 0.22.1
  - feather-format == 0.4.0
  - pyarrow == 0.15.1
- R version 3.6.1 and R packages:
  - knitr == 1.27.2
  - feather == 0.3.5
  - tidyverse == 1.3.0
  - docopt == 0.6.1
  - ggthemes == 4.2.0
  - testthat == 2.3.1
  - gridExtra == 2.3
  - rlang == 0.4.4
  - rmarkdown == 2.1
  - kableExtra == 1.1.0
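When running without Docker, it can help to confirm that the installed Python packages match the versions listed above before starting the pipeline. The following is a minimal sketch of such a check. The function names are hypothetical, the R packages are not covered, and on Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
PINNED_PYTHON_PACKAGES = {
    "docopt": "0.6.2",
    "requests": "2.22.0",
    "pandas": "0.25.3",
    "numpy": "1.18.1",
    "scikit-learn": "0.22.1",
    "feather-format": "0.4.0",
    "pyarrow": "0.15.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        import importlib_metadata as metadata  # backport, assumed available on 3.7
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, get_installed_version=installed_version):
    """Return {package: (expected, found)} for every version that differs."""
    mismatches = {}
    for name, expected in pinned.items():
        found = get_installed_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED_PYTHON_PACKAGES).items():
        print(f"{name}: expected {expected}, found {found}")
```

Passing the lookup function as a parameter keeps the comparison logic testable without installing the pinned packages.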
The Income Level Predictor materials are licensed under the MIT License.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. [http://archive.ics.uci.edu/ml]