Income Level Predictor

A data analysis project for DSCI 522 Data Science Workflows

Authors: Saurav Chowdhury, Evhen Dytyniak, and Reiko Okamoto
Date: 2020-08-24

About

This analysis attempted to determine the most important features when predicting whether a yearly salary exceeds 50,000 USD. A logistic regression model and an AdaBoost model were trained in order to extract feature importances. Neither model performed exceptionally well, with scores in the low 80s, but both performed similarly to random forest and support vector machine (SVM) classifiers. In the logistic regression model, the most important features for predicting a yearly salary greater than 50,000 USD were marital_status_Married-AF-spouse and marital_status_Married-civ-spouse, while the most important features for predicting a yearly salary of less than 50,000 USD were occupation_Priv-house-serv and workclass_Without-pay. The AdaBoost model identified education_num and age as the most important features for classification.
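Feature names such as marital_status_Married-civ-spouse are consistent with one-hot encoding of categorical columns, where each indicator is named by joining the column and the category with an underscore. The following is a minimal sketch of that naming scheme only. The helper names are hypothetical, the category list is deliberately incomplete, and the project's actual preprocessing code may differ.

```python
def one_hot_names(column, categories):
    """Return indicator names in the '<column>_<category>' style."""
    return [f"{column}_{category}" for category in categories]


def one_hot_encode(row, column, categories):
    """Encode one categorical value as a dict of 0/1 indicator features."""
    names = one_hot_names(column, categories)
    return {
        name: int(row[column] == category)
        for name, category in zip(names, categories)
    }


# Illustrative subset of marital status categories; not the full list.
MARITAL_STATUS = ["Married-AF-spouse", "Married-civ-spouse", "Never-married"]

# Hypothetical individual used only to demonstrate the encoding.
person = {"marital_status": "Married-civ-spouse"}
encoded = one_hot_encode(person, "marital_status", MARITAL_STATUS)
print(encoded)
```

If the positive class is a salary above 50,000 USD, a positive logistic regression coefficient on one of these indicators raises the predicted probability of that class, and a negative coefficient lowers it.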

The data used in this project was created by Ronny Kohavi and Barry Becker of the Data Mining and Visualization division at Silicon Graphics. It was extracted from the 1994 US Census data, sourced from the UCI Machine Learning Repository (Dua and Graff 2019), and can be found here. Each row represents the attributes of one individual, such as sex, race, age, educational attainment, and working hours. The target variable is whether that individual's income is above or below 50,000 USD.
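To make the row structure concrete, here is a hedged sketch that reads one record in the comma-separated layout of the raw UCI file and derives a boolean target. The underscore spelling of the column names, the high_income key, and the read_rows helper are assumptions made for illustration; the sample record is illustrative, and the project's own scripts may load the data differently.

```python
import csv
import io

# Column order as documented for the UCI Adult data set. The underscore
# spelling is an assumption chosen to match the feature names in this README.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# One illustrative record in the raw file's comma-plus-space layout.
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)


def read_rows(text):
    """Yield one dict per individual, with a boolean target added."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for values in reader:
        if not values:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, values))
        # Raw labels are '>50K' and '<=50K'; a trailing period, if present,
        # is stripped before comparing.
        row["high_income"] = row["income"].rstrip(".") == ">50K"
        yield row


rows = list(read_rows(SAMPLE))
print(rows[0]["age"], rows[0]["education_num"], rows[0]["high_income"])
```

All values are kept as strings here; numeric columns such as age would need converting before modelling.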

Report

The report can be found here.

Usage

To replicate this analysis, clone this repository, install the dependencies, and run the following commands at the command line from the root directory.

Note

- Running make clean followed by make all will take up to an hour and will use all available processors.
- For the purpose of this milestone submission, make clean_light removes only files that do not trigger the time-consuming scripts, while still demonstrating correct use of the Makefile.
- On Windows, downloading a local copy of the repo and running make clean_light followed by make all appears to run every script in the pipeline; this is not the case on Linux or macOS.

1. Using Docker

These instructions assume the commands are run in a Unix shell (e.g. Terminal or Git Bash).

To replicate the analysis, install Docker. Then clone this GitHub repository and run the following commands at the command line/terminal from the root directory of this project:

Note 1: prepend sudo to each command if running on Linux.

docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ clean
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ all

Alternatively, to reduce runtime for this submission:

docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ clean_light
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor make -C /home/incomelevelpredictor/ all

Note 2: If running Windows and using Git Bash, it may be necessary to try the following instead, although this is not guaranteed to work:

docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ clean"
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ all"

Alternatively, to save time:

docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ clean_light"
docker run --rm -v /$(pwd):/home/incomelevelpredictor/ evhend/dsci522incomelevelpredictor bash -c "make -C /home/incomelevelpredictor/ all"
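The Docker commands in this section all share one shape: mount the repository into the container, then run a single make target inside it. The sketch below assembles that command as an argument list so the parts are easy to see. The function name and the /path/to/repo placeholder are hypothetical, and it takes the host directory directly rather than using the /$(pwd) shell expansion.

```python
import os
import shlex

IMAGE = "evhend/dsci522incomelevelpredictor"
CONTAINER_DIR = "/home/incomelevelpredictor/"


def docker_make_command(target, host_dir=None, use_sudo=False):
    """Build the argument list for running one make target in the container."""
    host_dir = host_dir or os.getcwd()
    command = [
        "docker", "run",
        "--rm",                               # remove the container on exit
        "-v", f"{host_dir}:{CONTAINER_DIR}",  # mount the repo into the container
        IMAGE,
        "make", "-C", CONTAINER_DIR, target,  # run make inside the mounted repo
    ]
    return ["sudo"] + command if use_sudo else command


for target in ("clean", "all"):
    # Print only; nothing is executed here.
    args = docker_make_command(target, host_dir="/path/to/repo")
    print(" ".join(shlex.quote(arg) for arg in args))
```

To actually run a target, the list could be passed to subprocess.run(args, check=True) from the root directory of the project.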

2. Without Using Docker

make clean
make all

Alternatively, to reduce runtime for this submission:

make clean_light
make all

Dependency diagram

Dependencies

Python 3.7.3 and Python packages:
- docopt == 0.6.2
- requests == 2.22.0
- pandas == 0.25.3
- numpy == 1.18.1
- scikit-learn == 0.22.1
- feather-format == 0.4.0
- pyarrow == 0.15.1
R version 3.6.1 and R packages:
- knitr == 1.27.2
- feather == 0.3.5
- tidyverse == 1.3.0
- docopt == 0.6.1
- ggthemes == 4.2.0
- testthat == 2.3.1
- gridExtra == 2.3
- rlang == 0.4.4
- rmarkdown == 2.1
- kableExtra == 1.1.0
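When running without Docker, it can help to confirm that the installed Python packages match the pinned versions listed above. The following is a rough sketch under some assumptions: the helper names are hypothetical, the lookup relies on importlib.metadata (Python 3.8+, so on Python 3.7 it reports every package as not installed unless a different lookup is supplied), and the R packages are not checked.

```python
import re

# Pins copied from the Python list above.
PYTHON_PINS = """
- docopt == 0.6.2
- requests == 2.22.0
- pandas == 0.25.3
- numpy == 1.18.1
- scikit-learn == 0.22.1
- feather-format == 0.4.0
- pyarrow == 0.15.1
"""


def parse_pins(text):
    """Turn '- name == version' lines into a {name: version} dict."""
    pattern = re.compile(r"^-\s*(\S+)\s*==\s*(\S+)\s*$")
    pins = {}
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


pins = parse_pins(PYTHON_PINS)
for name, (wanted, found) in find_mismatches(pins).items():
    print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

The lookup parameter exists so the comparison can be exercised without touching the real environment, for example by passing a function backed by a plain dict.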

License

The Income Level Predictor materials are licensed under the MIT License.

References

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml

