Our codebase currently supports Python 3.9. We recommend that you create a separate virtual or Conda environment.
For example,
conda create -n dp-data python=3.9
Install the source files (via setuptools):
conda activate dp-data
pip install --upgrade pip
pip install -e .
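If the installation succeeds, the package should be importable from the new environment. As a quick sanity check (a minimal sketch; the module name dp_data is an assumption, so substitute the package name declared in setup.py):

# Hypothetical check; replace dp_data with the package name from setup.py
python -c "import dp_data; print('install OK')"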
Scripts for preprocessing individual datasets can be found in scripts/. The following script will preprocess the available datasets.
./preprocess_all.sh
Note that the credit dataset must be downloaded manually from here and moved to datasets/raw/credit.csv. Please do this first, and then run the following command separately.
./scripts/credit.sh
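Before running the script, it may help to confirm that the file landed in the expected location (a minimal shell check; the path comes from the instructions above):

# Confirm the manually downloaded file is where scripts/credit.sh expects it
test -f datasets/raw/credit.csv && echo "credit.csv found" || echo "missing: datasets/raw/credit.csv"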
We provide code for training classifiers on our preprocessed datasets. Our scripts by default create 80-20 train-test splits. An example of how to use this code can be found below:
DATASET=adult
MODELS='DecisionTree KNN LogisticRegression LinearSVC RandomForest GradientBoosting XGBoost'
python run/ml_eval.py --dataset $DATASET --train_test_split_dir original --models $MODELS
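To evaluate several preprocessed datasets in one pass, the same command can be wrapped in a shell loop (a sketch using the adult and credit datasets mentioned above; substitute whichever datasets your preprocessing run produced):

# Illustrative loop over preprocessed datasets; adjust the list as needed
MODELS='DecisionTree KNN LogisticRegression LinearSVC RandomForest GradientBoosting XGBoost'
for DATASET in adult credit; do
    python run/ml_eval.py --dataset $DATASET --train_test_split_dir original --models $MODELS
done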
Our scripts generate a large set of datasets from the raw PPMF source data, and the overall preprocessing pipeline may take several hours. We therefore do not include them in preprocess_all.sh. To obtain them, you can instead run the following scripts separately.
./scripts/ppmf/tracts.sh
./scripts/ppmf/blocks.sh
Please also note that the PPMF datasets are constructed with 50-50 train-test splits. The purpose of such splits is to reproduce baseline results for the experiments found in this work. There is no natural classification problem for this data, but you are free to construct your own by specifying a target attribute when running run/ml_eval.py, as sketched below.
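For instance (a sketch, not the script's confirmed interface: the flag name --target, the dataset name ppmf_tracts, and the attribute SEX are all assumptions, so consult run/ml_eval.py's argument parser for the actual names):

# Hypothetical: --target, ppmf_tracts, and SEX are assumed names; check run/ml_eval.py
python run/ml_eval.py --dataset ppmf_tracts --train_test_split_dir original --models LogisticRegression --target SEX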