IncomeSCM

IncomeSCM is a time-series simulator based on the Adult dataset intended for evaluation of causal effect estimators.

It has been used to construct a cross-sectional benchmark data set for conditional average treatment effect (CATE) estimation, IncomeSCM-1.0.CATE. The files for this benchmark are contained in IncomeSCM-1.0.CATE.zip.

Using the CATE estimation data set (IncomeSCM-1.0.CATE)

The IncomeSCM-1.0.CATE data set is sampled from the IncomeSCM-1.0 simulator, fit with the config_v1.yml configuration file.

Data set description

The data set represents 13 variables extracted from the 1994 US Census Bureau database, as well as a hand-designed "studies" variable.

Covariates, $X$

Column	Description	Type
native-country	Native country	Categorical
sex	Sex (as reported in census)	Categorical
race	Race (as reported in census)	Categorical
age	Age	Numeric
education	Education type (e.g., Bachelors)	Categorical
education-num	Education (numeric representation)	Numeric
workclass	Workclass (e.g., private, self-employed)	Categorical
occupation	Occupation (e.g., Tech-support)	Categorical
marital-status	Marital status (e.g., married)	Categorical
relationship	Relationship type (e.g., wife)	Categorical
capital-net	Net capital gains	Numeric
hours-per-week	Number of work hours per week	Numeric
income_prev	Income the previous year (USD)	Numeric
studies_prev	Studies the previous year	Categorical

Intervention, $A$

Column	Description	Type
studies	Studies the current year (e.g., Full-time studies)	Categorical

Outcome, $Y$

Column	Description	Type
income	Income 5 years after the intervention (USD)	Numeric

Task description

The goal is to use observational data to estimate the causal effect on income ($Y$) after intervening on studies with "Full-time studies" ($A \leftarrow 1$), relative to "No studies" ($A \leftarrow 0$), $$\Delta = Y(1) - Y(0),$$ where $Y(a)$ is the potential outcome of intervening with $A \leftarrow a$. In particular, we are interested in the conditional average treatment effect (CATE), $$\mathrm{CATE}(z) = \mathbb{E}[\Delta \mid Z=z],$$ where $Z \subseteq X$ is a given set of covariates. For this, we consider three main conditioning sets:

$Z$ is the set of all pre-intervention covariates
$Z$ is the set of direct causes of $A$
$Z$ is a subset of covariates which is an invalid adjustment set. Specifically, $Z = (\mathrm{age}, \mathrm{education}, \mathrm{income\_prev})$.

In addition, we seek to estimate the average treatment effect (ATE), $$\mathrm{ATE} = \mathbb{E}[\Delta]$$ using the first two conditioning sets above for adjustment.
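To make the definitions concrete, the following sketch computes $\Delta$, a CATE over a single discrete covariate, and the ATE on a toy table of paired potential outcomes. The numbers and the column names `z`, `y1` and `y0` are invented for illustration. In the benchmark, both potential outcomes are only available in the evaluation files, never in the observational training data.

```python
import pandas as pd

# Toy potential outcomes for six subjects (invented numbers, not benchmark data).
toy = pd.DataFrame({
    "z":  ["a", "a", "a", "b", "b", "b"],                           # conditioning covariate Z
    "y1": [52000.0, 48000.0, 50000.0, 30000.0, 34000.0, 32000.0],  # Y(1)
    "y0": [40000.0, 42000.0, 41000.0, 31000.0, 29000.0, 30000.0],  # Y(0)
})

# Individual effect: Delta = Y(1) - Y(0)
toy["delta"] = toy["y1"] - toy["y0"]

# CATE(z) = E[Delta | Z = z], estimated here by a group mean
cate_by_z = toy.groupby("z")["delta"].mean()

# ATE = E[Delta]
ate = toy["delta"].mean()

print(cate_by_z)
print(ate)
```

With continuous or high-dimensional $Z$, group means are no longer practical and a regression model takes their place.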

Note: The intervention variable studies is simulated with 4 values: Full-time studies, No studies, Evening course and Day course. In the tasks above, samples with interventions other than Full-time studies or No studies can either be discarded or used for learning, depending on the estimator.
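For estimators that only support a binary treatment, one option is to drop the other two arms and encode the remaining ones as 0/1. The sketch below does this on a toy table. The label strings are copied from the note above and are an assumption about how the values are stored in the files (check with `df['studies'].unique()`); the column name `a` and the dictionary `ARMS` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the observational data; the label strings are taken from
# the note above and may differ from the exact values stored in the files.
toy = pd.DataFrame({
    "studies": ["No studies", "Full-time studies", "Evening course",
                "Day course", "Full-time studies", "No studies"],
    "income":  [30000.0, 45000.0, 32000.0, 31000.0, 50000.0, 28000.0],
})

ARMS = {"No studies": 0, "Full-time studies": 1}

# Keep only the two arms of interest and add a binary treatment column.
binary = toy[toy["studies"].isin(list(ARMS))].copy()
binary["a"] = binary["studies"].map(ARMS).astype(int)

print(binary["a"].tolist())
```

Discarding rows reduces the training set, so estimators that can model all four arms may prefer to keep them.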

Evaluation

We measure the quality of estimates by the $R^2$, MSE and RMSE for CATE, and by the absolute error (AE) for ATE. Due to the complexity of the simulator, the CATE and ATE are not known in closed form. Instead, we sample both counterfactual outcomes for a fixed set of baseline subjects and compare their outcomes to each other.
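The metrics named above can be computed as in the following sketch. The function `cate_metrics` is hypothetical and the inputs are invented numbers; the exact definitions used in the repository's evaluation script may differ, for example in how the reference effects are averaged.

```python
import numpy as np

def cate_metrics(cate_true, cate_pred):
    """Return R^2, MSE and RMSE of CATE predictions, plus the AE of the implied ATE."""
    cate_true = np.asarray(cate_true, dtype=float)
    cate_pred = np.asarray(cate_pred, dtype=float)
    mse = np.mean((cate_pred - cate_true) ** 2)
    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean true effect
    ss_res = np.sum((cate_true - cate_pred) ** 2)
    ss_tot = np.sum((cate_true - cate_true.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": np.sqrt(mse),
        "ae_ate": abs(cate_pred.mean() - cate_true.mean()),
    }

# Invented numbers for illustration
m = cate_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0])
print(m)
```

Note that $R^2$ is undefined when the reference effects are all equal, since the denominator is then zero.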

File description

The main data set files are:

IncomeSCM-1.0.CATE_default.pkl (V1 simulator, default policy ($A$ observational), 50 000 samples, horizon T=5, seed=0)
IncomeSCM-1.0.CATE_no.pkl (V1 simulator, "No studies" policy ($A \leftarrow 0$), 50 000 samples, horizon T=5, seed=1)
IncomeSCM-1.0.CATE_full.pkl (V1 simulator, "Full-time studies" policy ($A \leftarrow 1$), 50 000 samples, horizon T=5, seed=1)
All three files are contained in IncomeSCM-1.0.CATE.zip
Training data: The "default" policy data set represents observational data for causal effect estimators to learn from.
Evaluation data: The "full" and "no" policy data sets represent samples observed under alternative interventions ($A \leftarrow 1$ and $A \leftarrow 0$, respectively). The out-of-sample quality of CATE and ATE estimates can be assessed by using a model fit to the training data to predict the (average) potential outcomes for the subjects in the file representing each intervention, and comparing these predictions to the observed values. In Python, using the "S-learner" estimator implemented in the IncomeSCM package:

import pandas as pd
import numpy as np
from income.estimators import S_learner

dobs = pd.read_pickle('samples/IncomeSCM-1.0.CATE_default.pkl')  # observational training data
d1 = pd.read_pickle('samples/IncomeSCM-1.0.CATE_full.pkl')       # "Full-time studies" policy (A <- 1)
d0 = pd.read_pickle('samples/IncomeSCM-1.0.CATE_no.pkl')         # "No studies" policy (A <- 0)

# c_adj is the set of adjustment variables.
# base_estimator is any regression estimator. For the example to work out of the box,
# it must handle categorical attributes in dobs[c_adj]. Alternatively, one-hot encoders
# can be used, e.g., in a pipeline.
model = S_learner(base_estimator=..., c_int='studies', c_out='income', c_adj=[...]).fit(dobs)

yp1 = model.predict_outcomes(d1)
yp0 = model.predict_outcomes(d0)

cate_pred = yp1 - yp0
cate_true = d1['income'] - d0['income']
mse_cate = np.mean(np.square(cate_pred - cate_true))

ate_pred = np.mean(cate_pred)
ate_true = np.mean(cate_true)

ae_ate = np.abs(ate_pred - ate_true)

A real fitting and evaluation example is given in estimate.py
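The S-learner idea can also be sketched independently of the package: fit a single regression of the outcome on the intervention and the adjustment covariates, then predict with the intervention column overwritten by each value of interest. The class `ToySLearner`, its method `predict_under`, the toy columns and the Ridge/one-hot pipeline below are illustrative assumptions. This is not the implementation in `income.estimators`, whose interface may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


class ToySLearner:
    """Illustrative S-learner: one regression of the outcome on (A, X)."""

    def __init__(self, base_estimator, c_int, c_out, c_adj):
        self.base_estimator = base_estimator
        self.c_int, self.c_out, self.c_adj = c_int, c_out, list(c_adj)

    def fit(self, df):
        self.base_estimator.fit(df[[self.c_int] + self.c_adj], df[self.c_out])
        return self

    def predict_under(self, df, a):
        # Overwrite the intervention column with a, keep the covariates as observed.
        x = df[[self.c_int] + self.c_adj].copy()
        x[self.c_int] = a
        return self.base_estimator.predict(x)


# Synthetic data with a known constant effect of 2.0 (invented, not benchmark data).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "a": rng.integers(0, 2, n),
    "age": rng.uniform(20, 60, n),
    "education": rng.choice(["Bachelors", "HS-grad"], n),
})
toy["y"] = (2.0 * toy["a"] + 0.1 * toy["age"]
            + (toy["education"] == "Bachelors") * 1.0
            + rng.normal(0, 0.1, n))

# One-hot encode the categorical column; pass numeric columns through unchanged.
encoder = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["education"])],
    remainder="passthrough",
)
base = make_pipeline(encoder, Ridge(alpha=1e-3))

model = ToySLearner(base, c_int="a", c_out="y", c_adj=["age", "education"]).fit(toy)
cate_hat = model.predict_under(toy, 1) - model.predict_under(toy, 0)
ate_hat = float(np.mean(cate_hat))
print(round(ate_hat, 2))
```

Because the base estimator here is linear and additive, the predicted effect is the same for every subject. Capturing heterogeneous effects requires a more flexible regressor or explicit interaction terms.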

Using the simulator and estimators (IncomeSCM-1.0)

IncomeSCM is written in Python 3 and based on the Scikit-learn package and the Adult dataset.

Prerequisites

To reproduce results or use the simulator, start by installing the Python modules pandas, numpy, scikit-learn, jupyter, matplotlib, yaml and xgboost, for example in a virtual environment. Below, we list the versions used during development and testing.
```
pip install scikit-learn==1.4.1.post1 pandas==2.0.1 PyYAML==6.0 xgboost==2.0.0 matplotlib==3.7.1
```

Download the IncomeSCM simulator

git clone git@github.com:Healthy-AI/IncomeSim.git

The IncomeSCM simulator is fit to the Adult data set.
To fit the simulator, run the Python script fit.py in the repository folder:

python fit.py [-c CONFIG_FILE]

The default config file is configs/config_v1.yml.
To sample from the simulator, use the script sample.py:

python sample.py [-c CONFIG_FILE]

This script uses the same default config file, which specifies which fitted model to use, how many samples to generate, and from which (counterfactual) policy to sample. By default, 50 000 samples are generated from each of the "default" (observational), "full" and "no" policies.
The samples are stored (by default) in ./samples/[SAMPLE_FILE_NAME].pkl. The file name is determined by the version labels specified in the config file.
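Once sampling has finished, the generated files can be loaded in one pass. The helper `load_samples` below is a hypothetical convenience function, not part of the repository. It is demonstrated on a temporary directory with an invented file, since the actual file names depend on the version labels in the config file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_samples(sample_dir="samples"):
    """Load every .pkl file in sample_dir into a dict keyed by file stem."""
    return {p.stem: pd.read_pickle(p) for p in sorted(Path(sample_dir).glob("*.pkl"))}


# Demonstration on a throwaway directory with an invented file.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"income": [1.0, 2.0]}).to_pickle(Path(tmp) / "toy_default.pkl")
    loaded = load_samples(tmp)

print({name: df.shape for name, df in loaded.items()})
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be read this way.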
