Kuntai Cai, Xiaoyu Lei, Jianxin Wei, Xiaokui Xiao
This project provides the implementation of the paper "Data Synthesis via Differentially Private Markov Random Fields" (PrivMRF). It enables generating a synthetic dataset with differential privacy guarantees.
Requirements:
- Python 3.8
- CUDA 11.7
- GPU that supports cupy
Install dependencies:
pip3 install -r requirements.txt
To reproduce the experimental results from the paper:
python3 script.py
This will run PrivMRF once on each of the four datasets and five ε values, reporting the total variation distances (TVD). Runtime is 2-5 hours depending on hardware.
You may modify script.py to run a subset of the experiments.
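For readers unfamiliar with the metric, the following is a minimal sketch of the total variation distance between the empirical distributions of one attribute (or one tuple of attributes) in the real and synthetic data. The function name total_variation_distance and the toy data are illustrative assumptions, not part of this repository; the evaluation in script.py may aggregate over many marginals and differ in detail.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """TVD between the empirical distributions of two non-empty sequences.

    Hypothetical helper for illustration; values may be scalars or tuples
    (for multi-attribute marginals).
    """
    real_counts = Counter(real_values)
    syn_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_syn = len(synthetic_values)
    # Sum over every value seen in either dataset.
    support = set(real_counts) | set(syn_counts)
    return 0.5 * sum(
        abs(real_counts[v] / n_real - syn_counts[v] / n_syn) for v in support
    )


# Toy data: one binary attribute.
real = [0, 0, 1, 1]
synthetic = [0, 1, 1, 1]
tvd = total_variation_distance(real, synthetic)
```

A TVD of 0 means the two marginals match exactly, and 1 means they have disjoint support.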
To generate a synthetic dataset with specified settings:
python3 main.py
This generates synthetic data without reporting metrics.
The SVM experiment code in script.py is commented out by default. Uncomment it to reproduce:
run_experiment(data_list, method_list, exp_name, task='SVM', epsilon_list=epsilon_list, repeat=repeat, classifier_num=25)
run_experiment(data_list, method_list, exp_name, task='SVM', epsilon_list=epsilon_list, repeat=repeat, classifier_num=25, generate=False)
The first line generates synthetic data; the second trains and tests SVMs in parallel. Limit the sizes of data_list, method_list, and epsilon_list to control the number of processes. One dataset/ε combination takes 1-6 hours.
main.py shows how to run PrivMRF with the default config. Read your data, preprocess the attribute domains into discrete integer values, and call PrivMRF to train and generate synthetic data.
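The preprocessing step above can be sketched as follows. This is a minimal illustration only: the helper names encode_categorical and bin_numeric, the column values, and the age bounds are assumptions, and the preprocessing actually used in this repository may differ. The numeric bounds are passed in explicitly because, in a differentially private pipeline, domain bounds are normally taken from public knowledge rather than from the data.

```python
def encode_categorical(values):
    """Map each distinct value to an integer code in 0..k-1 (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def bin_numeric(values, lo, hi, num_bins):
    """Equal-width binning into codes 0..num_bins-1 over the range [lo, hi].

    Values outside [lo, hi] are clipped to the nearest bound.
    """
    width = (hi - lo) / num_bins
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)
        # The upper bound itself falls into the last bin.
        codes.append(min(int((clipped - lo) / width), num_bins - 1))
    return codes


# Toy columns (hypothetical values).
workclass = ["Private", "State-gov", "Private", "Self-emp"]
age = [17, 35, 52, 90]

workclass_codes, workclass_map = encode_categorical(workclass)
age_codes = bin_numeric(age, lo=17, hi=90, num_bins=4)
```

After this step every attribute takes values in a small integer domain, which is the form the text above says PrivMRF expects.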
The default is δ=1e-5. For other values, calculate the privacy budget with cal_privacy_budget() in PrivMRF/utils/tools.py and hardcode the result in privacy_budget().
In the adult dataset, we merge the values of each attribute into bins. Because a bin contains more records than any single value it covers, counts over bins are more resistant to noise than counts over individual values. This slightly improves PrivMRF performance on adult but doesn't always help on other datasets.
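The effect of merging can be illustrated with a small sketch. The helper merge_into_bins, the toy age column, and the noise scale are hypothetical and are not taken from this repository; the actual grouping used for adult is defined in data/adult_hierarchy.json. The sketch compares the size of a fixed noise scale relative to a per-value count and to a per-bin count.

```python
from collections import Counter


def merge_into_bins(values, bin_width):
    """Merge integer values into bins of equal width (hypothetical helper)."""
    return [v // bin_width for v in values]


# Toy column: ages 20..29, each appearing 5 times.
ages = [a for a in range(20, 30) for _ in range(5)]

value_counts = Counter(ages)                                # 10 values, 5 records each
bin_counts = Counter(merge_into_bins(ages, bin_width=5))    # 2 bins, 25 records each

# Assumed noise scale added to every count, whatever its size.
noise_scale = 2.0
relative_noise_per_value = noise_scale / value_counts[20]
relative_noise_per_bin = noise_scale / bin_counts[4]
```

With the same noise scale, the noise is 40% of a per-value count but only 8% of a per-bin count in this toy example. The price is lost resolution inside each bin, which is one reason merging does not help on every dataset.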
To use an attribute hierarchy for your dataset:
- Define the hierarchy (see data/adult_hierarchy.json)
- Read it with read_hierarchy() in PrivMRF/attribute_hierarchy.py
- Pass the hierarchy to PrivMRF.run()
- Set config['enable_attribute_hierarchy'] = True
Calculate privacy budgets for different ε/δ using cal_privacy_budget() in PrivMRF/utils/tools.py.