A Differentially Private (DP) Synthetic Data benchmarking package, posing the question: "Can a DP Synthesizer produce private (tabular) data that preserves scientific findings?" In other words, do DP Synthesizers satisfy Epistemic Parity?
Citation: Rosenblatt, L., Holovenko, A., Rumezhak, T., Stadnik, A., Herman, B., Stoyanovich, J., & Howe, B. (2022). Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy. arXiv preprint arXiv:2208.12700.
(under review)
The benchmark is currently in beta-0.1. Still, you can install the development version by running the following commands:
1. Create your preferred package management environment with `python=3.7` (for example, `conda create -n "synrd" python=3.7`)
2. `git clone https://github.com/DataResponsibly/SynRD.git`
3. `cd SynRD`
4. `pip install git+https://github.com/ryan112358/private-pgm.git`
5. `pip install .`
Step (4) installs a non-PyPI dependency: [private-pgm](https://github.com/ryan112358/private-pgm), an excellent package for DP synthesizers.
Note: This package is under heavy development. If functionality doesn't work or is missing, feel free to open an issue or submit a PR to fix it!
If you would like to use the GEMSynthesizer, you must follow an alternative installation process for SynRD:
1. Create your preferred package management environment with `python=3.7` (for example, `conda create -n "synrd" python=3.7`)
2. Git clone the SynRD repo and move into the synthesizers folder: `git clone https://github.com/DataResponsibly/SynRD`, then `cd SynRD/synthesizers`
3. Git clone the dp-query-release repo: `git clone https://github.com/terranceliu/dp-query-release.git`
4. Move the `src/` folder out of `dp-query-release/` and into `SynRD/synthesizers/`
5. From the top level of the SynRD clone, run `pip install .`
If you would like to benchmark with the paper `Fruiht2018Naturally`, please follow the relevant `rpy2` installation instructions below to configure your R-Python interface.
If you have a Mac with an M1 chip, you may have success installing `rpy2` via the following:

1. Uninstall existing R versions on your machine.
2. Install `R-4.2.2-arm64.pkg` from https://cran.r-project.org/bin/macosx/.
3. Install mamba and then `rpy2`: `conda install -n base conda-forge::mamba`, then `mamba install -c conda-forge rpy2`
To run analysis for papers using R, you must ensure that R is installed and that your `R_HOME` environment variable is set to your R home directory (the path printed by `R RHOME`). For installing R with Anaconda, you may use `conda install r-base r-essentials`.
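If `R_HOME` is not already set in your shell, one option is to set it from Python before importing `rpy2`. This is only a sketch: the path below is an assumption for a standard macOS install, so substitute the output of `R RHOME` on your machine.

```python
import os

# Assumed macOS path; replace with the output of `R RHOME` on your machine.
os.environ["R_HOME"] = "/Library/Frameworks/R.framework/Resources"

# rpy2 reads R_HOME when the embedded R is initialized, so set it before this import.
import rpy2.robjects as robjects

print(robjects.r["R.version.string"])  # prints the active R version string
```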
For confirming `rpy2` is working as expected, try the following in Python:

```python
import rpy2.robjects as robjects

robjects.r['pi']  # returns an R object containing the number pi
```
- Each "paper" in the benchmark is named according to bibtex convention (authorYEARfirstword).
Brief details on how to add a new paper:

1. Create a new folder named `authorYEARfirstword`.
2. Create a `process.ipynb` notebook as your data playground. Use this to investigate data cleaning/processing/results generation.
3. In parallel with (2), create an `authorYEARfirstword.py` file, and extend the `Publication()` metaclass with `AuthorYEARFirstword(Publication)`. Add the relevant details (see `meta_classes.py` for notes on what this means). Then, begin to move over findings from `process.ipynb` into replicable lambdas in `AuthorYEARFirstword(Publication)`.
4. Ensure that `AuthorYEARFirstword(Publication)` has a `FINDINGS` list class attribute. This should consist of `Finding` objects that wrap each `finding_i(self)` lambda in the proper `Finding`, `VisualFinding` or `FigureFinding` metaclass and add it to the list.
5. See `Saw2018Cross` for an example of a cleanly implemented `Publication` class; a minimal sketch is also given just after this list.
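To make the structure concrete, here is a minimal, hypothetical sketch of such a subclass. The import path, the `Finding` constructor arguments, and the data handling are assumptions rather than the package's confirmed API; mirror `Saw2018Cross` and `meta_classes.py` for the real interface.

```python
# Hypothetical sketch only: the import path, Finding constructor arguments,
# and data handling are assumptions, not SynRD's confirmed API.
import pandas as pd

from SynRD.publication import Publication, Finding  # assumed module path


class AuthorYEARFirstword(Publication):
    """Replication class for authorYEARfirstword (illustrative only)."""

    def __init__(self, dataframe: pd.DataFrame):
        # Cleaned dataframe produced in process.ipynb
        self.dataframe = dataframe

    def finding_1_1(self):
        """(A sentence or two quoted from the paper describing this finding.)"""
        mean_a = self.dataframe["group_a"].mean()  # placeholder column names
        mean_b = self.dataframe["group_b"].mean()
        soft_finding = mean_a > mean_b             # the paper's stated inequality
        hard_findings = [mean_a - mean_b]          # magnitude behind the inequality
        return ([mean_a, mean_b], soft_finding, hard_findings)

    # FINDINGS list class attribute wrapping each finding lambda
    FINDINGS = [Finding(finding_1_1)]
```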
`Finding` lambdas should have a particular structure that must be strictly adhered to. Consider the following example, and note in particular the return values:
```python
def finding_i_j(self):  # there can be kwargs
    """
    (Text from paper, usually 2 or 3 sentences)
    """
    # often can use a table finding directly or
    # as a starting point to quickly recreate
    # finding
    results = self.table()
    # (pandas stuff happens here to generate
    # the findings)
    return ([values],
            soft_finding,
            [hard_findings])
```
The finding lambdas can essentially perform any computation necessary, but must return a tuple of:

1. A list of values (any values relevant to the soft finding, non-exhaustive), e.g. `[interest_stem_ninth, interest_stem_eleventh]`
2. A `soft_finding` boolean (simply a boolean that reflects the primary inequality/contrast presented in the original paper for this finding), e.g. `soft_finding = interest_stem_ninth > interest_stem_eleventh`
3. A list of hard findings, i.e. values (this could be the difference or set of differences that drove the `soft_finding` inequality), e.g. `hard_finding = interest_stem_ninth - interest_stem_eleventh` and `hard_findings = [hard_finding]`
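Putting the three return values together, a full finding lambda might look like the following sketch. The column names, index labels, and the use of `self.table()` as the starting point are illustrative assumptions, not code from a specific paper class.

```python
def finding_2_1(self):
    """Ninth graders report higher STEM interest than eleventh graders
    (illustrative wording; the data access below is a placeholder)."""
    results = self.table()  # reuse a table finding as a starting point

    interest_stem_ninth = results.loc["ninth", "stem_interest"]
    interest_stem_eleventh = results.loc["eleventh", "stem_interest"]

    soft_finding = interest_stem_ninth > interest_stem_eleventh
    hard_findings = [interest_stem_ninth - interest_stem_eleventh]

    return ([interest_stem_ninth, interest_stem_eleventh],
            soft_finding,
            hard_findings)
```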