SDGym evaluates the performance of Synthetic Data Generators, also called Synthesizers.
A Synthesizer is a Python function (or class method) that takes as input a dict with table names as keys and pandas.DataFrame instances as values, which we call the real data. It outputs another dict with the same table names and new pandas.DataFrame instances, filled with synthetic data that has the same format and mathematical properties as the real data.
The complete list of inputs of the synthesizer is:

- real_data: a dict containing table names as keys and pandas.DataFrame instances as values.
- metadata: an instance of sdv.Metadata with information about the dataset.

The output is a new dict with the same tables that real_data contains.
import pandas
import sdv

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data
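As a minimal sketch of this contract, the function below resamples each column independently, with replacement. It is not one of the SDGym synthesizers: the name resampling_synthesizer, the seed argument and the toy users table are assumptions made for illustration, and metadata is accepted but ignored.

```python
import numpy as np
import pandas as pd


def resampling_synthesizer(real_data, metadata=None, seed=0):
    """Hypothetical baseline: resample every column independently."""
    rng = np.random.default_rng(seed)
    synthetic_data = {}
    for table_name, table in real_data.items():
        columns = {}
        for column in table.columns:
            # Draw len(table) values, with replacement, from the real column.
            values = table[column].to_numpy()
            columns[column] = rng.choice(values, size=len(table), replace=True)
        # Keep the same column names and order as the real table.
        synthetic_data[table_name] = pd.DataFrame(columns, columns=table.columns)
    return synthetic_data


# Toy single-table dataset, made up for this example.
real_data = {
    "users": pd.DataFrame({
        "age": [23, 35, 47, 51],
        "country": ["ES", "US", "US", "FR"],
    }),
}
synthetic_data = resampling_synthesizer(real_data)
```

The returned dict has the same table names, column names and row counts as real_data, which matches the output described above. Because columns are resampled separately, any correlation between them is lost.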
Apart from the benchmark functionality, SDGym implements a collection of Baseline Synthesizers which are either trivial baseline synthesizers or integrations of synthesizers found in other libraries.
These Synthesizers are written as Python classes that can be imported from the sdgym.synthesizers module and have a fit_sample method with the signature indicated above, which can be passed directly to the sdgym.run function to benchmark them.
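The class-based form can be sketched as follows. CopySynthesizer is a hypothetical class written for this example, not something exported by sdgym.synthesizers, and how SDGym instantiates such classes internally is not covered here. The sketch only shows that a bound fit_sample method is a plain callable taking (real_data, metadata).

```python
import pandas as pd


class CopySynthesizer:
    """Hypothetical class-based synthesizer, not part of sdgym.synthesizers."""

    def fit_sample(self, real_data, metadata=None):
        # "Fit" and "sample" in a single step: copy every table, which mimics
        # the behaviour described for the Identity baseline.
        return {name: table.copy() for name, table in real_data.items()}


# Toy dataset, made up for this example.
real_data = {"users": pd.DataFrame({"age": [23, 35, 47]})}

# The bound method has the same (real_data, metadata) signature as a function.
synthesizer = CopySynthesizer().fit_sample
synthetic_data = synthesizer(real_data, None)
```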
This is the list of all the Synthesizers currently implemented, with references to the corresponding publications when applicable.
Name | Description | Reference |
---|---|---|
Identity | The synthetic data is the same as training data. | |
Independent | Each column sampled independently. Continuous columns use Gaussian Mixture Model and discrete columns use the PMF of training data. | |
Uniform | Each column in the synthetic data is sampled independently and uniformly. | |
CLBN | | [2] |
CopulaGAN | sdv.tabular.CopulaGAN | |
CTGAN | sdv.tabular.CTGAN | [1] |
GaussianCopulaCategorical | sdv.tabular.GaussianCopula using a CategoricalTransformer | |
GaussianCopulaCategoricalFuzzy | sdv.tabular.GaussianCopula using a CategoricalTransformer with fuzzy=True | |
GaussianCopulaOneHot | sdv.tabular.GaussianCopula using a OneHotEncodingTransformer | |
HMA1 | sdv.relational.HMA1 | [7] |
MedGAN | | [6] |
PAR | sdv.timeseries.PAR | |
PrivBN | | [3] |
TVAE | | [1] |
TableGAN | | [4] |
SDV | sdv.SDV | [7] |
VEEGAN | | [5] |
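To make the Uniform row of the table concrete, here is a rough sketch of one way to sample each column independently and uniformly. It is an illustration under stated assumptions, not the SDGym implementation: the function name uniform_synthesizer, the seed argument, and the choice of "uniform between the observed minimum and maximum" for numeric columns and "uniform over the distinct observed values" for all other columns are guesses at what uniform sampling could mean here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def uniform_synthesizer(real_data, metadata=None, seed=0):
    """Hypothetical sketch of a uniform baseline (metadata is ignored)."""
    rng = np.random.default_rng(seed)
    synthetic_data = {}
    for table_name, table in real_data.items():
        n_rows = len(table)
        columns = {}
        for name in table.columns:
            column = table[name]
            if is_numeric_dtype(column) and not is_bool_dtype(column):
                # Numeric column: uniform floats between the observed min and max.
                # Note that integer columns come back as floats in this sketch.
                columns[name] = rng.uniform(column.min(), column.max(), size=n_rows)
            else:
                # Other columns: uniform draw over the distinct observed values.
                categories = column.dropna().drop_duplicates().to_numpy()
                columns[name] = rng.choice(categories, size=n_rows)
        synthetic_data[table_name] = pd.DataFrame(columns, columns=table.columns)
    return synthetic_data


# Toy dataset, made up for this example.
real_data = {
    "users": pd.DataFrame({
        "age": [23, 35, 47, 51],
        "country": ["ES", "US", "US", "FR"],
    }),
}
synthetic_data = uniform_synthesizer(real_data)
```

Such a baseline ignores both the marginal distributions and the correlations in the real data, so it is mainly useful as a lower bound when reading benchmark scores.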
If you want to re-evaluate the performance of any of the SDGym synthesizers, all you need to
do is pass its class directly to the sdgym.run function:
import sdgym
from sdgym.synthesizers import CTGAN

leaderboard = sdgym.run(synthesizers=CTGAN)
If you want to run the complete benchmark suite to re-evaluate all the existing synthesizers, pass a list of them to the same function:
import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)
[1] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. "Modeling tabular data using conditional GAN." (2019)
[2] C. Chow, Cong Liu. "Approximating discrete probability distributions with dependence trees." (1968)
[3] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao. "PrivBayes: Private data release via Bayesian networks." (2017)
[4] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, Youngmin Kim. "Data synthesis based on generative adversarial networks." (2018)
[5] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, Charles Sutton. "VEEGAN: Reducing mode collapse in GANs using implicit variational learning." (2017)
[6] Karim Armanious, Chenming Jiang, Marc Fischer, Thomas Küstner, Konstantin Nikolaou, Sergios Gatidis, Bin Yang. "MedGAN: Medical Image Translation using GANs." (2018)
[7] Neha Patki, Roy Wedge, Kalyan Veeramachaneni. "The Synthetic Data Vault." (2018)