Synthetic Data Generators

SDGym evaluates the performance of Synthetic Data Generators, also called Synthesizers.

A Synthesizer is a Python function (or class method) that takes as input a dict mapping table names to pandas.DataFrame instances, which we call the real data, and outputs another dict with the same table names and new pandas.DataFrame instances, filled with synthetic data that has the same format and mathematical properties as the real data.

The complete list of inputs of the synthesizer is:

  • real_data: a dict containing table names as keys and pandas.DataFrame instances as values.
  • metadata: an instance of an sdv.Metadata with information about the dataset.

The output is a new dict with the same tables that real_data contains.

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data
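As an illustration of this contract, here is a minimal sketch of a toy synthesizer that resamples the rows of each table with replacement. The name bootstrap_synthesizer and the example table are hypothetical and not part of SDGym. The metadata argument is accepted but ignored, and resampling tables independently would not preserve relationships between tables.

```python
import pandas as pd


def bootstrap_synthesizer(real_data, metadata=None):
    """Toy synthesizer (hypothetical): resample each table's rows with replacement."""
    synthetic_data = {}
    for table_name, table in real_data.items():
        # Draw as many rows as the real table has, allowing repeats.
        resampled = table.sample(n=len(table), replace=True, random_state=0)
        synthetic_data[table_name] = resampled.reset_index(drop=True)
    return synthetic_data


# Made-up single-table dataset used only for this example.
real_data = {
    "users": pd.DataFrame({
        "age": [23, 35, 41, 58],
        "country": ["ES", "US", "US", "FR"],
    }),
}
synthetic_data = bootstrap_synthesizer(real_data)
print(synthetic_data["users"])
```

The returned dict has the same keys as the input, and each DataFrame keeps the columns and row count of its real counterpart.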

SDGym Synthesizers

Apart from the benchmark functionality, SDGym implements a collection of Synthesizers which are either trivial baselines or integrations of synthesizers found in other libraries.

These Synthesizers are written as Python classes that can be imported from the sdgym.synthesizers module and have a fit_sample method with the signature indicated above, which can be directly passed to the sdgym.run function to benchmark them.
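A custom synthesizer can follow the same class convention. The sketch below is a hypothetical example, not one of the SDGym classes: ColumnShuffle and its seed parameter are assumptions made for illustration. It permutes every column independently, which keeps each column's marginal distribution but discards correlations between columns.

```python
import numpy as np
import pandas as pd


class ColumnShuffle:
    """Hypothetical baseline: permute every column of every table independently."""

    def __init__(self, seed=0):
        self.seed = seed

    def fit_sample(self, real_data, metadata=None):
        rng = np.random.default_rng(self.seed)
        synthetic_data = {}
        for table_name, table in real_data.items():
            # rng.permutation returns a shuffled copy, so the real data is untouched.
            shuffled = {
                column: rng.permutation(table[column].to_numpy())
                for column in table.columns
            }
            synthetic_data[table_name] = pd.DataFrame(shuffled)
        return synthetic_data


# Made-up table used only for this example.
real = {"t": pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})}
out = ColumnShuffle(seed=1).fit_sample(real)
print(out["t"])
```

Because fit_sample takes the real data and metadata and returns a dict of tables, it matches the signature described above.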

This is the list of all the Synthesizers currently implemented, with references to the corresponding publications when applicable.

| Name | Description | Reference |
|------|-------------|-----------|
| Identity | The synthetic data is the same as training data. | |
| Independent | Each column sampled independently. Continuous columns use Gaussian Mixture Model and discrete columns use the PMF of training data. | |
| Uniform | Each column in the synthetic data is sampled independently and uniformly. | |
| CLBN | | [2] |
| CopulaGAN | sdv.tabular.CopulaGAN | |
| CTGAN | sdv.tabular.CTGAN | [1] |
| GaussianCopulaCategorical | sdv.tabular.GaussianCopula using a CategoricalTransformer | |
| GaussianCopulaCategoricalFuzzy | sdv.tabular.GaussianCopula using a CategoricalTransformer with fuzzy=True | |
| GaussianCopulaOneHot | sdv.tabular.GaussianCopula using a OneHotEncodingTransformer | |
| HMA1 | sdv.relational.HMA1 | [7] |
| MedGAN | | [6] |
| PAR | sdv.timeseries.PAR | |
| PrivBN | | [3] |
| TVAE | | [1] |
| TableGAN | | [4] |
| SDV | sdv.SDV | [7] |
| VEEGAN | | [5] |
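To make the Uniform description concrete, the following is a rough sketch of sampling each column independently and uniformly. It is an assumption about what such a baseline could look like, not the SDGym implementation: the function name uniform_sample, the numeric versus non-numeric split, and the use of the observed minimum and maximum as bounds are all choices made for this example.

```python
import numpy as np
import pandas as pd


def uniform_sample(table, seed=0):
    """Sketch: sample every column independently and uniformly (not SDGym's code)."""
    rng = np.random.default_rng(seed)
    n_rows = len(table)
    columns = {}
    for name in table.columns:
        column = table[name]
        if pd.api.types.is_numeric_dtype(column):
            # Numeric column: uniform draw between the observed min and max.
            columns[name] = rng.uniform(column.min(), column.max(), size=n_rows)
        else:
            # Other columns: every observed category is equally likely.
            categories = np.asarray(column.unique(), dtype=object)
            columns[name] = rng.choice(categories, size=n_rows)
    return pd.DataFrame(columns)


# Made-up table used only for this example.
table = pd.DataFrame({
    "age": [23, 35, 41, 58],
    "country": ["ES", "US", "US", "FR"],
})
sampled = uniform_sample(table)
print(sampled)
```

Note that the categorical draw ignores how often each category appears in the real data, which is what distinguishes this idea from sampling from the empirical PMF as described for Independent.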

Benchmarking the SDGym Synthesizers

To re-evaluate the performance of any of the SDGym synthesizers, pass its class directly to the sdgym.run function:

import sdgym
from sdgym.synthesizers import CTGAN

leaderboard = sdgym.run(synthesizers=CTGAN)

If you want to run the complete benchmark suite to re-evaluate all the existing synthesizers you can simply pass the list of them to the function:

⚠️ WARNING: This takes a lot of time to run!
import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

References

[1] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. "Modeling tabular data using conditional GAN." (2019)

[2] C. Chow, Cong Liu. "Approximating discrete probability distributions with dependence trees." (1968)

[3] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao. "PrivBayes: Private data release via Bayesian networks." (2017)

[4] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, Youngmin Kim. "Data synthesis based on generative adversarial networks." (2018)

[5] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, Charles Sutton. "VEEGAN: Reducing mode collapse in GANs using implicit variational learning." (2017)

[6] Karim Armanious, Chenming Jiang, Marc Fischer, Thomas Küstner, Konstantin Nikolaou, Sergios Gatidis, Bin Yang. "MedGAN: Medical Image Translation using GANs." (2018)

[7] Neha Patki, Roy Wedge, Kalyan Veeramachaneni. "The Synthetic Data Vault." (2018)