This was our (Team DESTATIS) repository for work on the United Nations Economic Commission for Europe (UNECE) High-level Group for the Modernisation of Statistical Production and Services (HLG-MOS) Synthetic Data Challenge 2022.
The HLG-MOS Synthetic Data Challenge was all about exploring different methods, algorithms, metrics and utilities to create synthetic data. Synthetic data could be an interesting option for national statistical agencies to share data while maintaining public trust.
In order to be beneficial for certain use cases (e.g. 'release data to the public', 'release to trusted researchers', 'usage in education'), the synthetic data needs to preserve certain statistical properties of the original data. Thus, the synthetic data needs to be similar to the original data, but at the same time it has to be different enough to preserve privacy.
There is a lot of active research on synthetic data, and new methods for generating synthetic data and evaluating its confidentiality are emerging. The HLG-MOS has created a Synthetic Data Starter Guide to give national statistical offices an introduction to this topic.
Figure 1: Example result: a correlation plot showing the differences in correlations between a GAN-created synthetic dataset and the original data
The goal of the challenge was to create synthetic versions of the provided datasets and afterwards evaluate to what extent we would use this synthetic data for certain use cases. These use cases were 'Releasing microdata to the public', 'Testing analysis', 'Education' and 'Testing technology'.
One objective was to evaluate as many different methods as possible, while still trying to optimize the parameters of each method as well as possible.
Our team managed to create synthetic data with the following methods:
- Fully Conditional Specification (FCS); a minimal sketch follows this list
- Generative Adversarial Network (GAN)
- Probabilistic Graphical Models (Minutemen DP-pgm)
- Information Preserving Statistical Obfuscation (IPSO)
- Multivariate non-normal distribution (Simulated)
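To give a feel for what one of these methods looks like in code, here is a minimal FCS sketch using the R package synthpop, which synthesises each variable sequentially, conditional on the previously synthesised ones (CART models by default). The file name, seed and parameters below are purely illustrative and not our exact setup; our actual scripts are among the 7_ files.

```r
# Minimal FCS sketch with synthpop (illustrative, not our exact pipeline)
library(synthpop)

# Illustrative path to the original data
satgpa <- read.csv("satgpa.csv")

# Synthesise one copy; synthpop fits a model for each variable
# conditional on the variables synthesised before it
syn_obj  <- syn(satgpa, m = 1, seed = 2022)
syn_data <- syn_obj$syn

# Built-in quick comparison of marginal distributions
compare(syn_obj, satgpa)
```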
The other objective was to do this evaluation, ideally for both of the provided original datasets: one dataset (SATGPA) being more of a toy example, the other (ACS) a more complex real-life dataset.
- SATGPA: SAT (United States standardized university admissions test) and GPA (university grade point average) data, 6 features, 1,000 observations.
- ACS: Demographic survey data (American Community Survey), 33 features, 1,035,201 observations.
So overall, it was about trying as many methods as possible, while still doing a quality evaluation (in terms of privacy and utility metrics) for each created synthetic dataset.
Final deliverables were:
A short five-minute summary video, the synthetic datasets, the evaluation reports and an evaluation of the Starter Guide.
Figure 2: Example result: a histogram showing the differences in distributions between a GAN-created synthetic dataset and the original data
Our team from the Federal Statistical Office of Germany (Statistisches Bundesamt) consisted of five members from different groups within Destatis. Participating were:
- Steffen M.
- Reinhard T.
- Felix G.
- Michel R.
- Hariolf M.
Since the challenge time was limited and we were working in parallel, the GitHub repository might look a little untidy. There are plenty of interesting things to find in the repository; here is a quick orientation:
- All 0_ files: Overview presentation about our challenge work
- All 1_ files: All our final synthetic datasets and multiple evaluation reports
- All 2_ files: Different folders with .Rmd files to create the evaluation reports for the synthetic datasets
- All 3_ files: Mainly intermediate datasets created using Minutemen
- All 4_ files: Saved CGAN models
- All 5_ files: Different resulting synthetic datasets
- All 6_ files: .Rmd files used to run Python code for GANs
- All 7_ files: Different .R files for running algorithms to create synthetic data
- All other files: Mainly different other .R code files, original datasets, samples of original datasets
Some larger files (>100 MB) of our repo are unfortunately not included, because of GitHub's maximum file size limit.
We ended up in 2nd place on the challenge leaderboard (which mostly reflected how many methods a team successfully used to create and evaluate synthetic datasets). It might be interesting to look at our final overview slides.
Here is also a quick overview of some resulting metrics for the different methods:
Figure 3: Some of the utility metrics calculated for the created synthetic versions of the SATGPA dataset
Figure 4: Some of the privacy metrics calculated for the created synthetic versions of the SATGPA dataset
One key takeaway from this challenge is that one or two metrics cannot tell the whole story. Some datasets showed, for example, really good utility results according to the pMSE metric, while the histogram comparisons looked terrible. That is why, in order to evaluate each method and each created synthetic dataset, we had to compile quite large reports using several different metrics.
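For reference, pMSE (propensity score mean squared error) works by stacking original and synthetic records, fitting a model that tries to tell them apart, and measuring how far the predicted propensities are from the overall synthetic share. Here is a minimal sketch in R, assuming two data frames with identical columns and using a simple logistic propensity model (the helper name pmse and the model choice are illustrative; CART or models with interaction terms are also commonly used):

```r
# Minimal pMSE sketch (illustrative helper, not our exact evaluation code)
pmse <- function(original, synthetic) {
  stacked <- rbind(
    cbind(original,  is_syn = 0),
    cbind(synthetic, is_syn = 1)
  )
  c_share <- mean(stacked$is_syn)  # share of synthetic rows: n_syn / (n_orig + n_syn)

  # Propensity model: predict whether a row is synthetic
  fit   <- glm(is_syn ~ ., data = stacked, family = binomial())
  p_hat <- predict(fit, type = "response")

  # pMSE: 0 means the model cannot distinguish synthetic from original rows
  mean((p_hat - c_share)^2)
}
```

A pMSE close to 0 looks like high utility, but it says little about how well individual marginal distributions are preserved, which is exactly why we also compared histograms and compiled the larger reports.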
These reports SATGPA-FCS, SATGPA-GAN, SATGPA-PGM, SATGPA-IPSO, SATGPA-SIM, ACS-FCS, ACS-GAN, ACS-PGM, ACS-IPSO and ACS-SIM (together with the datasets themselves) were the main results.
To end this already quite lengthy README.md file, here is the SATGPA-GAN report as an example: