* Update sdg_logo.png
* update readme content
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* add demo video
  Both video files are highly compressed
* Update readme
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix demo video in markdown
* use GIF instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* add emoji
* add c3tgan reference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 269063d, commit 457158a
Showing 5 changed files with 69 additions and 52 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,26 +26,40 @@ | |
</p> | ||
</div> | ||
|
||
Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data. | ||
The Synthetic Data Generator (SDG) is a specialized framework designed to rapidly generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms and also integrates an LLM-based synthetic data generation model. | ||
|
||
Synthetic data is generated by machines based on real data and algorithms, it does not contain sensitive information, but can retain the characteristics of real data. | ||
There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA. | ||
In practical applications, there is no need to worry about the risk of privacy leakage. | ||
High-quality synthetic data can also be used in various fields such as data opening, model training and debugging, system development and testing, etc. | ||
Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications. | ||
|
||
## 🎉 Features | ||
High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc. Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details! | ||
|
||
- high performance | ||
- Supports a wide range of statistical data synthesis algorithms to achieve up to 120x performance improvement, without the need for GPU devices; | ||
## 🔧 Features | ||
|
||
- Technological advancements | ||
- Supports a wide range of statistical data synthesis algorithms and also integrates an LLM-based synthetic data generation model; | ||
- Optimised for big data scenarios, effectively reducing memory consumption; | ||
- Continuously tracking the latest advances in academia and industry, and introducing support for excellent algorithms and models in a timely manner. | ||
- Provides distributed training support for deep learning models with frameworks such as torch. | ||
- Privacy enhancements: | ||
- Privacy enhancements | ||
- SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data. | ||
- Easy to extend | ||
- Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages | ||
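
As a rough illustration of what differential privacy means in this context, the sketch below applies the textbook Laplace mechanism to a counting query. It is a generic example, not SDG's implementation; the helper name `private_count` and its parameters are assumptions made for illustration only.

```python
import random
from typing import Callable, Iterable


def private_count(
    values: Iterable[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return a noisy count of the values matching predicate (hypothetical helper)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so the Laplace mechanism uses noise with scale = 1 / epsilon.
    rate = epsilon  # rate of each exponential draw is 1 / scale
    # The difference of two i.i.d. exponential draws is Laplace-distributed around 0.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


ages = [23, 35, 41, 52, 67]
print(private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0)))
```

A smaller `epsilon` adds more noise and gives a stronger privacy guarantee; a production system would also need to track the total privacy budget across queries, which this sketch does not do.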
|
||
Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details. | ||
### 🎉 LLM-integrated synthetic data generation | ||
|
||
LLMs have long been used to understand and generate various types of data, and they are also capable of generating tabular data. They additionally offer some abilities that traditional approaches (GAN-based or statistical methods) cannot provide. | ||
|
||
Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features: | ||
|
||
#### Synthetic data generation without Data | ||
|
||
No training data is required; synthetic data can be generated from metadata alone. | ||
|
||
![Synthetic data generation without Data](assets/LLM_Case_1.gif) | ||
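
To make the idea of metadata-only generation more concrete, here is a minimal sketch of how column metadata could be turned into a text prompt for an LLM. It does not use or reproduce the `SingleTableGPTModel` API; the function `build_prompt` and the `column_types` dictionary format are hypothetical and exist only for this example.

```python
from typing import Dict


def build_prompt(column_types: Dict[str, str], n_rows: int) -> str:
    """Describe the table schema in plain text and ask for CSV rows (hypothetical helper)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    header = ",".join(column_types)
    lines = [
        f"Generate {n_rows} rows of realistic synthetic tabular data as CSV.",
        "Columns:",
    ]
    # One line per column so the model sees both the name and the expected type.
    lines += [f"- {name}: {dtype}" for name, dtype in column_types.items()]
    lines.append(f"Return only CSV with header: {header}")
    return "\n".join(lines)


# Assumed metadata: column names mapped to simple type labels.
metadata = {"age": "int", "occupation": "category", "income": "float"}
print(build_prompt(metadata, n_rows=5))
```

The resulting string would then be sent to whichever LLM backend is configured; no real records are involved at any point, which is what distinguishes this mode from training-based synthesis.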
|
||
#### Off-Table feature inference | ||
|
||
Infers new column data from the existing data in the table and the knowledge held by the LLM. | ||
|
||
![Off-Table feature inference](assets/LLM_Case_2.gif) | ||
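
Off-table inference ends with attaching the inferred values to the existing rows. The sketch below shows only that final merge step under simple assumptions: the LLM reply is taken to contain one value per line, in row order, and a hard-coded string stands in for a real model response. The name `merge_inferred_column` is hypothetical and not part of SDG.

```python
from typing import Dict, List


def merge_inferred_column(
    rows: List[Dict[str, str]], new_column: str, llm_reply: str
) -> List[Dict[str, str]]:
    """Attach one inferred value per row as a new column (hypothetical helper)."""
    # Assumed reply format: one value per non-empty line, in the same order as rows.
    values = [line.strip() for line in llm_reply.splitlines() if line.strip()]
    if len(values) != len(rows):
        raise ValueError(f"expected {len(rows)} values, got {len(values)}")
    # Build new dicts so the input rows are left unmodified.
    return [{**row, new_column: value} for row, value in zip(rows, values)]


rows = [{"city": "Paris"}, {"city": "Tokyo"}]
stand_in_reply = "France\nJapan"  # placeholder for a real LLM response
print(merge_inferred_column(rows, "country", stand_in_reply))
```

Checking the number of returned values before merging is a cheap guard against replies that skip or duplicate rows.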
|
||
## 🔛 Quick Start | ||
|
||
|
@@ -57,9 +71,15 @@ You can use pre-built images to quickly experience the latest features. | |
docker pull idsteam/sdgx:latest | ||
``` | ||
|
||
### Install from PyPi | ||
|
||
```bash | ||
pip install sdgx | ||
``` | ||
|
||
### Local Install (Recommended) | ||
|
||
At present, the code of this project is updated very quickly. We recommend that you use SDG by installing it through the source code. | ||
Use SDG by installing it through the source code. | ||
|
||
```bash | ||
git clone git@github.com:hitsz-ids/synthetic-data-generator.git | ||
|
@@ -68,12 +88,6 @@ pip install . | |
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git | ||
``` | ||
|
||
### Install from PyPi | ||
|
||
```bash | ||
pip install sdgx | ||
``` | ||
|
||
### Quick Demo of Single Table Data Generation and Metric | ||
|
||
#### Demo code | ||
|
@@ -156,21 +170,13 @@ The SDG project was initiated by **Institute of Data Security, Harbin Institute | |
|
||
## 👩‍🎓 Related Work | ||
|
||
### Research Paper | ||
|
||
- CTGAN:[Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html) | ||
- C3-TGAN: [C3-TGAN- Controllable Tabular Data Synthesis with Explicit Correlations and Property Constraints](https://www.researchgate.net/publication/374652636_C3-TGAN-_Controllable_Tabular_Data_Synthesis_with_Explicit_Correlations_and_Property_Constraints) | ||
- TVAE:[Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html) | ||
- table-GAN:[Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf) | ||
- CTAB-GAN:[CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf) | ||
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf) | ||
|
||
### Dataset | ||
|
||
- [Adult](http://archive.ics.uci.edu/ml/datasets/adult) | ||
- [Satellite](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite) | ||
- [Rossmann](https://www.kaggle.com/competitions/rossmann-store-sales/data) | ||
- [Telstra](https://www.kaggle.com/competitions/telstra-recruiting-network/data) | ||
|
||
## 📄 License | ||
|
||
The SDG open source project uses the Apache-2.0 license; please refer to the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE). |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -27,24 +27,41 @@ | |
</p> | ||
</div> | ||
|
||
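Synthetic Data Generator (SDG) is a component focused on quickly generating high-quality structured tabular data. It supports a variety of single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods to strengthen the security of synthetic data. | ||
The Synthetic Data Generator (SDG) is a data component focused on quickly generating high-quality structured tabular data. SDG supports single-table and multi-table data synthesis algorithms and integrates a synthetic data generation model based on large language models (LLMs). | ||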
|
||
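Synthetic data is generated by machines from real data and algorithms. It contains no sensitive information but retains the behavioral characteristics of the real data. There is no correspondence between synthetic data and real data, so it is not subject to privacy regulations such as GDPR and ADPPA, and there is no need to worry about the risk of privacy leakage in practical applications. High-quality synthetic data can be used in many fields, such as secure data opening, model training and debugging, and system development and testing. | ||
Synthetic data is generated by computers using real data, metadata, and algorithms. It does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, which exempts it from privacy regulations such as GDPR and ADPPA and eliminates the risk of privacy leakage in practical applications. | ||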
|
||
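## 🎉 Main Features | ||
High-quality synthetic data can be used safely and in diverse ways across various domains, including data sharing, model training and debugging, and system development and testing. Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details. | ||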
|
||
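- High performance | ||
- Supports a variety of statistical data synthesis algorithms, achieving up to a 120x performance improvement without requiring GPU devices; | ||
## 🔧 Main Features | ||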
|
||
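- Continuous advancement: | ||
- Supports a variety of statistical data synthesis algorithms as well as LLM-based simulated data generation methods; | ||
- Optimized for big data scenarios, effectively reducing memory consumption; | ||
- Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner. | ||
- Provides distributed training support for deep learning models with frameworks such as torch | ||
- Privacy enhancements | ||
- Privacy enhancements: | ||
- Provides automatic identification of Chinese sensitive data, covering 17 common sensitive fields such as full names, ID card numbers, and personal names; | ||
- Supports differential privacy, anonymization, and other methods to strengthen the security of synthetic data. | ||
- Easy to extend | ||
- Supports extending models, data processing, data connectors, and other functionality in the form of plug-in packages | ||
- Easy to extend: | ||
- Supports extending models, data processing, data connectors, and other functionality in the form of plug-in packages. | ||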
|
||
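### 🎉 LLM-assisted synthetic data generation | ||
|
||
LLMs have long been used to understand and generate various types of data. In fact, LLMs also perform well at tabular data generation, and they have some capabilities that traditional approaches (GAN-based or statistical methods) cannot achieve. | ||
|
||
Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features: | ||
|
||
#### Data synthesis without original records | ||
|
||
No original training data is required; synthetic data can be generated from metadata. | ||
|
||
Read [the latest docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details. | ||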
![Synthetic data generation without Data](assets/LLM_Case_1.gif) | ||
|
||
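#### Off-table feature inference | ||
|
||
Infers off-table features, that is, new column data, from the existing data in the table and the knowledge held by the LLM. | ||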
|
||
![Off-Table feature inference](assets/LLM_Case_2.gif) | ||
|
||
## 🔛 Quick Start | ||
|
||
|
@@ -56,9 +73,15 @@ | |
docker pull idsteam/sdgx:latest | ||
``` | ||
|
||
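### Local Install (Currently Recommended) | ||
### Install from PyPi | ||
|
||
At present, the code of this project is updated very quickly. We recommend that you use SDG by installing it from source. | ||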
```bash | ||
pip install sdgx | ||
``` | ||
|
||
### Local Install | ||
|
||
You can use SDG by installing it from source. | ||
|
||
```bash | ||
git clone git@github.com:hitsz-ids/synthetic-data-generator.git | ||
|
@@ -67,12 +90,6 @@ pip install . | |
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git | ||
``` | ||
|
||
### Install from PyPi | ||
|
||
```bash | ||
pip install sdgx | ||
``` | ||
|
||
### Quick Demo of Single-Table Data Synthesis | ||
|
||
#### Demo code | ||
|
@@ -158,18 +175,12 @@ The SDG open source project was initiated by **Institute of Data Security, Harbin Institute of Technology (Shenzhen)** | |
### Research Paper | ||
|
||
- CTGAN:[Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html) | ||
- C3-TGAN: [C3-TGAN- Controllable Tabular Data Synthesis with Explicit Correlations and Property Constraints](https://www.researchgate.net/publication/374652636_C3-TGAN-_Controllable_Tabular_Data_Synthesis_with_Explicit_Correlations_and_Property_Constraints) | ||
- TVAE:[Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html) | ||
- table-GAN:[Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf) | ||
- CTAB-GAN:[CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf) | ||
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf) | ||
|
||
### Dataset | ||
|
||
- [Adult dataset](http://archive.ics.uci.edu/ml/datasets/adult) | ||
- [Satellite dataset](http://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite) | ||
- [Rossmann dataset](https://www.kaggle.com/competitions/rossmann-store-sales/data) | ||
- [Telstra dataset](https://www.kaggle.com/competitions/telstra-recruiting-network/data) | ||
|
||
## 📄 License | ||
|
||
The SDG open source project uses the Apache-2.0 license. For details, please refer to the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE). |