Update doc readme #140

Merged 7 commits on Feb 21, 2024
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
 same "printed page" as the copyright notice for easier
 identification within third-party archives.

-Copyright 2023 hitsz-ids
+Copyright 2024 hitsz-ids

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
2 changes: 1 addition & 1 deletion README.md
@@ -26,7 +26,7 @@
 </p>
 </div>

-The Synthetic Data Generator (SDG) is a specialized framework designed to rapidly generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, LLM-based synthetic data generation model is also integrated.
+The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table, multi-table data synthesis algorithms and LLM-based synthetic data generation models.

 Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

157 changes: 12 additions & 145 deletions docs/README.md
@@ -1,162 +1,29 @@
-# Quick Start
+# SDG API docs

-## Quick Installation
+## Online docs

-`pip install sdgx`
+Typically, our [latest API document](https://synthetic-data-generator.readthedocs.io/en/latest/) can be accessed via readthedocs.

-## Quick Example: Single-Table Data Synthesis
+## Build docs locally

-```python
-# Import the relevant modules
-from sdgx.models.single_table.ctgan import CTGAN
-from sdgx.data_process.sampling.sampler import DataSamplerCTGAN
-from sdgx.data_processors.transformers.transform import DataTransformer
-from sdgx.utils.io.csv_utils import *
+You can build the docs on your own computer.

-# Load the data
-demo_data, discrete_cols = get_demo_single_table()
+Step 1: Install docs dependencies

 ```
-
-The real data looks like this:
-
-```
-age workclass fnlwgt ... hours-per-week native-country class
-0 27 Private 177119 ... 44 United-States <=50K
-1 27 Private 216481 ... 40 United-States <=50K
-2 25 Private 256263 ... 40 United-States <=50K
-3 46 Private 147640 ... 40 United-States <=50K
-4 45 Private 172822 ... 76 United-States >50K
-... ... ... ... ... ... ... ...
-32556 43 Local-gov 33331 ... 40 United-States >50K
-32557 44 Private 98466 ... 35 United-States <=50K
-32558 23 Private 45317 ... 40 United-States <=50K
-32559 45 Local-gov 215862 ... 45 United-States >50K
-32560 25 Private 186925 ... 48 United-States <=50K
-
-[32561 rows x 15 columns]
-
-```
-
-```python
-# Define the model
-model = GeneratorCTGAN(epochs=10,\
-transformer= DataTransformer,\
-sampler=DataSamplerCTGAN)
-
-# Train the model
-model.fit(demo_data, discrete_cols)
-
-# Generate synthetic data
-sampled_data = model.sample(1000)
-```
-
-The synthetic data looks like this:
-
+pip install -e .[docs]
 ```
-age workclass fnlwgt ... hours-per-week native-country class
-0 33 Private 276389 ... 41 United-States >50K
-1 33 Self-emp-not-inc 296948 ... 54 United-States <=50K
-2 67 Without-pay 266913 ... 51 Columbia <=50K
-3 49 Private 423018 ... 41 United-States >50K
-4 22 Private 295325 ... 39 United-States >50K
-5 63 Private 234140 ... 65 United-States <=50K
-6 42 Private 243623 ... 52 United-States <=50K
-7 75 Private 247679 ... 41 United-States <=50K
-8 79 Private 332237 ... 41 United-States >50K
-9 28 State-gov 837932 ... 99 United-States <=50K
-```
-
-## Quick Example: Multi-Table Data Synthesis
-
-```python
-# Import the relevant modules
-from sdgx.models.single_table.cwamt import CWAMT
-from sdgx.utils.io.csv_utils import *
-
-# Load the data
-data = get_multi_table()
-```

-The real data looks like this:
+Step 2: Build docs

 ```
-{'tables': {'table1': {'table_name': 'train', 'table_value': Store DayOfWeek Date ... Promo StateHoliday SchoolHoliday
-0 1 5 2015-07-31 ... 1 0 1
-1 2 5 2015-07-31 ... 1 0 1
-2 3 5 2015-07-31 ... 1 0 1
-3 4 5 2015-07-31 ... 1 0 1
-4 5 5 2015-07-31 ... 1 0 1
-... ... ... ... ... ... ... ...
-1017204 1111 2 2013-01-01 ... 0 a 1
-1017205 1112 2 2013-01-01 ... 0 a 1
-1017206 1113 2 2013-01-01 ... 0 a 1
-1017207 1114 2 2013-01-01 ... 0 a 1
-1017208 1115 2 2013-01-01 ... 0 a 1
-
-[1017209 rows x 9 columns]}, 'table2': {'table_name': 'store', 'table_value': Store StoreType ... Promo2SinceYear PromoInterval
-0 1 c ... NaN NaN
-1 2 a ... 2010.0 Jan,Apr,Jul,Oct
-2 3 a ... 2011.0 Jan,Apr,Jul,Oct
-3 4 c ... NaN NaN
-4 5 a ... NaN NaN
-... ... ... ... ... ...
-1110 1111 a ... 2013.0 Jan,Apr,Jul,Oct
-1111 1112 c ... NaN NaN
-1112 1113 a ... NaN NaN
-1113 1114 a ... NaN NaN
-1114 1115 d ... 2012.0 Mar,Jun,Sept,Dec
-
-[1115 rows x 10 columns]}}, 'relations': {'table1-table2': 'store'}}
+cd docs && make html
 ```

-```python
-# Define the model
-model = CWAMT()
-
-# Train the model
-model.fit(data)
+Step 3 (Optional): Use `start-docs-host.sh` to deploy a local http server to view the docs

-# Generate synthetic data
-sampled = model.generate(num_rows=10)
 ```
-
-The synthetic data looks like this:
-
+cd ./dev-tools && ./start-docs-host.sh
 ```
-{'table1': {'table_name': 'train', 'table_value': Store DayOfWeek Date ... Promo StateHoliday SchoolHoliday
-0 3 2 2013-01-01 ... 0 a 1
-1 5 2 2013-01-01 ... 0 a 1
-2 5 2 2013-01-01 ... 0 a 1
-3 6 2 2013-01-01 ... 0 a 1
-4 2 2 2013-01-01 ... 0 a 1
-5 1 2 2013-01-01 ... 0 a 1
-6 7 2 2013-01-01 ... 0 a 1
-7 2 2 2013-01-01 ... 0 a 1
-8 8 2 2013-01-01 ... 0 a 1
-9 5 2 2013-01-01 ... 0 a 1
-10 9 2 2013-01-01 ... 0 a 1
-11 3 2 2013-01-01 ... 0 a 1
-12 2 2 2013-01-01 ... 0 a 1
-13 4 2 2013-01-01 ... 0 a 1
-14 4 2 2013-01-01 ... 0 a 1
-15 7 2 2013-01-01 ... 0 a 1
-16 8 2 2013-01-01 ... 0 a 1
-17 10 2 2013-01-01 ... 0 a 1
-18 3 2 2013-01-01 ... 0 a 1
-19 7 2 2013-01-01 ... 0 a 1
-
-[20 rows x 9 columns]}, 'table2': {'table_name': 'store', 'table_value': Store StoreType ... Promo2SinceYear PromoInterval
-0 1 a ... 2013.0 Jan,Apr,Jul,Oct
-1 2 a ... 2010.0 Jan,Apr,Jul,Oct
-2 3 a ... NaN NaN
-3 4 c ... 2012.0 Jan,Apr,Jul,Oct
-4 5 c ... NaN NaN
-5 6 a ... 2013.0 Jan,Apr,Jul,Oct
-6 7 c ... NaN NaN
-7 8 a ... NaN NaN
-8 9 a ... NaN NaN
-9 10 d ... 2012.0 Mar,Jun,Sept,Dec

-[10 rows x 10 columns]}}
-```
+Then access http://localhost:8910 for docs.
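
Step 3 of the new README depends on `start-docs-host.sh`. Where a shell script is inconvenient (for example on Windows), a rough standard-library equivalent might look like the sketch below. This is not what the script itself does, which the diff does not show. The helper name `serve_docs` is hypothetical, and the port 8910 simply mirrors the URL quoted in the README.

```python
import functools
import http.server
import threading
from pathlib import Path


def serve_docs(html_dir, port=8910):
    """Serve built HTML docs from html_dir on localhost in a background thread.

    Returns the running server; call .shutdown() and .server_close() to stop it.
    Passing port=0 lets the operating system pick a free port.
    """
    html_dir = Path(html_dir)
    if not html_dir.is_dir():
        raise FileNotFoundError(f"{html_dir} not found; build the docs first")
    # Bind the handler to the docs directory instead of the current directory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(html_dir)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_docs("docs/build/html")` would then expose the pages at http://localhost:8910. The path `docs/build/html` is an assumption about where `make html` writes its output; check the project's Makefile for the real location.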
Binary file modified docs/source/_static/sdg_logo.png
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -18,7 +18,7 @@
 # -- Project information -----------------------------------------------------

 project = "Synthetic Data Generator"
-copyright = "2023, hitsz-ids"
+copyright = "2024, hitsz-ids"
 author = "hitsz-ids"

 # The full version, including alpha/beta/rc tags
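
This hunk bumps the copyright year by hand, as the LICENSE change does. A common pattern in Sphinx `conf.py` files, shown here only as a possible alternative and not something this PR does, is to compute the year at build time so the docs footer needs no yearly edit. `copyright_notice` is a hypothetical helper name.

```python
import datetime


def copyright_notice(holder, today=None):
    """Build a Sphinx-style copyright string such as '2024, hitsz-ids'."""
    today = today or datetime.date.today()
    return f"{today.year}, {holder}"


# In docs/source/conf.py these assignments would replace the hard-coded string.
project = "Synthetic Data Generator"
author = "hitsz-ids"
copyright = copyright_notice(author)  # shadows the builtin, as Sphinx expects
```

The trade-off is that the rendered year then reflects the build date rather than the date of the last content change, which some projects prefer to avoid.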
18 changes: 9 additions & 9 deletions docs/source/index.rst
@@ -25,29 +25,29 @@ SDG: Synthetic Data Generator



-Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieving up to 120 times performance improvement, and supports differential privacy and other methods to enhance the security of synthesized data.
+The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table, multi-table data synthesis algorithms and LLM-based synthetic data generation models.

-Synthetic data is generated by machines based on real data and algorithms, it does not contain sensitive information, but can retain the characteristics of real data.
-There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA.
-In practical applications, there is no need to worry about the risk of privacy leakage.
-High-quality synthetic data can also be used in various fields such as data opening, model training and debugging, system development and testing, etc.
+Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.
+
+High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.

 Our CODE/ISSUE/PULL REQUESTS are all hosted on `github <https://github.com/hitsz-ids/synthetic-data-generator>`_. Feel free to contact us if you have any questions.

 Installation
 ====================================================================

-You can use pre-built images to quickly experience the latest features.
+You can install our python package with pip,

 .. code-block:: bash

-    docker pull idsteam/sdgx:latest
+    pip install sdgx

-Or install our python package with pip
+
+Or use pre-built images to quickly experience the latest features.

 .. code-block:: bash

-    pip install sdgx
+    docker pull idsteam/sdgx:latest

 In order to use the GPU for synthesis, you may need to refer to `Torch's GPU installation guide <https://pytorch.org/get-started/locally/>`_.
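
Before working through the GPU guide, it can help to check whether the PyTorch build already installed can see a GPU. The sketch below relies only on `torch.cuda.is_available()`; how SDG itself selects a device is not shown in this diff, and `torch_device` is a hypothetical helper name.

```python
import importlib.util


def torch_device():
    """Return 'cuda' if PyTorch can see a GPU, 'cpu' if not, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check also works without torch installed

    return "cuda" if torch.cuda.is_available() else "cpu"


device = torch_device()
if device is None:
    print("PyTorch is not installed in this environment.")
else:
    print(f"PyTorch reports device: {device}")
```

A result of `cpu` on a machine with a GPU usually points to a CPU-only PyTorch wheel, which is the case the linked installation guide addresses.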
