This repo contains a fictitious data generator that handles relationships between tables and between fields within tables.
Steps to Generate Synthetic Data
- Create a YAML file (example) that specifies your data relationships and values
- Run scripts/generate_code.py to create the Python script based on the YAML
- Run the Python script to produce fictitious output
- Clone repo
git clone https://github.com/llooker/looker-datagenerator.git
- Set up a virtual environment
python3 -m venv MY_ENV
- Install dependencies
pip install -r requirements.txt
- Create a directory for your dataset if it doesn't exist
- Inside your dataset directory, create yaml, code, and output directories
The file tree for your dataset should look like this:
datasets/
└── $DATASET/
    ├── code/
    │   ├── yaml1.py
    │   ├── yaml2.py
    │   └── ...
    ├── output/
    │   ├── table1.csv
    │   ├── table2.csv
    │   └── ...
    └── yaml/
        ├── schema1.yaml
        ├── schema2.yaml
        └── ...
| Folder | Description |
|---|---|
| $DATASET | contains your dataset files |
| code | contains the Python code used to generate your data |
| yaml | contains the YAML file(s) that define your schema and data relationships |
| output | contains your raw generated data output (CSV) |
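The directory layout described above can be created by hand or with a few lines of Python. The following is a minimal sketch, not part of the repo: the function name `create_dataset_layout` and the default `datasets` root are assumptions made for illustration.

```python
from pathlib import Path

# Subdirectories named in this README.
SUBDIRS = ("yaml", "code", "output")


def create_dataset_layout(dataset: str, root: str = "datasets") -> Path:
    """Create <root>/<dataset>/{yaml,code,output} and return the dataset path.

    Hypothetical helper: the default root "datasets" is an assumption taken
    from the file tree above; adjust it to match your checkout.
    """
    dataset_dir = Path(root) / dataset
    for name in SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing dataset.
        (dataset_dir / name).mkdir(parents=True, exist_ok=True)
    return dataset_dir
```

Calling `create_dataset_layout("my_dataset")` from the directory that holds `datasets/` would create the three empty subdirectories and leave any existing files untouched.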
- Within your $DATASET/yaml directory, add your custom YAML file(s)
- From the root directory, run generate_code.py to create the Python code for your YAML file(s)
python scripts/generate_code.py \
--dataset $DATASET \
--yaml_file $YAML_FILE \
--num_of_records $NUM_OF_RECORDS
This will create your yaml.py file within your dataset's code directory, $DATASET/code/
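When a dataset has several YAML files, the command above can be repeated from a small wrapper. This is a sketch under stated assumptions: the helper names `build_generate_command` and `generate_all` are hypothetical, only the three flags shown above are used, and the `--yaml_file` value is passed through unchanged because this README does not say whether it expects a bare file name or a path.

```python
import subprocess
import sys
from typing import List


def build_generate_command(dataset: str, yaml_file: str, num_of_records: int) -> List[str]:
    """Return the argument list for scripts/generate_code.py.

    Hypothetical helper; it only uses the flags documented in this README.
    """
    return [
        sys.executable,  # run with the interpreter of the active virtual environment
        "scripts/generate_code.py",
        "--dataset", dataset,
        "--yaml_file", yaml_file,
        "--num_of_records", str(num_of_records),
    ]


def generate_all(dataset: str, yaml_files: List[str], num_of_records: int) -> None:
    """Run the generator once per YAML file; call this from the repo root."""
    for yaml_file in yaml_files:
        # check=True raises CalledProcessError if the generator exits non-zero.
        subprocess.run(
            build_generate_command(dataset, yaml_file, num_of_records),
            check=True,
        )
```

Building the argument list separately from running it keeps the command easy to print and inspect before anything is executed.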
- Run your Python code to generate the output(s) of your dataset. From the root directory (replace $DATASET with your dataset name and MY_CUSTOM_CODE.py with the name of your dataset's .py file):
python scripts/datasets/$DATASET/code/MY_CUSTOM_CODE.py
Running your code will generate raw files for your schema in the $DATASET/output directory.
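A quick way to sanity-check the generated files is to count the rows in each CSV in the output directory. The sketch below is illustrative only: `count_csv_rows` is a hypothetical helper, and it assumes each output file starts with a header row, which this README does not confirm.

```python
import csv
from pathlib import Path
from typing import Dict


def count_csv_rows(output_dir: str) -> Dict[str, int]:
    """Map each CSV file name in output_dir to its number of data rows."""
    counts: Dict[str, int] = {}
    for csv_path in sorted(Path(output_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row is a header, so it is not counted as data.
        counts[csv_path.name] = max(len(rows) - 1, 0)
    return counts
```

Comparing these counts with the value passed as `--num_of_records` can help spot a run that stopped early, although the exact relationship between that flag and the per-table row count depends on your YAML schema.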