Commit e9f2624 (0 parents): 36 changed files with 4,207 additions and 0 deletions.
**.gitignore** (new file, +178 lines)

```
history/
data/
resources/
source/
trash/
*.ini
*.DS_Store
*.gitkeep
*.pkl
*.xlsx
*.json
*.0
*.jsonl
*.tar.gz
*.zip
*.txt
*.csv
*.p
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
```
**CODE_OF_CONDUCT.md** (new file, +9 lines)

# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [[email protected]](mailto:[email protected]) with questions or concerns
**LICENSE** (new file, +21 lines)

MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
**README.md** (new file, +135 lines)

# Table-GPT: Table-tuned GPT for Diverse Table Tasks

This repository contains the source code to generate table-tuning datasets for the SIGMOD'24 paper [Table-GPT: Table-tuned GPT for Diverse Table Tasks](https://arxiv.org/abs/2310.09263).

## Data Availability
Our training and test table-finetuning data for Table-GPT can be downloaded directly from https://huggingface.co/datasets/LipengCS/Table-GPT.

## Task descriptions
We collect (or synthesize) 18 diverse table-related tasks, summarized in the table below. There are 14 training tasks (T-5 to T-18) and 9 test tasks (T-1 to T-9). Tasks T-1 to T-4 are unseen hold-out tasks, used to evaluate Table-GPT's ability to generalize to completely new and unseen tasks, while tasks T-10 to T-18 are used for training only.

| **Task Name** | **Task Description** | **Task Category** | **Train/Test** |
|---|---|---|---|
| T-1: Missing-value identification (MV) | Identify the row and column position of the only missing cell in a given table | Table understanding | Test only |
| T-2: Column-finding (CF) | Identify the column name of a specific value that appears only once in a given table | Table understanding | Test only |
| T-3: Table-QA (TQA) | Answer a natural-language question based on the content of a table | Table QA | Test only |
| T-4: Column type annotation (CTA) | Find the semantic type of a column from a given list of choices | Table understanding | Test only |
| T-5: Row-to-row transform (R2R) | Transform table data based on input/output examples | Data transformation | Train/Test |
| T-6: Entity matching (EM) | Match rows from two tables that refer to the same real-world entity | Table matching | Train/Test |
| T-7: Schema matching (SM) | Match columns from two tables that have the same meaning | Table matching | Train/Test |
| T-8: Data imputation (DI) | Predict the missing value of a cell based on the table context | Data cleaning | Train/Test |
| T-9: Error detection (ED) | Detect data values in a table that are likely errors caused by misspelling | Data cleaning | Train/Test |
| T-10: List extraction (LE) | Extract a structured table from a list that lacks explicit column delimiters | Data transformation | Train only |
| T-11: Header value matching (HVM) | Match column headers with their data values drawn from the same table | Table matching | Train only |
| T-12: Natural-language to SQL (NS) | Translate a natural-language question on a table into a SQL query | NL-to-SQL | Train only |
| T-13: Table summarization (TS) | Produce a natural-language summary of the content of a table | Data augmentation | Train only |
| T-14: Column augmentation (CA) | Augment a table with additional columns compatible with a given table | Data augmentation | Train only |
| T-15: Row augmentation (RA) | Augment a table with additional rows compatible with a given table | Data augmentation | Train only |
| T-16: Row/column swapping (RCSW) | Manipulate a given table by swapping the positions of two rows or columns | Table manipulation | Train only |
| T-17: Row/column filtering (RCF) | Manipulate a given table by filtering to given rows or columns | Table manipulation | Train only |
| T-18: Row/column sorting (RCS) | Manipulate a given table by sorting given rows or columns | Table manipulation | Train only |

## Data Generation
To generate training or test table-finetuning data from the source data, we provide `generate_tablegpt_data.py`, which loads the source data and transforms it into training and test data for finetuning a large language model. Our generated data are released and can be downloaded from [here](https://huggingface.co/datasets/LipengCS/Table-GPT).

**Step 1.** Download the source data (source.zip) from [here](https://huggingface.co/datasets/LipengCS/Table-GPT/blob/main/source.zip) and unzip it.
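
A sketch of this step on the command line; the `resolve/main` download path is the standard Hugging Face direct-download pattern and is an assumption here, not a URL taken from this repository.

```
# Assumed direct-download URL (Hugging Face "resolve" pattern); unzip in place.
wget https://huggingface.co/datasets/LipengCS/Table-GPT/resolve/main/source.zip
unzip source.zip
```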

**Step 2.** Run the following command to generate training data for a specific table task, or run `bash generate_all_train.sh` to generate all training data.

```
python generate_tablegpt_data.py --mode train --task <task_name> --source_dir <source_data_dir> --prob_train_fewshot <prob> --save_dir <save_data_dir> --seed <integer>
```

- `--task` specifies the training task, chosen from "EntityMatching", "SchemaMatching", "DataImputation", "ErrorDetection", "ListExtraction", "HeaderValueMatching", "NL2SQL", "TableSummary", "ColumnAugmentation", "RowAugmentation", "RowColumnSwapping", "RowColumnFiltering", "RowColumnSorting", and "Row2RowTransformation", corresponding to T-5 to T-18.
- `--source_dir` specifies the path of the source data downloaded in Step 1.
- `--prob_train_fewshot` specifies the probability of few-shot prompting examples in the generated training data. If an example is selected for few-shot prompting, the number of few-shot samples is chosen at random between 1 and 10.
- `--save_dir` specifies where the generated data are saved.
- `--seed` specifies the seed that controls randomness; a concrete example invocation is shown below.
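
For example, a concrete invocation for one training task might look like the following; the directories `./source` and `./data/train` are placeholders for wherever you unzipped the source data and want the output written.

```
python generate_tablegpt_data.py --mode train --task EntityMatching --source_dir ./source --prob_train_fewshot 0.5 --save_dir ./data/train --seed 1
```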

**Step 3.** Run the following command to generate test data for a specific table task, or run `bash generate_all_test.sh` to generate all test data.
```
python generate_tablegpt_data.py --mode test --task <task_name> --source_dir <source_data_dir> --num_test_fewshot_samples <integer> --save_dir <save_data_dir> --seed <integer>
```

- `--task` specifies the test task, chosen from "ColumnFinding", "MissingValueIdentification", "TableQuestion", "ColumnTypeAnnotation", "EntityMatching", "SchemaMatching", "DataImputation", "ErrorDetection", and "Row2RowTransformation", corresponding to T-1 to T-9.
- `--source_dir` specifies the path of the source data downloaded in Step 1.
- `--num_test_fewshot_samples` specifies the number of few-shot prompting examples in the generated test data. Set it to zero to generate zero-shot test data.
- `--save_dir` specifies where the generated data are saved.
- `--seed` specifies the seed that controls randomness; a concrete example invocation is shown below.
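
For example, the following invocation would generate zero-shot test data for the error-detection task; as above, `./source` and `./data/test` are placeholder paths.

```
python generate_tablegpt_data.py --mode test --task ErrorDetection --source_dir ./source --num_test_fewshot_samples 0 --save_dir ./data/test --seed 1
```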

## Evaluation and Reproduction
We provide an `Evaluator` class that computes the performance score for each task, which can be used to evaluate the performance of each model.

We provide `evaluate_tablegpt_result.py` to evaluate model performance on a specific table task. We also provide the results generated by our Table-GPT model, which can be downloaded from [here](https://huggingface.co/datasets/LipengCS/Table-GPT/tree/main/results). See `reproduce.ipynb` for the steps to reproduce the main results in our paper.

## Documentation
We provide a `DataGenerator` class to generate training or test data for table tasks.
```python
class DataGenerator:
    def __init__(
        self,
        table_task: Union[str, BaseTableTask],
        mode: str = "train",
        num_test_fewshot_samples: int = 5,
        prob_train_fewshot: float = 0.5,
        max_num_train_fewshot_samples: int = 10,
        min_num_train_fewshot_samples: int = 1,
        max_size: Optional[int] = None,
        max_token_length: int = 4096,
        drop_long_prompt: bool = False,
        random_state: int = 1,
        verbose: bool = False,
        use_random_template: bool = False,
        use_cot: bool = False,
        n_jobs: int = 1,
        augment: bool = False,
    ):
```
**Parameters**
- **table_task** (str or BaseTableTask): Specifies the type of table task. Use the task name for built-in table tasks, or pass a `BaseTableTask` object for a customized table task.
- **mode** (str): Specifies whether to generate training data or test data. Possible values are "train" and "test". Default is "train".
- **num_test_fewshot_samples** (int): Number of few-shot samples to use during testing. Default is 5.
- **prob_train_fewshot** (float): Probability of including few-shot samples during training. Default is 0.5.
- **max_num_train_fewshot_samples** (int): Maximum number of few-shot samples to use during training. Default is 10.
- **min_num_train_fewshot_samples** (int): Minimum number of few-shot samples to use during training. Default is 1.
- **max_size** (Optional[int]): Maximum number of generated examples. Default is None, meaning no limit.
- **max_token_length** (int): Maximum token length for the generated data; see `drop_long_prompt` for how over-length prompts are handled. Default is 4096.
- **drop_long_prompt** (bool): If True, a generated prompt longer than `max_token_length` is dropped. If False, the long prompt is kept and a warning is given. Default is False.
- **random_state** (int): Seed for data generation. Default is 1.
- **verbose** (bool): If True, enables verbose output for debugging and logging. Default is False.
- **use_random_template** (bool): If True, uses random templates for data generation. Default is False.
- **use_cot** (bool): If True, uses chain-of-thought reasoning in data generation where supported. Currently, only entity matching and error detection support CoT. Default is False.
- **n_jobs** (int): Number of jobs to run in parallel for data generation. Default is 1.
- **augment** (bool): If True, uses data augmentation (e.g., column permutation) where supported. Currently, only "EntityMatching", "SchemaMatching", "DataImputation", "ErrorDetection", and "HeaderValueMatching" support data augmentation. Default is False.

**Methods**
```python
def generate_data(
    self,
    test_data_dir: str,
    train_data_dir: Optional[str] = None
) -> pd.DataFrame:
```
**Parameters**
- **test_data_dir** (str): The folder containing all test data.
- **train_data_dir** (Optional[str]): The folder containing all training data (used for generating few-shot examples).

**Returns**
- **result** (pd.DataFrame): A dataframe containing prompts and completions for all data examples.
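
As a rough usage sketch of the class documented above: the import path, source-data directory layout, and output filename below are assumptions for illustration only; `generate_tablegpt_data.py` shows the canonical way these pieces are wired together.

```python
# A minimal sketch, assuming a hypothetical import path and source-data layout.
from data_generator import DataGenerator  # assumed module name

generator = DataGenerator(
    table_task="EntityMatching",  # one of the built-in task names
    mode="train",                 # generate training data
    prob_train_fewshot=0.5,       # half of the examples include few-shot demonstrations
    max_token_length=4096,
    random_state=1,
)

# The directories below are placeholders; point them at the unzipped source data.
df = generator.generate_data(
    test_data_dir="source/EntityMatching/test",
    train_data_dir="source/EntityMatching/train",
)

# The returned DataFrame holds one prompt/completion pair per generated example.
df.to_json("entity_matching_train.jsonl", orient="records", lines=True)
```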
**SECURITY.md** (new file, +41 lines)

<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).

If you prefer to submit without logging in, send email to [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->