Introduction and data download page for KaggleDBQA, a challenging text-to-SQL dataset.
KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.
It expands upon contemporary cross-domain text-to-SQL datasets in three key aspects:
(1) Its databases are pulled from real-world data sources and not normalized.
(2) Its questions are authored in environments that mimic natural question answering.
(3) It also provides database documentation that contains rich in-domain knowledge.
Each file in examples/ contains the following fields:
- question: the natural language question
- question_toks: the natural language question tokens
- db_id: the id of the database this question is addressed to
- query: the SQL query corresponding to the question
- query_toks: the SQL query tokens corresponding to the question
- sql: the parsed result of this SQL query, produced by process_sql.py from Spider

An example entry:
{
  "db_id": "StudentMathScore",
  "query": "SELECT T1.school_district FROM FINREV_FED_17 as T1 JOIN FINREV_FED_KEY_17 as T2 ON T1.state_code = T2.state_code WHERE T2.state = \"Wisconsin\" ORDER BY T1.t_fed_rev DESC LIMIT 1",
  "query_toks": ["SELECT", "T1.school_district", "FROM", "FINREV_FED_17", ...],
  "question": "Which school district receive the most of federal revenue through state in Wisconsin?",
  "question_toks": ["Which", "school", "district", ...],
  "sql": {
    "from": {
      "table_units": [
        ["table_unit", 0],
        ["table_unit", 2]
      ],
      ...
    },
    ...
  }
}
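For reference, here is a minimal sketch of loading and inspecting one of these files. The exact file name is an assumption following the examples/*_test naming described in the evaluation section below; adjust it to the files in your checkout.

import json

# Hypothetical file name following the examples/{db_id}_{split}.json pattern.
with open("examples/StudentMathScore_test.json") as f:
    examples = json.load(f)

for ex in examples:
    print(ex["db_id"], "|", ex["question"])
    print("  ", ex["query"])
    # ex["sql"] holds the Spider-style parse produced by process_sql.py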
KaggleDBQA_tables.json contains the following information for each database:
- db_id: database id
- table_names_original: original table names
- table_names: table names. The only preprocessing steps are (1) tokenizing names based on snake case and camel case and (2) lowercasing.
- column_names_original: original column names. Each column entry is given as [table_index, name], where table_index is the index (in table_names) of the table this column belongs to, and name is the name of the column.
- column_names: column names, preprocessed in the same way as table_names.
- column_names_manually_normalized_alternative: normalized column names for an alternative evaluation setting. The normalization was done manually, using the column descriptions to clarify each column, without introducing any extra knowledge into the column names except for the expansion of abbreviations (e.g., t_fed_rev -> total federal revenue). Please evaluate parsers in the non-normalized setting by default, and use the normalized column names for research exploration of model capabilities (as in our paper).
- column_descriptions: textual descriptions that explain the meaning of the columns.
- db_overview: a textual description of the general content of each database.
- value_enums: the set of possible values in the categorical columns, together with explanations of their meanings.
- column_types: the data type of each column.
- foreign_keys: foreign keys in the database. An entry such as [3, 8] gives two column indices into column_names; these two columns are foreign keys linking two different tables.
- primary_keys: primary keys in the database. Each number is an index into column_names.

An example entry:
{
  "column_descriptions": [
    ...,
    "Total federal revenue through the state to each school district.",
    "Federal revenue through the state- Title 1 (no child left behind act).",
    "Federal revenue through the state- Child Nutrition A",
    ...
  ],
  "column_names": [
    ..., [0, "t fed rev"], [0, "c14"], [0, "c25"], ...
  ],
  "column_names_normalized": [
    ..., [0, "total federal revenue"], [0, "child 14"], [0, "child 25"], ...
  ],
  "column_names_original": [
    ..., [0, "t_fed_rev"], [0, "c14"], [0, "c25"], ...
  ],
  "column_types": [..., "number", "number", "number", ...],
  "db_id": "StudentMathScore",
  "foreign_keys": [],
  "primary_keys": [],
  "table_names": ["finrev fed 17", "ndecoreexcel math grade8", "finrev fed key 17"],
  "table_names_original": ["FINREV_FED_17", "NDECoreExcel_Math_Grade8", "FINREV_FED_KEY_17"]
}
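To illustrate how these fields fit together, here is a small sketch that resolves the index-based references in primary_keys and foreign_keys against the column and table name lists. It assumes (as in Spider-style tables files) that the top-level JSON value is a list of per-database dicts.

import json

with open("KaggleDBQA_tables.json") as f:
    databases = json.load(f)  # assumed: a list of per-database dicts

for db in databases:
    cols = db["column_names_original"]   # entries are [table_index, name]
    tabs = db["table_names_original"]

    for ci in db["primary_keys"]:        # each entry indexes column_names
        t, name = cols[ci]
        print(f"{db['db_id']}: PRIMARY KEY {tabs[t]}.{name}")

    for a, b in db["foreign_keys"]:      # each entry pairs two column indices
        ta, ca = cols[a]
        tb, cb = cols[b]
        print(f"{db['db_id']}: {tabs[ta]}.{ca} REFERENCES {tabs[tb]}.{cb}")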
All database content is stored in SQLite database files under databases/. Please download the databases here.
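Once downloaded, the databases can be queried with Python's built-in sqlite3 module. The sketch below reuses the gold query from the example above; the databases/{db_id}/{db_id}.sqlite layout is an assumption (mirroring Spider), so verify the actual paths in the downloaded archive.

import sqlite3

db_id = "StudentMathScore"
# Assumed Spider-style layout; verify against the downloaded archive.
conn = sqlite3.connect(f"databases/{db_id}/{db_id}.sqlite")

query = (
    "SELECT T1.school_district FROM FINREV_FED_17 AS T1 "
    "JOIN FINREV_FED_KEY_17 AS T2 ON T1.state_code = T2.state_code "
    "WHERE T2.state = 'Wisconsin' ORDER BY T1.t_fed_rev DESC LIMIT 1"
)
print(conn.execute(query).fetchall())
conn.close()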
KaggleDBQA is primarily intended as a test suite for realistic text-to-SQL semantic parsing. We introduce two evaluation settings: (1) Plain-Testing and (2) Few-Shot Annotation/Finetuning. We randomly split the examples in each database into two subsets (30/70 ratio): fewshot and test.
Use the files examples/*_test to evaluate parsers. You can use the in-domain examples contained in examples/*_fewshot to improve your model (e.g., by fine-tuning or other few-shot training methods) before testing on the examples in examples/*_test.
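Putting the two settings together, here is a minimal sketch: load the fewshot split for in-domain adaptation and the test split for evaluation. The file names are assumed to follow the {db_id}_{split}.json pattern, and the execution-match helper is only an informal sanity check, not the official evaluation metric.

import json
import sqlite3
from collections import Counter

def load_split(db_id, split):
    # Assumed file pattern: examples/{db_id}_{split}.json
    with open(f"examples/{db_id}_{split}.json") as f:
        return json.load(f)

def execution_match(db_path, gold_sql, pred_sql):
    # Informal sanity check: do gold and predicted SQL return the same
    # multiset of rows? This is NOT the official evaluation metric.
    conn = sqlite3.connect(db_path)
    try:
        gold = conn.execute(gold_sql).fetchall()
        try:
            pred = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # unexecutable predictions count as wrong
        return Counter(gold) == Counter(pred)
    finally:
        conn.close()

fewshot = load_split("StudentMathScore", "fewshot")  # in-domain adaptation data
test = load_split("StudentMathScore", "test")        # held-out evaluation data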
If you use the dataset in your work, please cite our paper.
@inproceedings{lee-2021-kaggle-dbqa,
title = "{KaggleDBQA}: Realistic Evaluation of Text-to-{SQL} Parsers",
author = "Lee, Chia-Hsuan and
Polozov, Oleksandr and
Richardson, Matthew",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.176",
pages = "2261--2273"
}
Please contact Chia-Hsuan Lee (chiahlee[at]uw.edu) for questions and suggestions.