Skip to content

Dataset and code for EMNLP2020 paper "HybridQA: A Dataset of Multi-Hop Question Answeringover Tabular and Textual Data"

License

Notifications You must be signed in to change notification settings

wenhuchen/HybridQA

Repository files navigation

HybridQA

This repository contains the dataset and code for the EMNLP2020 paper HybridQA: A Dataset of Multi-Hop Question Answeringover Tabular and Textual Data, which is the first large-scale multi-hop question answering dataset on heterogeneous data including tabular and textual data. The whole dataset contains over 70K question-answer pairs based on 13,000 tables, each table is in average linked to 44 passages, more details in https://hybridqa.github.io/.

The questions are annotated to require aggregation of information from both the table and its hyperlinked text passages, which poses challenges to existing homongeneous text-based or KB-based models.

Requirements:

Dataset Visualization

Have fun interacting with the dataset: https://hybridqa.github.io/explore.html

Released data:

The released data contains the following files:

train/dev/test.json: these files are the original files, all annotated by human.
train/dev.traced.json: these files are generated by trace_answer.py to find the answer span in the given evidences.

Preprocess data:

First of all, you should download all the tables and passages into your current folder

git clone https://github.com/wenhuchen/WikiTables-WithLinks

Then, you can either preprocess the data on your own,

python preprocessing.py

or use our preprocessed version from Amazon S3

wget https://hybridqa.s3-us-west-2.amazonaws.com/preprocessed_data.zip
unzip preprocessed_data.zip

Reproduce the reported results

Download the trained bert-base model from Amazon S3:

wget https://hybridqa.s3-us-west-2.amazonaws.com/models.zip
unzip BERT-base-uncased.zip

It will download and generate folder stage1/stage2/stage3/

Using pretrained model to run stage1/stage2:

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --stage1_model stage1/2020_10_03_22_47_34/checkpoint-epoch2 --stage2_model stage2/2020_10_03_22_50_31/checkpoint-epoch2/ --do_lower_case --predict_file preprocessed_data/dev_inputs.json --do_eval --option stage12 --model_name_or_path  bert-large-uncased

This command generates a intermediate result file

Using pretrained model to run stage3:

CUDA_VISIBLE_DEVICES=0 python train_stage3.py --model_name_or_path stage3/2020_10_03_22_51_12/checkpoint-epoch3/ --do_stage3   --do_lower_case  --predict_file predictions.intermediate.json --per_gpu_train_batch_size 12  --max_seq_length 384   --doc_stride 128 --threads 8

This command generates the prediction file

Compute the score

python evaluate_script.py predictions.json released_data/dev_reference.json

Training [Default for Bert-base-uncased model]

Train Stage1:

Running training command for stage1 using BERT-base-uncased as follows:

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_train --train_file preprocessed_data/stage1_training_data.json --learning_rate 2e-6 --option stage1 --num_train_epochs 3.0

Or Running training command for stage1 using BERT-large-uncased as follows:

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --model_name_or_path bert-large-uncased --do_train --train_file preprocessed_data/stage1_training_data.json --learning_rate 2e-6 --option stage1 --num_train_epochs 3.0

Train Stage2:

Running training command for stage2 as follows:

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_train --train_file preprocessed_data/stage2_training_data.json --learning_rate 5e-6 --option stage2 --num_train_epochs 3.0

Or BERT-base-cased/BERT-large-uncased like above.

Train Stage3:

Running training command for stage3 as follows:

CUDA_VISIBLE_DEVICES=0 python train_stage3.py --do_train  --do_lower_case   --train_file preprocessed_data/stage3_training_data.json  --per_gpu_train_batch_size 12   --learning_rate 3e-5   --num_train_epochs 4.0   --max_seq_length 384   --doc_stride 128  --threads 8

Or BERT-base-cased/BERT-large-uncased like above.

Model Selection for Stage1/2:

Model Selction command for stage1 and stage2 as follows:

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --do_lower_case --do_eval --option stage1 --output_dir stage1/[OWN_PATH]/ --predict_file preprocessed_data/stage1_dev_data.json

Evaluation

Model Evaluation Step1 -> Stage1/2:

Evaluating command for stage1 and stage2 as follows (replace the stage1_model and stage2_model path with your own):

CUDA_VISIBLE_DEVICES=0 python train_stage12.py --stage1_model stage1/[OWN_PATH] --stage2_model stage2/[OWN_PATH] --do_lower_case --predict_file preprocessed_data/dev_inputs.json --do_eval --option stage12

The output will be saved into predictions.intermediate.json, which contain all the answers for non hyper-linked cells, with the hyperlinked cells, we need the MRC model in stage3 to extract the span.

Model Evaluation Step2 -> Stage3:

Evaluating command for stage3 as follows (replace the model_name_or_path with your own):

CUDA_VISIBLE_DEVICES=0 python train_stage3.py --model_name_or_path stage3/[OWN_PATH] --do_stage3   --do_lower_case  --predict_file predictions.intermediate.json --per_gpu_train_batch_size 12  --max_seq_length 384   --doc_stride 128 --threads 8

The output is finally saved to predictoins.json, which can be used to calculate F1/EM with reference file.

Computing the score

python evaluate_script.py predictions.json released_data/dev_reference.json

CodaLab Evaluation

We host CodaLab challenge in HybridQA Competition, you should submit your results to the competition to obtain your testing score. The submitted file should first be named "test_answers.json" and then zipped. The required format of the submission file is described as follows:

[
  {
    "question_id": xxxxx,
    "pred": XXX
  },
  {
    "question_id": xxxxx,
    "pred": XXX
  }
]

The reported scores are EM and F1.

Recent Papers

Model Organization Reference Dev-EM Dev-F1 Test-EM Test-F1
S3HQA CASIA Lei et al. (2023) 68.4 75.3 67.9 75.5
MAFiD JBNU & NAVER Lee et al. (2023) 66.2 74.1 65.4 73.6
TACR TIT & NTU Wu et al. (2023) 64.5 71.6 66.2 70.2
UL-20B Google Tay et al. (2022) - - 61.0 -
MITQA IBM & IIT Kumar et al. (2021) 65.5 72.7 64.3 71.9
RHGN SEU Yang et al. (2022) 62.8 70.4 60.6 68.1
POINTR + MATE Google Eisenschlos et al. (2021) 63.3 70.8 62.7 70.0
POINTR + TAPAS Google Eisenschlos et al. (2021) 63.4 71.0 62.8 70.2
MuGER2 JD AI Wang et al. (2022) 57.1 67.3 56.3 66.2
DocHopper CMU Sun et al. (2021) 47.7 55.0 46.3 53.3
HYBRIDER UCSB Chen et al. (2020) 43.5 50.6 42.2 49.9
HYBRIDER-Large UCSB Chen et al. (2020) 44.0 50.7 43.8 50.6
Unsupervised-QG NUS&UCSB Pan et al. (2020) 25.7 30.5 - -

Referenece

If you find this project useful, please use the following format to cite the paper:

@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}

Miscellaneous

If you have any question about the dataset and code, feel free to raise a github issue or shoot me an email. Thanks!

About

Dataset and code for EMNLP2020 paper "HybridQA: A Dataset of Multi-Hop Question Answeringover Tabular and Textual Data"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published