
Learning to Imagine: Integrating Counterfactual Thinking in Neural Discrete Reasoning

2023-09-19 Update: Uploaded trained checkpoints.

This repository contains the TAT-HQA dataset and the code for the ACL 2022 paper (view PDF).


The hypothetical questions in TAT-HQA are created from the factual questions of TAT-QA.

dataset_raw contains a mixture of TAT-QA and TAT-HQA data. TAT-HQA questions are marked with counterfactual in answer_type, and the corresponding original TAT-QA question is recorded in rel_question. As in TAT-QA, the data files are organized by table: each entry holds a table, a list of accompanying passages, and a list of questions. Refer to our website for a detailed description of the data format.

Each question contains the following keys:

  • uid: the unique question id.
  • order: the order in the question list under the table and passages.
  • question: a question string.
  • answer: a list of answer strings. The model is expected to predict all the answers.
  • derivation: for arithmetic questions, an equation of the answer calculation process.
  • answer_type: 5 types, span, multi-span, arithmetic, count or counterfactual.
  • answer_from: 3 types, table, text or table-text.
  • rel_paragraph: the order(s) of the relevant passage(s).
  • req_comparison: True or False, whether the arithmetic question requires comparison.
  • scale: the model is also expected to predict a correct scale ('', thousand, million, billion, or percent) for each question. If the scale prediction is incorrect, the answer is evaluated as incorrect.
  • rel_question (for hypothetical questions): the order of the corresponding factual question. Usually, it is the previous question.
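
As an illustration of these keys, the following minimal sketch pairs each hypothetical question with its factual counterpart. It assumes, following the TAT-QA layout, that a data file is a JSON list of entries that each carry a `questions` list, and that `rel_question` stores the `order` of the factual question within the same entry. The function name `pair_hypothetical_questions` and the sample record are made up for illustration.

```python
def pair_hypothetical_questions(entries):
    """Return (factual, hypothetical) question pairs from a list of table entries."""
    pairs = []
    for entry in entries:
        questions = entry.get("questions", [])
        # Index the questions under this table by their `order` field.
        by_order = {q["order"]: q for q in questions}
        for q in questions:
            if q.get("answer_type") != "counterfactual":
                continue
            # Assumption: `rel_question` holds the `order` of the factual question.
            factual = by_order.get(q.get("rel_question"))
            if factual is not None:
                pairs.append((factual, q))
    return pairs


# Made-up record that only mimics the documented keys.
sample = [{
    "questions": [
        {"uid": "q1", "order": 1, "answer_type": "span",
         "question": "What was the revenue in 2019?"},
        {"uid": "q2", "order": 2, "answer_type": "counterfactual",
         "rel_question": 1,
         "question": "What would the revenue in 2019 be if it increased by 10%?"},
    ]
}]

pairs = pair_hypothetical_questions(sample)
for factual, hypothetical in pairs:
    print(factual["uid"], "->", hypothetical["uid"])
```

For real data, `entries` would come from `json.load` on one of the files in dataset_raw, provided the file follows the assumed layout.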

For our implementation of the paper method, we pre-process dataset_raw to generate some extra fields. The facts and mapping are generated in the same way as in TAT-QA (used for training the TagOP baseline). Apart from these, we extract the assumption substring from the hypothetical question (question_if_part), and we heuristically generate the if_op (SWAP, ADD, MINUS, DOUBLE, INCREASE_PERC, etc.) and if_tagging (the operands of if_op) for the Learning-to-Imagine module. The preprocessed data is stored in dataset_extra_field and split into TAT-QA and TAT-HQA, saved in orig and counter respectively.
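
To make question_if_part and if_op more concrete, here is a toy sketch of keyword-based extraction. It is not the preprocessing code of this repository: the regular expressions, the keyword rules, and the function names `extract_if_part` and `guess_if_op` are assumptions for illustration, and only the operation labels are taken from the list above.

```python
import re

# Hypothetical keyword rules, checked in order (most specific first).
IF_OP_RULES = [
    ("INCREASE_PERC", re.compile(r"increas\w*\s+by\s+[\d.,]+\s*(?:%|percent)", re.I)),
    ("DOUBLE", re.compile(r"\bdoubl\w*", re.I)),
    ("SWAP", re.compile(r"\bswap\w*", re.I)),
    ("ADD", re.compile(r"increas\w*\s+by\b", re.I)),
    ("MINUS", re.compile(r"decreas\w*\s+by\b", re.I)),
]

# Match from "if" up to the next clause-ending comma or question mark.
# A comma directly followed by a digit (e.g. "1,000") stays inside the match.
IF_PART = re.compile(r"\bif\b(?:[^,?]|,(?=\d))*", re.I)


def extract_if_part(question):
    """Return the assumption substring of a hypothetical question, or None."""
    match = IF_PART.search(question)
    return match.group(0).strip() if match else None


def guess_if_op(if_part):
    """Return the first operation label whose rule matches, or None."""
    for label, pattern in IF_OP_RULES:
        if pattern.search(if_part):
            return label
    return None


question = "What would the total cost be if the 2019 cost doubled?"
if_part = extract_if_part(question)
print(if_part, "=>", guess_if_op(if_part))
```

Ordering the rules from specific to general keeps a phrase such as "increased by 10%" from being labelled as a plain ADD.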

The test data of TAT-HQA is stored in dataset_test_hqa/tathqa_dataset_dev.json.


The paper method is built upon TagOp (Paper) and is further fine-tuned on TAT-HQA from the TagOp checkpoint. Please follow the process in TAT-QA to create the conda environment and download roberta.large. Our CUDA version is 10.2 and the CUDA driver version is 440.33.01. We use one 32GB GPU.

Training & Testing

Preprocess Dataset

Process the training/validation data by specifying 'counter' for TAT-HQA, 'orig' for TAT-QA, and 'both' for a mix of them.

PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/ --input_path ./dataset_extra_field/[counter/orig/both] --output_dir tag_op/data/[counter/orig/both] --encoder roberta --mode [train/dev] --roberta_model roberta.large

Process the test data of TAT-HQA.

PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/ --input_path ./dataset_test_hqa --output_dir tag_op/data/test --encoder roberta --mode dev --data_format tathqa_dataset_{}.json --roberta_model roberta.large


First, we obtain a base TagOp model trained on a mix of TAT-QA and TAT-HQA data by running the following command. Set --roberta_model to the path of the RoBERTa model.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/ --data_dir tag_op/data/both --save_dir tag_op/model_L2I --batch_size 32 --eval_batch_size 32 --max_epoch 50 --warmup 0.06 --optimizer adam --learning_rate 5e-4  --weight_decay 5e-5 --seed 123 --gradient_accumulation_steps 4 --bert_learning_rate 1.5e-5 --bert_weight_decay 0.01 --log_per_updates 100 --eps 1e-6  --encoder roberta --test_data_dir tag_op/data/both/ --roberta_model roberta.large --cross_attn_layer 0 --do_finetune 0

The model_L2I checkpoint can be obtained from this link.

To run inference on the validation set of TAT-QA or TAT-HQA with this model, run

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/ --data_dir tag_op/data/[counter/orig] --test_data_dir tag_op/data/[counter/orig] --save_dir tag_op/model_L2I --eval_batch_size 32 --model_path tag_op/model_L2I --encoder roberta --roberta_model roberta.large --cross_attn_layer 0

The predicted answer file for the validation set is saved at tag_op/model_L2I/answer_dev.json.

Then, we fine-tune the base TagOp model on TAT-HQA using the Learning-to-Imagine module, by setting --do_finetune 1, --model_finetune_from tag_op/model_L2I, and --cross_attn_layer 3.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/ --data_dir tag_op/data/counter --save_dir tag_op/model_ft_hqa_on_L2I_50e --batch_size 32 --eval_batch_size 32 --max_epoch 50 --warmup 0.06 --optimizer adam --learning_rate 5e-5  --weight_decay 5e-5 --seed 123 --gradient_accumulation_steps 4 --bert_learning_rate 1.5e-6 --bert_weight_decay 0.01 --log_per_updates 100 --eps 1e-6  --encoder roberta --test_data_dir tag_op/data/counter/ --roberta_model roberta.large --cross_attn_layer 3 --do_finetune 1 --model_finetune_from tag_op/model_L2I

The model_ft_hqa_on_L2I_50e checkpoint can be obtained from this link.


To get the performance on the validation set of TAT-HQA, run the following command and obtain the prediction file at tag_op/model_ft_hqa_on_L2I_50e/answer_dev.json.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/ --data_dir tag_op/data/counter --test_data_dir tag_op/data/counter --save_dir tag_op/model_ft_hqa_on_L2I_50e --eval_batch_size 32 --model_path tag_op/model_ft_hqa_on_L2I_50e --encoder roberta --roberta_model roberta.large --cross_attn_layer 3

To obtain the test results on TAT-HQA, run the following command and obtain the prediction file at tag_op/model_ft_hqa_on_L2I_50e/answer_dev.json.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/ --data_dir tag_op/data/test --test_data_dir tag_op/data/test --save_dir tag_op/model_ft_hqa_on_L2I_50e --eval_batch_size 32 --model_path tag_op/model_ft_hqa_on_L2I_50e --encoder roberta --roberta_model roberta.large --cross_attn_layer 3


To evaluate the validation results of TAT-HQA with the evaluation script, run

python dataset_extra_field/counter/tatqa_and_hqa_field_dev.json tag_op/model_ft_hqa_on_L2I_50e/answer_dev.json 0

The 1st argument is the gold answer path, and the 2nd argument is the prediction file path.
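
The role of the scale in scoring can be illustrated with a simplified exact-match check. This is a sketch, not the evaluation script of this repository, and it skips F1 and any answer normalization the script may apply. It assumes that predictions map each uid to an `[answer, scale]` pair, as in TAT-QA prediction files; the function names and sample data are hypothetical.

```python
def _normalize(answer):
    """Turn an answer (string or list of strings) into a sorted list of lowercase strings."""
    items = answer if isinstance(answer, list) else [answer]
    return sorted(str(item).strip().lower() for item in items)


def exact_match_with_scale(gold_questions, predictions):
    """Percentage of questions whose answer and scale are both predicted exactly."""
    if not gold_questions:
        return 0.0
    correct = 0
    for q in gold_questions:
        if q["uid"] not in predictions:
            continue  # a missing prediction counts as incorrect
        pred_answer, pred_scale = predictions[q["uid"]]
        same_answer = _normalize(pred_answer) == _normalize(q["answer"])
        same_scale = pred_scale == q["scale"]
        # A correct answer with a wrong scale is still scored as incorrect.
        correct += int(same_answer and same_scale)
    return 100.0 * correct / len(gold_questions)


# Made-up gold questions and predictions.
gold = [
    {"uid": "q1", "answer": ["12.5"], "scale": "million"},
    {"uid": "q2", "answer": ["3"], "scale": ""},
]
preds = {
    "q1": [["12.5"], "thousand"],  # right number, wrong scale
    "q2": [["3"], ""],
}

score = exact_match_with_scale(gold, preds)
print(score)
```

In this example only q2 is counted as correct, because the prediction for q1 has the right number but the wrong scale.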

The gold answers for the test set of TAT-HQA are not released. Please refer to our website for the leaderboard details.

Checkpoint Results

The trained checkpoints are released. With the same environment, their performance on TAT-HQA is as follows.

| model name | dev EM | dev F1 | test EM | test F1 |
| --- | --- | --- | --- | --- |
| model_ft_hqa_on_L2I_50e | 55.7 | 56.3 | 55.6 | 56.3 |
| model_L2I | 54.7 | 55.3 | 52.6 | 53.2 |


Please kindly add the following citation if you find our work helpful. Thanks!

  title={Learning to Imagine: Integrating Counterfactual Thinking in Neural Discrete Reasoning},
  author={Li, Moxin and Feng, Fuli and Zhang, Hanwang and He, Xiangnan and Zhu, Fengbin and Chua, Tat-Seng},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},

Any Questions?

Kindly contact us at [email protected] for any issue. Thank you!