This repository contains the code for our research paper "SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning", which has been accepted to the main conference of ACL 2024.
[10/01/2024] We have released the code.
[05/31/2024] Our paper has been accepted to the main conference of ACL 2024.
- Release the checkpoint
- Release the code and data
- Python 3.10
- Ubuntu 22.04
- Python Packages
conda create -n seer python=3.10
conda activate seer
pip install -r requirements.txt
The data folder includes the EntailmentBank, EntailmentBankQA, STREET, eQASC, and eOBQA datasets.
See data/Readme.md for details.
Please download the retrieval, entailment, and other modules to the ./exp directory. See exp/Readme.md for details.
python ./supervised_warm_up/preprocess_data/proof_to_step_data.py
python ./supervised_warm_up/preprocess_data/warm_state_data.py
cd ./supervised_warm_up/scripts
bash task1.sh
You can directly use the Policy for Task 3 (Iter0.zip) from FAME in ./exp as the warm-up model.
cd ./etree_task1/scripts
bash SEER_task1.sh
cd ./etree_task2/scripts
bash SEER_task2.sh
cd ./etree_task3/scripts
bash SEER_task3.sh
For trajectory rollout, action generation (Policy) and conclusion generation (entailment) are performed alternately. The orange area details the reasoning process from s_t to s_{t+1}. For policy optimization, the reward module assigns rewards, and the policy and critic are updated based on the tree or graph structure.
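Below is a minimal sketch of this alternating rollout loop, assuming hypothetical `policy.generate_action` and `entailment_module.generate_conclusion` helpers (these are illustrative stand-ins, not the repository's actual APIs):

```python
# Illustrative sketch of the alternating rollout loop; the objects and methods
# used here are hypothetical stand-ins, not the repository's actual interfaces.

def rollout(policy, entailment_module, state, max_steps=10):
    """Alternate action generation (Policy) and conclusion generation (entailment)."""
    trajectory = []
    for _ in range(max_steps):
        # The policy selects which premises to combine, e.g. "sent1 & sent3 -> int1".
        action = policy.generate_action(state)
        if action is None:  # the policy signals that the proof is complete
            break
        # The entailment module writes the intermediate conclusion for the chosen premises.
        conclusion = entailment_module.generate_conclusion(action.premises)
        trajectory.append((state, action, conclusion))
        state = state.update(action, conclusion)  # move from s_t to s_{t+1}
    return trajectory
```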
Bold and underlined text highlight the best method and the runner-up, respectively. RLET is based on DeBERTa-large, while all other methods are based on T5-large. All baseline results come from the published papers. We use the GPT-4-1106-preview version for GPT-4.
Table 1: Experiment results on EntailmentBank
An illustration of the reward and alignment process of SEER. Each reasoning step represents a subtree.
(1) T_pred is constructed using the last intermediate conclusion (i4 in this example) as the hypothesis.
(2) The Jaccard similarity between the intermediate nodes (i*) in T_pred and each golden intermediate node in T_gold (î1 and h in this example) is calculated, and alignment is performed based on the maximum Jaccard similarity. In this example, i1 is aligned with î1 because JS(i1, î1) = 1; i2 is aligned with "NULL"; i4 is aligned with î1 because JS(i4, î1) = 0.5 and JS(i4, h) = 0.4.
(3) Rewards are assigned based on the alignment results. Note that i3 (s3) is a redundant step. r1 = 1, r2 = -1, r3 = -0.5, and r4 = -1. The reward for each state originates from the tree structure rather than the chained trajectory. Therefore, the return of each state should also follow the tree structure (or the graph structure in reasoning graphs) rather than the chained trajectory.
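To make the alignment concrete, here is a simplified sketch of the Jaccard-based alignment and reward assignment. The token-level `jaccard` helper, the dictionary-based node representation, and the plain +1/-1 rule are illustrative assumptions; the actual reward module additionally handles redundant steps (e.g. r3 = -0.5 above):

```python
# Simplified sketch of the alignment/reward idea; tokenization, reward values,
# and redundant-step handling in the repository may differ.

def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_and_reward(pred_nodes, gold_nodes):
    """Align each predicted intermediate node to the gold node with maximum Jaccard
    similarity (or to "NULL" if there is no overlap) and assign a simple reward."""
    alignment, rewards = {}, {}
    for name, text in pred_nodes.items():
        scores = {g: jaccard(text, g_text) for g, g_text in gold_nodes.items()}
        best_gold, best_js = max(scores.items(), key=lambda kv: kv[1])
        alignment[name] = best_gold if best_js > 0 else "NULL"
        rewards[name] = 1.0 if best_js == 1.0 else -1.0  # exact match vs. everything else
    return alignment, rewards
```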
An illustration of the equivalent trajectory and the definition of return.
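As a rough sketch, a tree-structured return can be computed by discounting along the tree (each step's return adds the discounted return of the step that consumes its conclusion) instead of along the rollout order. The `consumer` map and `gamma` value below are assumptions for illustration, not the repository's exact implementation:

```python
# Sketch of a return that follows the tree structure rather than the chained trajectory.

def tree_returns(rewards, consumer, gamma=0.99):
    """rewards:  {step: reward}
       consumer: {step: later step that uses this step's conclusion as a premise, or None}"""
    memo = {}

    def g(step):
        if step not in memo:
            parent = consumer.get(step)
            memo[step] = rewards[step] + (gamma * g(parent) if parent is not None else 0.0)
        return memo[step]

    return {step: g(step) for step in rewards}
```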
If our paper is helpful to your projects, please cite it and ⭐ this repo. Thanks!
@inproceedings{chen-etal-2024-seer,
title = "{SEER}: Facilitating Structured Reasoning and Explanation via Reinforcement Learning",
author = "Chen, Guoxin and
Tang, Kexin and
Yang, Chao and
Ye, Fuying and
Qiao, Yu and
Qian, Yiming",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.321",
doi = "10.18653/v1/2024.acl-long.321",
pages = "5901--5921",
}
If you have any questions, please raise an issue or contact us via 📧 Email: [email protected]