This repository contains a carefully and comprehensively organized list of papers, datasets, and leaderboards for the text-and-table HybridQA task. If you find any errors, please open an issue or a pull request.
Question answering over a single type of evidence, either text or tables, has been studied systematically; we call this the classic QA task. Each type of evidence has its own advantages: textual evidence is prevalent in daily communication, while tabular evidence presents numerical information in a well-organized form. However, heterogeneous data combining the two is increasingly common in real applications, particularly in fields that demand numerical reasoning, such as the financial and scientific domains. This task is known as Table-and-Text Hybrid Question Answering (HybridQA). Since HybridQA is still under-researched, this project summarizes its current development, including benchmarks and their published state-of-the-art results.
HybridQA is the first HybridQA benchmark and remains the largest cross-domain benchmark to date. Each question relies on a single table and multiple text passages. Each passage usually describes the information in a table cell, for example, the Wikipedia page hyperlinked from that cell. For each example, the benchmark provides the gold passages and table rows. All answers are spans in the evidence (called span-based answers) and require one or more hops across the heterogeneous data.
Model | Organization | Reference | Dev-EM | Dev-F1 | Test-EM | Test-F1 |
---|---|---|---|---|---|---|
UL-20B | - | Tay et al. (2022) | - | - | 61.0 | - |
MITQA | IBM & IIT | Kumar et al. (2021) | 65.5 | 72.7 | 64.3 | 71.9 |
RHGN | SEU | Yang et al. (2022) | 62.8 | 70.4 | 60.6 | 68.1 |
POINTR + MATE | - | Eisenschlos et al. (2021) | 63.3 | 70.8 | 62.7 | 70.0 |
POINTR + TAPAS | - | Eisenschlos et al. (2021) | 63.4 | 71.0 | 62.8 | 70.2 |
MuGER2 | JD AI | Wang et al. (2022) | 57.1 | 67.3 | 56.3 | 66.2 |
DocHopper | CMU | Sun et al. (2021) | 47.7 | 55.0 | 46.3 | 53.3 |
HYBRIDER | UCSB | Chen et al. (2020) | 43.5 | 50.6 | 42.2 | 49.9 |
HYBRIDER-Large | UCSB | Chen et al. (2020) | 44.0 | 50.7 | 43.8 | 50.6 |
Unsupervised-QG | NUS&UCSB | Pan et al. (2020) | 25.7 | 30.5 | - | - |
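The EM and F1 columns in these leaderboards are standard span-level metrics: exact match after answer normalization, and token-overlap F1. Below is a minimal Python sketch of how such metrics are commonly computed; the normalization is a simplified assumption, not the official HybridQA evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the 2008 Olympics", "2008 Olympics"))  # 1.0
print(round(f1("Beijing 2008", "2008 Olympics"), 2))      # 0.5
```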
To lower the difficulty of answering, HybridQA annotates the relevant evidence for each example as well as the links between texts and tables, which widens the gap with real-world applications. To be closer to practical applications, OTT-QA blends the textual and tabular evidence of all examples into a single corpus of more than five million items and removes the relation information between them; this setting is called the open-QA benchmark. The most challenging part of this benchmark is therefore retrieving the evidence for a question from millions of heterogeneous items, as in open-domain question answering. The questions and evidence of OTT-QA are built on top of HybridQA, and all its answers are likewise spans in the evidence.
Model | Organization | Reference | Dev-EM | Dev-F1 | Test-EM | Test-F1 |
---|---|---|---|---|---|---|
CORE | CMU + Microsoft Research | Ma et al. (2022) | 49.0 | 55.7 | 47.3 | 54.1 |
OTTeR | MSRA + Beihang | Huang et al. (2022) | 37.1 | 42.8 | 37.3 | 43.1 |
CARP | MSRA + Sun Yat-sen University | Zhong et al. (2021) | 33.2 | 38.6 | 32.5 | 38.5 |
Fusion+Cross-Reader | - | Chen et al. (2021) | 28.1 | 32.5 | 27.2 | 31.5 |
Dual Reader-Parser | Amazon | Li et al. (2021) | 15.8 | - | - | - |
BM25-HYBRIDER | UCSB | Chen et al. (2021) | 10.3 | 13.0 | 9.7 | 12.8 |
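Because OTT-QA provides no links between questions and evidence, systems must first retrieve candidate tables and passages from the mixed corpus; the BM25-HYBRIDER baseline above does this with sparse retrieval. The sketch below illustrates BM25 retrieval over a toy mixed corpus using the rank_bm25 package; the documents and their linearized format are invented for illustration only.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy mixed corpus of linearized tables and text passages (contents invented;
# the real OTT-QA corpus contains over five million items).
corpus = [
    "Table: Olympic Games | Year: 2008 | Host city: Beijing",
    "Passage: Beijing is the capital of the People's Republic of China.",
    "Table: FIFA World Cup | Year: 2014 | Host country: Brazil",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "Which city hosted the 2008 Olympic Games?"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_docs)  # highest-scoring table/passage strings for the query
```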
Generating some HybridQA answers requires numerical reasoning capability, which benchmarks with only span-based questions cannot evaluate. FinQA is a financial HybridQA benchmark whose questions involve many standard financial-analysis calculations. FinQA annotates each arithmetic answer with a program in a domain-specific language (DSL) consisting of mathematical and table operations, which reduces the difficulty of formula generation and makes the answers more interpretable.
Model | Organization | Reference | Dev-Execution Accuracy | Dev-Program Accuracy | Test-Execution Accuracy | Test-Program Accuracy |
---|---|---|---|---|---|---|
PoT-SC (code-davinci-002) | University of Waterloo | Chen et al. | - | - | 68.1 | - |
APOLLO | MSRA + Xiamen University | Sun et al. | 69.79 | 65.91 | 67.99 | 65.60 |
ELASTIC | Strath | Zhang et al. (2022) | - | - | 68.96 | 65.21 |
DyRRen | Nanjing University | Li et al. (2022) | 66.82 | 63.87 | 63.30 | 61.29 |
ReasonFuse | CAS | Xia et al. (2022) | 61.84 | 59.80 | 60.68 | 58.94 |
FinQANet | UCSB | Chen et al. (2021) | 61.22 | 58.05 | 61.24 | 58.86 |
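FinQA programs are written as a sequence of operations in which later steps can refer back to earlier results. The toy executor below follows that spirit; the operation set is a small assumed subset of the DSL and the `#k` back-reference parsing is simplified, so treat it as an illustration rather than the official executor.

```python
# Toy executor for FinQA-style programs: a comma-separated sequence of
# operations where "#k" refers to the result of step k. The operation set
# is an assumed subset of the DSL, for illustration only.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program: str) -> float:
    results = []
    for step in program.split("),"):
        step = step.strip().rstrip(")")
        op, args = step.split("(", 1)
        values = []
        for arg in args.split(","):
            arg = arg.strip()
            if arg.startswith("#"):
                values.append(results[int(arg[1:])])  # back-reference to earlier step
            else:
                values.append(float(arg))
        results.append(OPS[op](*values))
    return results[-1]

# E.g., relative change: (5829 - 5735) / 5735
print(round(run_program("subtract(5829, 5735), divide(#0, 5735)"), 4))  # 0.0164
```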
Although FinQA presents well-annotated numerical reasoning questions, it ignores questions with span-based answers.
Similar to the classic QA benchmark DROP, TAT-QA is a collection of financial HybridQA examples that includes questions with both span-based and arithmetic answers.
Additionally, unlike the benchmarks mentioned above, each TAT-QA question is typically associated with only five texts, which lowers the difficulty of retrieval.
Like FinQA, TAT-QA also provides the formulas for its arithmetic questions.
Model | Organization | Reference | Dev-EM | Dev-F1 | Test-EM | Test-F1 |
---|---|---|---|---|---|---|
AeNER | HSE | Yarullin et al. | - | - | 75.0 | 83.2 |
RegHNT | CAS | Lei et al. | 73.6 | 81.3 | 70.3 | 78.0 |
UniRPG | Harbin Institute of Technology + JD AI Research | Zhou et al. (2022) | 70.2 | 77.9 | 67.1 | 76.0 |
PoT-SC (code-davinci-002) | University of Waterloo | Chen et al. | 70.2 | - | - | - |
UniPCQA | CUHK | Deng et al. (2022) | 68.2 | 75.5 | 63.9 | 72.2 |
MHST | NUS | Zhu et al. (2022) | 68.2 | 76.8 | 63.6 | 72.7 |
GANO | National Institute of Advanced Industrial Science and Technology | Nararatwong et al. (2022) | 68.4 | 77.8 | 62.1 | 71.6 |
FinMath | Northeastern University | Li et al. (2022) | 60.5 | 66.3 | 58.6 | 64.1 |
TagOp | NUS | Zhu et al. (2021) | 55.2 | 62.7 | 50.1 | 58.0 |
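Because TAT-QA mixes span-based and arithmetic answers, a system typically branches on the answer type and, for arithmetic questions, evaluates a derivation built from numbers in the table and text. The sketch below shows this routing with an invented example record; the field names are assumptions, not the official TAT-QA schema.

```python
import ast
import operator

# Hypothetical, simplified TAT-QA-style record: the field names and values
# are invented for illustration.
example = {
    "question": "What was the percentage change in revenue?",
    "answer_type": "arithmetic",      # vs. "span"
    "derivation": "(2500 - 2000) / 2000 * 100",
    "scale": "percent",
}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_derivation(expr: str) -> float:
    """Safely evaluate a simple arithmetic derivation string (+, -, *, /)."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

if example["answer_type"] == "arithmetic":
    print(eval_derivation(example["derivation"]))  # 25.0
```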
Hierarchical tables, which contain multi-level headers, are common in the real world but are hard for models to represent and understand because of their complex structure. However, almost all tables in the previous benchmarks have flat structures without multi-level headers. To address this challenge, MultiHiertt collects and annotates many hierarchical tables paired with questions.
Model | Organization | Reference | Dev-EM | Dev-F1 | Test-EM | Test-F1 |
---|---|---|---|---|---|---|
NAPG | Zhengzhou University + Peng Cheng Lab | Zhang et al. | - | - | 44.19 | 44.81 |
MT2Net | Yale | Zhao et al. | 37.05 | 39.96 | 36.22 | 38.43 |
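A hierarchical table can be represented with multi-level column headers, and one common way to feed it to a text model is to linearize each cell together with its full header path. The pandas sketch below illustrates this with invented values; it is not data from MultiHiertt itself.

```python
import pandas as pd

# A toy hierarchical table with two header levels (values are invented).
columns = pd.MultiIndex.from_tuples(
    [("2021", "Revenue"), ("2021", "Net income"),
     ("2022", "Revenue"), ("2022", "Net income")],
    names=["Year", "Metric"],
)
table = pd.DataFrame(
    [[120.0, 15.0, 140.0, 18.0]],
    index=["Segment A"],
    columns=columns,
)

# Linearize each cell into a "row header | column header path | value" fact,
# one common way to serialize hierarchical tables for a text model.
facts = [
    f"{row} | {' / '.join(col)} | {table.loc[row, col]}"
    for row in table.index
    for col in table.columns
]
print("\n".join(facts))
```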
GeoTSQA is the first scenario-based question-answering benchmark with hybrid evidence; it requires retrieving and integrating knowledge from multiple sources and applying general knowledge to the specific case described by the scenario. The benchmark is constructed from multiple-choice questions in the geography domain taken from Chinese high-school exams. Besides tables and texts, each question is also provided with four options, from which the model should select one as the answer.
Model | Organization | Reference | Accuracy |
---|---|---|---|
TTGen | Nanjing University | Li et al. | 39.7 |
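Since each GeoTSQA question comes with four options, a system only needs to score every option against the scenario, question, and retrieved evidence and pick the highest-scoring one. The sketch below shows this selection loop with a simple word-overlap scorer standing in for a real model; the inputs are placeholders, not GeoTSQA data.

```python
# Multiple-choice selection for a GeoTSQA-style question. The word-overlap
# scorer is only a stand-in for a real model, and the inputs are placeholders.
def overlap_score(context: str, option: str) -> float:
    context_tokens = set(context.lower().split())
    option_tokens = set(option.lower().split())
    return len(context_tokens & option_tokens) / max(len(option_tokens), 1)

def select_option(scenario: str, question: str, options: list) -> int:
    """Return the index of the highest-scoring option."""
    context = scenario + " " + question
    scores = [overlap_score(context, option) for option in options]
    return max(range(len(options)), key=lambda i: scores[i])

options = [
    "Option A (placeholder)",
    "Option B (placeholder)",
    "Option C (placeholder)",
    "Option D (placeholder)",
]
print(select_option("scenario description (placeholder)",
                    "question text (placeholder)", options))
```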