StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding
Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose an evaluation data generation method for assessing LLMs' ability to understand structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StructBench, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StructBench-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieves an accuracy of 65.0% on StructBench-Hard, while human accuracy reaches 95.7%. Moreover, while fine-tuning on StructBench can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types.
This repo mainly consists of:
- The StructBench dataset used in the paper
- The code used to generate the dataset
- 2024/10/12: A new version, StrucText-Eval, is released, with simpler and clearer data generation and evaluation code. We also release a subset of our evaluation benchmark that researchers can use directly to evaluate their LLMs.
- 2024/6/29: We released the code for customizing your own StructBench.
- 2024/6/19: We released the initial version of the dataset used in the paper.
- 2024/6/15: We released the first version of our paper.
StructBench covers eight types of structure-rich languages, including seven existing languages and one customized language:
StructBench contains eight different tasks, with two preset levels of difficulty. The statistics of the existing StructBench and the descriptions of the tasks are listed as follows:
conda create -n fcs python=3.10.9
conda activate fcs
pip install fire
Please edit the settings in generate_setting.json.
The descriptions of all the parameters are listed below:
{
  "#": { // Overall settings
    "output_dir": "", // the directory where the generated benchmark files will be written
    "few_shots": [] // the few-shot counts to include in your benchmark
  },
  "csv": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // not used for this format
  },
  "json": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  },
  "latex": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  },
  "markdown": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  },
  "org": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  },
  "tree": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // not used for this format
  },
  "xml": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  },
  "yaml": {
    "nodes": [], // the depth of the structure-rich data
    "n_ary_ratios": [1,2,3], // the width of each layer
    "para_len_ratios": [6] // the maximum number of characters in each item
  }
}
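As a concrete reference, the sketch below writes a minimal generate_setting.json programmatically. It assumes the actual settings file is plain JSON (the // annotations above are explanatory, not part of the file), and every value it fills in (output directory, depths, few-shot counts) is an illustrative placeholder rather than a recommended configuration.

```python
import json

# Illustrative values only; adjust depths, widths, and few-shot counts to
# control benchmark size and difficulty.
overall = {
    "output_dir": "./structbench_output",  # where generated benchmark files are written
    "few_shots": [0, 1, 3],                # few-shot counts to include
}

formats = ["csv", "json", "latex", "markdown", "org", "tree", "xml", "yaml"]

settings = {"#": overall}
for fmt in formats:
    settings[fmt] = {
        "nodes": [4, 6, 8],          # depths of the generated structures (placeholder)
        "n_ary_ratios": [1, 2, 3],   # width of each layer
        "para_len_ratios": [6],      # max characters per item (unused for csv/tree)
    }

with open("generate_setting.json", "w") as f:
    json.dump(settings, f, indent=2)
```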
Then use one of the following commands to generate the benchmark:
generate_dataset.sh
or
cd LLMStructure
python datagen.py
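After generation, the benchmark files are written to the output_dir configured above. The snippet below is a minimal sketch for inspecting the output; it assumes the generator emits JSON files under that directory, and the exact file layout and schema may differ.

```python
import json
from pathlib import Path

# Must match "output_dir" in generate_setting.json; the .json layout below is
# an assumption for illustration, not a guaranteed output format.
output_dir = Path("./structbench_output")

for path in sorted(output_dir.rglob("*.json")):
    with open(path) as f:
        data = json.load(f)
    count = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path}: {count} entries")
```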
@article{gu2024structbench,
title={StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding},
author={Gu, Zhouhong and Ye, Haoning and Zhou, Zeyang and Feng, Hongwei and Xiao, Yanghua},
journal={arXiv preprint arXiv:2406.10621},
year={2024}
}