M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
[2024/07/27] We released the dataset that supports up to 128K. We also released the code for constructing the test instances. See 5.2. Data Creation for more details.
[2024/05/16] Our paper is accepted to ACL 2024 main.
[2023/10/31] We released the M4LE paper.
- 1. Introduction
- 2. Leaderboard
- 3. Results
- 4. Setup
- 5. Data
- 6. Task
- 7. Inference
- 8. Evaluation
- Citation
π 1. Introduction [Back to Top]
M4LE is a Multi-ability, Multi-range, Multi-task, bilingual benchmark for long-context evaluation. We categorize long-context understanding into five distinct abilities by considering whether it is required to identify single or multiple spans in long contexts based on explicit or semantic hints. Specifically, these abilities are explicit single-span, semantic single-span, explicit multiple-span, semantic multiple-span, and global. Different from previous long-context benchmarks that simply compile from a set of existing long NLP benchmarks, we introduce an automated method to transform short-sequence tasks into a comprehensive long-sequence scenario encompassing all these capabilities.
M4LE consists of 36 tasks, covering 11 task types and 12 domains. For each task, we construct 200 instances for each context length bucket (1K, 2K, 4K, 6K, 8K, 12K, 16K, 24K, 32K). Due to computation and cost constraints, our paper evaluated 11 well-established LLMs on instances up to the 8K context length bucket. For more details, please refer to the paper.
π 2. Leaderboard [Back to Top]
For each task, the score is normalized by the score of GPT-3.5-Turbo-16K on the 1K context. The equation for normalization can be found in the paper. We use the average normalized score obtained from context length buckets of 1K, 2K, 4K, 6K, and 8K to compare different models.
Model | 1k | 2k | 4k | 6k | 8k | Avg. Score |
---|---|---|---|---|---|---|
GPT-3.5-Turbo-16K | 0.50 | 0.47 | 0.42 | 0.39 | 0.36 | 0.43 |
Vicuna-13B-v1.5-16K | 0.48 | 0.45 | 0.40 | 0.33 | 0.23 | 0.38 |
LongChat-7B-v1.5-32K | 0.41 | 0.37 | 0.33 | 0.27 | 0.24 | 0.32 |
LongChat-13B-16K | 0.41 | 0.37 | 0.33 | 0.25 | 0.21 | 0.31 |
ChatGLM2-6B-32K | 0.39 | 0.34 | 0.30 | 0.27 | 0.26 | 0.31 |
Vicuna-7B-v1.5-16K | 0.43 | 0.39 | 0.32 | 0.26 | 0.14 | 0.31 |
LLaMA2-13B-Chat | 0.44 | 0.37 | 0.28 | 0.23 | 0.21 | 0.31 |
LLaMA2-13B | 0.44 | 0.38 | 0.28 | 0.22 | 0.19 | 0.30 |
ChatGLM2-6B | 0.40 | 0.33 | 0.25 | 0.21 | 0.17 | 0.27 |
LLaMA2-7B-Chat | 0.41 | 0.34 | 0.25 | 0.13 | 0.11 | 0.25 |
LLaMA2-7B | 0.39 | 0.33 | 0.25 | 0.13 | 0.07 | 0.23 |
π 3. Results [Back to Top]
The normalized scores of various models in different context lengths (left), accompanied by the slopes of the corresponding best-fit lines (right).
π 4.οΈ Setup [Back to Top]
The following command sets up a conda environment for inference and evaluation.
conda env create --name m4le --file environment.yml
ποΈ 5. Data [Back to Top]
As described in the paper, we adopt different open-sourced datasets to construct 36 tasks in M4LE. Each task has a jsonl
file containing all the testing instances. You can load the data from the huggingface hub:
from datasets import load_dataset
tasks = [
"arxiv",
"bigpatent_global_cls",
"bigpatent_global_sum",
"booksum",
"c3",
"cepsum",
"clts+",
"cnewsum",
"cnnnews",
"drcd_explicit-single",
"drcd_semantic-single",
"duorc",
"dureader",
"hotpotqa",
"lcsts",
"marc",
"mnds-news_explicit-single",
"mnds-news_explicit-multiple",
"mnds-news_semantic-multiple",
"ncls",
"news-commentary-en2zh",
"news-commentary-zh2en",
"news2016",
"newsqa",
"nq-open",
"online-shopping",
"open-subtitles-en2zh",
"open-subtitles-zh2en",
"pubmed",
"tedtalks-en2zh",
"tedtalks-zh2en",
"thucnews_explicit-single",
"thucnews_explicit-multiple",
"thucnews_semantic-multiple",
"triviaqa",
"wiki2019zh",
"wikihow",
"wikitext-103",
"wow",
]
for task in tasks:
data = load_dataset('wckwan/M4LE', task, split='test')
Each testing instance follows this format:
{
"instruction": "<task description>",
"input": "<task input with one-shot example>",
"answers": ["<answer1>", "<answer2>"],
"input_length": <int, number of words in instruction and input separated by space>,
"total_length": <int, number of words in instruction, input and gold answer separated by space>,
"length_bucket": <int, the length bucket to which this instance belongs>
}
We also provide the scripts to construct the instances from the raw datasets.
First, download and preprocess the data. It downloads many source datasets which takes lots of time and space. It is possible to encounter network errors during the download process where you have to edit the script to resume where you left off.
./raw_data/preprocess.sh
If the preprocessing script is successful, you should have the following folder structure
raw_data/
βββ classification/
β βββ THUCNews/
β βββ arxiv-dataset/
β βββ marc/
β βββ online_shopping_10_cats/
βββ nli/
β βββ wiki_zh/
β βββ wikitext-2-raw/
βββ qa/
β βββ DRCD/
β βββ __MACOSX/
β β βββ newsqa-data-v1/
β βββ c3/
β βββ duorc/
β β βββ dataset/
β β βββ preprocessing/
β β βββ data/
β β βββ utils/
β βββ dureader/
β β βββ devset/
β β βββ evaluation_metric/
β β βββ testset/
β β βββ trainset/
β βββ hotpotqa/
β βββ natural_questions/
β βββ newsqa/
β β βββ stories/
β βββ triviaqa/
β βββ evidence/
β β βββ web/
β β βββ wikipedia/
β βββ qa/
β βββ triviaqa-unfiltered/
βββ summarization/
β βββ CEPSUM/
β βββ CNewSum_v2/
β β βββ final/
β βββ NCLS-Data/
β β βββ EN2ZHSUM/
β β βββ ZH2ENSUM/
β βββ QMSum/
β β βββ data/
β β β βββ ALL/
β β β βββ Academic/
β β β βββ Committee/
β β β βββ Product/
β β βββ extracted_span/
β β βββ figures/
β β βββ model_output/
β βββ arxiv-dataset/
β βββ bigPatentData/
β β βββ test/
β β βββ train/
β β βββ val/
β βββ clts/
β βββ cnn/
β β βββ stories/
β βββ gov-report/
β β βββ crs/
β β βββ gao/
β β βββ split_ids/
β βββ lcsts/
β βββ news2016zh/
β βββ pubmed-dataset/
β βββ wikihow/
βββ translation/
β βββ News-Commentary_v16/
β β βββ en/
β β β βββ News-Commentary/
β β β βββ xml/
β β β βββ en/
β β βββ zh/
β β βββ News-Commentary/
β β βββ xml/
β β βββ zh/
β βββ opensubtitles/
β β βββ OpenSubtitles/
β β βββ xml/
β β βββ en/
β β βββ zh_cn/
β βββ tedtalk/
βββ wow/
Construct the test instances with the following script. To customize the length buckets, edit the variable buckets
in create_data.sh
./data/create_data.sh
π 6. Task [Back to Top]
Ability | Task Name | Task Type | Language | Description |
---|---|---|---|---|
Explicit Single | mnds-news_explicit-single | CLS + RET | En | Classify a specified news article. |
Explicit Single | thucnews_explicit-single | CLS + RET | Zh | Classify a specified news article. |
Explicit Single | newsqa | QA + RET | En | Answer a question based on a specified news article. |
Explicit Single | c3 | QA + RET | Zh | Answer a multi-choice question based on a textbook extract. |
Explicit Single | wow | RET | En | Return the ID of the article related to a specified topic. |
Explicit Single | drcd_explicit-single | RET | Zh | Return the ID of the article related to a specified topic. |
Explicit Single | cnnnews | SUM + RET | En | Summarize a specified news article. |
Explicit Single | cepsum | SUM + RET | Zh | Summarize a specified product description. |
Explicit Single | lcsts | SUM + RET | Zh | Summarize a specified news article. |
Explicit Single | ncls | SUM + RET | En, Zh | Summarize a specified news article. |
Explicit Multiple | mnds-news_explicit-multiple | CLS + RET | En | Return the IDs of all the articles belong to a specified class. |
Explicit Multiple | thucnews_explicit-multiple | CLS + RET | Zh | Return the IDs of all the articles belong to a specified class. |
Explicit Multiple | marc | CLS + RET | En, Zh | Return the IDs of all the positive product reviews. |
Explicit Multiple | online-shopping | CLS + RET | Zh | Return the IDs of all the positive product reviews. |
Semantic Single | wikitext-103 | NLI + RET | En | Return the ID of the paragraph that continues a query paragraph. |
Semantic Single | wiki2019zh | NLI + RET | Zh | Return the ID of the paragraph that continues a query paragraph. |
Semantic Single | duorc | QA | En | Answer a question based on multiple movie plots. |
Semantic Single | nq-open | QA | En | Answer a question based on multiple wikipedia paragraphs. |
Semantic Single | dureader | QA | Zh | Answer a question based on multiple web snippets. |
Semantic Single | drcd_semantic-single | QA | Zh | Answer a question based on multiple wikipedia paragraphs. |
Semantic Single | wikihow | SUM + RET | En | Summarize an article based on a given topic. |
Semantic Single | news2016 | SUM + RET | Zh | Summarize a news article based on a given title. |
Semantic Single | tedtalks-en2zh/tedtalks-zh2en | TRAN + RET | En, Zh | Translate a Ted Talk transcript based on a given title. |
Semantic Multiple | mnds-news_semantic-multiple | CLS + CNT | En | Return the number of news articles belonging to a specified class. |
Semantic Multiple | thucnews_semantic-multiple | CLS + CNT | Zh | Return the number of news articles belonging to a specified class. |
Semantic Multiple | hotpotqa | QA | En | Answer a question based on multiple wikipedia paragraphs. |
Global | bigpatent_global_cls | CLS | En | Classify a patent document. |
Global | triviaqa | QA | En | Answer a question based on a web snippet. |
Global | arxiv | SUM | En | Summarize an academic paper. |
Global | bigpatent_global_sum | SUM | En | Summarize a patent document. |
Global | pubmed | SUM | En | Summarize a medical paper. |
Global | booksum | SUM | En | Summarize one or more chapters of a book. |
Global | cnewsum | SUM | Zh | Summarize a news article. |
Global | clts+ | SUM | Zh | Summarize a news article. |
Global | open-subtitles-en2zh/open-subtitles-zh2en | TRAN | En, Zh | Translate the movie subtitles. |
Global | news-commentary-en2zh/news-commentary-zh2en | TRAN | En, Zh | Translate the movie subtitles. |
π§ 7. Inference [Back to Top]
In the paper, we use the prompt format f"{instruction}\n{input}"
. We recommend editing or using our `inference.py`` code. To run the code, provide the HuggingFace model name and task name:
python inference.py \
--model_path lmsys/vicuna-13b-v1.5-16k \
--task nq-open
For LLaMA2 models, we recommend enabling dynamic NTK scaling:
python inference.py \
--model_path meta-llama/Llama-2-7b \
--task nq-open \
--load_model_args '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}'
Additional arguments:
--resume
: Specify if you wish to pick up from where you left off.--api_key
: The OpenAI key if you are using OpenAI models.
The predictions will be saved at outputs/{model}/{task}.jsonl
π 8. Evaluation [Back to Top]
Run the following script to evaluate:
python evaluation.py
It will evaluate all the models saved at outputs/{model}/
. The results will be saved at results/all_results.csv
, where each row contains
Model
: Model name.Task
: Task Name.Bucket
: The context length bucket.Score
: The evaluation score.
If you find our paper and resources useful, please consider citing our paper:
@misc{kwan_m4le_2023,
title = {{{M4LE}}: {{A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark}} for {{Large Language Models}}},
author = {Kwan, Wai-Chung and Zeng, Xingshan and Wang, Yufei and Sun, Yusen and Li, Liangyou and Shang, Lifeng and Liu, Qun and Wong, Kam-Fai},
year = {2023},
}