Source code for the EMNLP 2024 (Findings) paper Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach.
Create a virtual environment and install the requirements:
conda create -n mtxlsum python=3.10
conda activate mtxlsum
pip install torch==2.1.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
If you wish to run the evaluation, you will also need to install the multilingual ROUGE package available at https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring.
-
Download the data from the official repository and extract it to the
original_data
directory. -
Run the following command to cluster the test data (similarly for validation):
mkdir -p data
mkdir -p data/test
python aggregate.py --data_dir original_data/test --output_dir data/test --langs arabic chinese_simplified english french portuguese russian spanish
This will create a data
directory with multiple JSONL files. Each line corresponds to a cluster of documents and has the following format:
{
"num_docs": int,
"url0": str,
"lang0": str,
"text0": str,
"summary0": str,
"url1": str,
"lang1": str,
"text1": str,
"summary1": str,
...
}
E.g., to generate summaries for all English documents using NeutralRR using all the languages in each cluster as targets, run the following command:
python generate.py --source_lang=en --split=test --method=rerank --search_mode=dijkstra --num_candidates=8 --temperature=1.0 --top_k=50 --num_sampling_beams=5 --output=predictions_en.jsonl
This will create a predictions_en.jsonl
file where each line has the following format:
{
"source_url": str,
"summary_english": str,
"summary_spanish": str,
...
}
For other methods and options, run python generate.py --help
.
To evaluate the generated summaries, run the following command:
python evaluate.py --predictions=./predictions_en.jsonl --source_lang=en --split=test --output=predictions_en_eval.json
This will create a JSON file with the results of the evaluation for each target language according to several metrics.
@inproceedings{pernes-etal-2024-multi,
title = "Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach",
author = "Pernes, Diogo and
Correia, Gon{\c{c}}alo M. and
Mendes, Afonso",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.755",
pages = "12908--12924",
abstract = "Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.",
}