This repository contains the code and resources from the paper "Neural CRF Model for Sentence Alignment in Text Simplification" (ACL 2020). It is organized as follows:
- `aligner`: Code for neural CRF sentence aligner.
- `wiki-manual`: The Wiki-Manual dataset. The definitions of columns are: label, the index of simple sentence, the index of complex sentence, simple sentence, complex sentence.
- `wiki-auto`: The Wiki-Auto dataset.
- `annotation_tool`: The tool for in-house annotators to annotate the sentence alignment.
- `simplification`: Code for text simplification experiments.
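The five-column layout described for Wiki-Manual can be read with a few lines of Python. This is a minimal sketch, assuming the file is tab-separated with one sentence pair per line and no header row; `AlignmentRow` and `read_alignment_rows` are hypothetical names, and the labels, indices, and sentences in the sample are invented for illustration rather than taken from the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple


class AlignmentRow(NamedTuple):
    label: str             # alignment label as stored in the file
    simple_index: str      # index of the simple sentence
    complex_index: str     # index of the complex sentence
    simple_sentence: str
    complex_sentence: str


def read_alignment_rows(handle) -> Iterator[AlignmentRow]:
    """Yield one AlignmentRow per five-column line (tab separator is an assumption)."""
    # QUOTE_NONE leaves quotation marks inside sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        yield AlignmentRow(*fields)


# Invented sample data standing in for a real file.
sample = io.StringIO(
    'aligned\tS-0\tC-0\tThe cat sat.\tThe cat sat on the "mat".\n'
    "notAligned\tS-0\tC-1\tThe cat sat.\tDogs bark loudly.\n"
)
rows = list(read_alignment_rows(sample))
print(rows[0].label, rows[0].complex_sentence)
```

To read an actual file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) in place of the `StringIO` object, and adjust the delimiter if the files turn out to use a different separator.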
We have uploaded all fine-tuned BERT checkpoints to the Hugging Face Hub and provide sample code for using them.
- We released the checkpoints of the BERT model fine-tuned on the Newsela-Manual and Wiki-Manual datasets: `BERT_newsela` and `BERT_wiki`. They were trained using the Hugging Face implementation of the `BERT_base` architecture in the package `pytorch-transformers==1.1.0`.
- If you want to align other monolingual parallel data, please try the fine-tuned BERT models. They should be able to achieve competitive performance. The performance boost from adding the neural CRF model depends on the structure of the articles. We have some experience in designing the paragraph alignment algorithm and using the neural CRF model to align sentences, so feel free to contact us if you want to discuss this.
- We also released the code for our neural CRF sentence alignment model, which you can use to train your own model.
- To request the Newsela-Manual and Newsela-Auto datasets, please first obtain access to the Newsela corpus, then contact the authors.
- Please use Python 3 to run the code.
- We also have pre-processed Wikipedia data, alignments between complex and simple Wikipedia articles, and the original sentence and paragraph alignments between Wikipedia article pairs. Please contact us if you want to use that data.
- We also have the original sentence and paragraph alignments between the Newsela articles. Please contact us if you want to use that data.
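For readers who want to align other monolingual parallel data with a pairwise scorer such as one of the fine-tuned BERT models, the sketch below shows one simple way to turn sentence-pair scores into alignments. It is not the authors' pipeline: it has no paragraph alignment step and no neural CRF, and `align_sentences`, `overlap_score`, and the threshold value are hypothetical. The toy word-overlap scorer only stands in for a model that returns an alignment score for a sentence pair.

```python
from typing import Callable, List, Tuple


def align_sentences(
    simple: List[str],
    complex_: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, tune on held-out data
) -> List[Tuple[int, int, float]]:
    """Pair each simple sentence with its best-scoring complex sentence.

    Returns (simple_idx, complex_idx, score) triples whose score
    reaches the threshold.
    """
    pairs = []
    if not complex_:
        return pairs
    for i, s in enumerate(simple):
        scored = [(score(s, c), j) for j, c in enumerate(complex_)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


def overlap_score(a: str, b: str) -> float:
    """Toy stand-in for a model score: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


simple_doc = ["the cat sat on the mat", "it was warm"]
complex_doc = [
    "the weather outside was warm",
    "the cat sat quietly on the old mat",
]
alignments = align_sentences(simple_doc, complex_doc, overlap_score, threshold=0.3)
print([(i, j) for i, j, _ in alignments])
```

Because the scorer is passed in as a function, a wrapper around a fine-tuned classifier can replace `overlap_score` without changing `align_sentences`.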
Please cite the following paper if you use the above resources in your research:
@inproceedings{jiang2020neural,
title={Neural CRF Model for Sentence Alignment in Text Simplification},
author={Jiang, Chao and Maddela, Mounica and Lan, Wuwei and Zhong, Yang and Xu, Wei},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2020}
}