Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction

Code for reproducing the paper "Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction", published at the 7th Workshop on Noisy User-generated Text (W-NUT), held at EMNLP 2021.

Abstract

We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37% average relative improvement in F1 across target languages) and sentiment classification (12% relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7% relative improvement in accuracy). Our results are promising given the lack of social media bitext corpus.
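The improvements above are reported as relative, not absolute, gains. As a minimal sketch of that arithmetic, assuming the usual definition (improved minus baseline, divided by baseline), the helper below computes it. The function name and the example scores are hypothetical and are not taken from the paper.

```python
import math


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline` as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Hypothetical scores, for illustration only (not results from the paper).
baseline_f1 = 0.40
tpp_f1 = 0.50
gain = relative_improvement(baseline_f1, tpp_f1)
print(f"relative improvement: {gain:.1%}")
```

Under this definition, a 10-point absolute gain over a baseline of 0.40 is a 25% relative gain, which is why relative figures can look large when the zero-shot baseline is low.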

Citation

Please cite as:

Mishra, S., & Haghighi, A. (2021). Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction. Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021). arXiv:2110.10318

@inproceedings{mishra-haghighi-2021-improved,
   title = "Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction",
   author = "Mishra, Shubhanshu  and
     Haghighi, Aria",
   booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
   month = nov,
   year = "2021",
   address = "Online",
   publisher = "Association for Computational Linguistics",
   url = "https://aclanthology.org/2021.wnut-1.42",
   pages = "381--388",
   eprint={2110.10318},    
   abstract = "We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37{\%} average relative improvement in F1 across target languages) and sentiment classification (12{\%} relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7{\%} relative improvement in accuracy). Our results are promising given the lack of social media bitext corpus. Our code can be found at: https://github.com/twitter-research/multilingual-alignment-tpp.",
}
@inproceedings{mishra2021tpp,
 title={Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction},
 author={Mishra, Shubhanshu and Haghighi, Aria},
 booktitle={Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021)},
 year={2021},
 address={Online},
 publisher={Association for Computational Linguistics},
 pages={1--8},
 eprint={2110.10318},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}

Reproducibility

The following steps reproduce the experiments in the paper:

1. Run mBERT finetuning.
2. Fine-tune on a specific task (NER, POS, Sentiment).

Both steps can be run via the files in ./notebooks/.

More details are in the paper.
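The abstract describes translation pair prediction (TPP) as predicting whether a pair of cross-lingual texts is a valid translation. A minimal sketch of how labeled examples for such a task could be built is shown below. It is not the code from ./notebooks/ or ./src: the function name `build_tpp_examples` is hypothetical, and the negative-sampling scheme (pairing each source text with the target of a different, randomly chosen pair) is an assumption that may differ from what the paper does.

```python
import random
from typing import List, Tuple

TextPair = Tuple[str, str]
LabeledPair = Tuple[str, str, int]


def build_tpp_examples(pairs: List[TextPair], seed: int = 0) -> List[LabeledPair]:
    """Turn (source, target) translation pairs into labeled TPP examples.

    Each input pair yields one positive example (label 1) and, when more
    than one pair is available, one negative example (label 0) that reuses
    the source text with the target text of a different pair.
    """
    rng = random.Random(seed)
    n = len(pairs)
    examples: List[LabeledPair] = []
    for i, (src, tgt) in enumerate(pairs):
        examples.append((src, tgt, 1))
        if n > 1:
            # Draw an index uniformly from all positions except i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            examples.append((src, pairs[j][1], 0))
    return examples


# Illustrative data only.
toy_pairs = [
    ("good morning", "शुभ प्रभात"),
    ("thank you", "धन्यवाद"),
    ("see you tomorrow", "कल मिलते हैं"),
]
for src, tgt, label in build_tpp_examples(toy_pairs):
    print(label, src, "|", tgt)
```

Each labeled pair could then be fed to a sentence-pair classifier. Note that this sampling only guarantees a different pair index, so corpora with duplicate target texts can produce mislabeled negatives.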

Datasets

We provide example formats of the datasets in the ./data folder. The NER data for English, Arabic, and Japanese is internal. Details for processing the data can be found in the ./src folder.

Security Issues?

Please report sensitive security issues via Twitter's bug-bounty program (https://hackerone.com/twitter) rather than GitHub.
