This repository contains an empirical analysis of cross-linguality in shared embedding spaces for 101 languages and 5000+ language pairs. Using the cross-lingual sentence models LaBSE and LASER, we investigate the factors that predict cross-lingual alignment and isomorphism between the embedding (sub)spaces of any two languages. We uncover significant effects of basic word order and morphological complexity, in addition to other variables.
The modules in /src/Data Generation can be used to generate features for the languages in the dataset, as well as the alignment and isomorphism metrics we use as dependent variables; generate_features.py performs both of these tasks. The main dataset we use is the superparallel Bible corpus from Christodouloupoulos and Steedman (2014). The UDHR multiparallel corpus and the Nunavut Hansard English-Inuktitut parallel corpus are used for supplemental experiments in our paper. The relevant portions and versions of these datasets are provided in this repo.
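To give a rough sense of what an alignment metric over a bitext can look like, here is a minimal sketch of nearest-neighbor retrieval accuracy using cosine similarity. It is not the implementation in generate_features.py, and it may differ from the metrics used in the paper. The function names (cosine, retrieval_accuracy) and the toy 3-dimensional vectors are hypothetical and stand in for real LaBSE or LASER sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target sentence
    (by cosine) sits at the same index, i.e. is its gold translation."""
    hits = 0
    for i, src in enumerate(src_vecs):
        sims = [cosine(src, tgt) for tgt in tgt_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(src_vecs)


# Made-up toy "embeddings" for three parallel sentences (assumption:
# row i of src_vecs and row i of tgt_vecs are translations of each other).
src_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt_vecs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.5, 0.6, 0.1]]

accuracy = retrieval_accuracy(src_vecs, tgt_vecs)
print(f"retrieval accuracy: {accuracy:.3f}")
```

In this toy example the third source sentence is closer to the second target than to its own translation, so two of the three sentences are retrieved correctly. At the scale of a real corpus, the pairwise loop would normally be replaced by a vectorized or indexed search, for example with Faiss.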
/src/Analysis contains notebooks for carrying out the statistical analysis (bible_bitexts_analysis.ipynb) and visualization (bible_bitexts_tsne.ipynb) portions of our project. If you just want to replicate the analyses without re-generating the data, you can use the data from /Bible experimental vars.
Some additional notebooks used in our experiments are provided in /Additional Notebooks. These contain what is essentially "rough draft" code, so we recommend using the files in /src/Data Generation instead when rerunning experiments.
Core libraries: Pandas, scikit-learn, PyTorch
Sentence embeddings: Sentence Transformers, laserembeddings
Statistical analysis & plotting: Seaborn, Pingouin, Matplotlib
Computational tools: Faiss, Gudhi
Typological vectors: lang2vec
Misc: tqdm
Please cite our paper if you use code or data from this repo:
@inproceedings{jones-etal-2021-massively,
    title = "A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space",
    author = "Jones, Alexander and
      Wang, William Yang and
      Mahowald, Kyle",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.471",
    pages = "5833--5847",
}