This repository contains the code for the paper "Investigating the translation capabilities of Large Language Models trained on parallel data only". The preprint is available on arXiv and the models are available on HuggingFace 🤗: Plume 32k, Plume 128k and Plume 256k.
In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methodologies predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce Plume (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.
Plume is the first LLM trained from scratch for Neural Machine Translation using only parallel Catalan-centric data. It is a language model with the same architecture as Gemma 2B. The model is trained for general translation tasks at the sentence level. Information about the training, architecture and interpretability of the model is described in the paper.
- Developed by: The Language Technologies Unit from Barcelona Supercomputing Center (BSC).
- Languages: Catalan, Spanish, French, Italian, Portuguese, Galician, German, English, and Basque.
- License: Apache License, Version 2.0
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "projecte-aina/Plume32k" # "projecte-aina/Plume128k" "projecte-aina/Plume256k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# language codes: spa_Latn (Spanish), cat_Latn (Catalan), eng_Latn (English), ita_Latn (Italian),
# eus_Latn (Basque), deu_Latn (German), por_Latn (Portuguese), glg_Latn (Galician), fra_Latn (French)
src_lang_code = 'spa_Latn'
tgt_lang_code = 'cat_Latn'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'
prompt = '<s> [{}] {} \n[{}]'.format(src_lang_code, sentence, tgt_lang_code)
input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)
input_length = input_ids.shape[1]
generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
# Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.
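The snippet above can be wrapped into a small helper for convenience. This is a minimal sketch that reuses the tokenizer and model loaded above; the translate function name is ours, not part of the repository:

def translate(sentence, src_lang_code, tgt_lang_code, max_length=200, num_beams=5):
    # Build the prompt in the format expected by Plume: "<s> [src] sentence \n[tgt]"
    prompt = '<s> [{}] {} \n[{}]'.format(src_lang_code, sentence, tgt_lang_code)
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    output_ids = model.generate(input_ids, max_length=max_length, num_beams=num_beams)
    # Keep only the newly generated tokens, i.e. the translation
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()

print(translate('Ayer se fue, tomó sus cosas y se puso a navegar.', 'spa_Latn', 'cat_Latn'))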
Install dependencies:
pip install -r requirements.txt
The following scripts will create the tokenizers. Note that the ./tokenizer/samplings/ folder must contain a .txt file for each language.
bash ./tokenizer/create_tokenizer_over_eus_deu_eng_1M.sh
bash ./tokenizer/create_tokenizer_equal_1M.sh
The following script will compute all the metrics used to evaluate the tokenizers and save them in ./tokenizer/assets/
bash ./tokenizer/compute_tokenizations.sh
Results are visualized in the following Jupyter notebook: ./tokenizer/Fertility_Plots.ipynb
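For reference, fertility (roughly, the average number of subword tokens produced per whitespace-separated word) can be estimated along the following lines. This is a minimal sketch, not the exact implementation in compute_tokenizations.sh, and the file path is illustrative:

from transformers import AutoTokenizer

def fertility(tokenizer, lines):
    # Fertility: total subword tokens divided by total whitespace-separated words
    n_tokens = sum(len(tokenizer.tokenize(line)) for line in lines)
    n_words = sum(len(line.split()) for line in lines)
    return n_tokens / n_words

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/Plume32k")
with open("./tokenizer/samplings/cat_Latn.txt") as f:  # illustrative file name
    lines = [line.strip() for line in f if line.strip()]
print(fertility(tokenizer, lines))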
The following script will execute model training using DeepSpeed (ZeRO stage 2). For training, we used 40 NVIDIA H100-64GB GPUs with full float32 precision. Note that some variables must be defined in the script, namely: HF_DATASETS_CACHE, HF_HOME, TOKENIZER_PATH, VOCAB_SIZE and DATASET_PATH. This code will automatically tokenize the data given an HF dataset.
bash ./training/parlam_distributed.sh
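The training examples follow the same prompt template as the inference snippet above, with the reference translation appended after the target language tag. The sketch below illustrates that format only; the exact preprocessing (including end-of-sequence handling) is implemented in the training code:

def build_training_example(src_text, tgt_text, src_lang_code, tgt_lang_code):
    # Same template as at inference time, with the reference translation appended
    return '<s> [{}] {} \n[{}] {}'.format(src_lang_code, src_text, tgt_lang_code, tgt_text)

print(build_training_example('Ayer se fue.', "Ahir se'n va anar.", 'spa_Latn', 'cat_Latn'))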
DeepSpeed checkpoints will be saved in the ./training/output/ folder and can be converted to HF checkpoints using the following script:
bash ./training/output/convert.sh
Converting DeepSpeed checkpoints is required to run the remaining experiments.
The following script is used to translate the Flores-200 dataset using the trained model. Some variables must be defined, specifically: checkpoint, name, vocab_size, and model_dir. For inference, we use beam search with a beam size of 5 and limit generation to 512 tokens.
bash ./inference/experiments.sh
Translations will be saved in the ./inference/translations/ folder.
To run the experiments outlined in section 4.2 of the paper, which evaluate the models without indicating the source language tag, use the provided script:
bash ./inference/experiments_ignore_src.sh
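The only change in this setting is that the source language tag is omitted from the prompt. As an illustration of the idea, reusing the variables from the inference example above (the exact prompt is defined inside the script):

# Prompt without the source language tag (section 4.2 setting); illustration only
prompt_no_src = '<s> {} \n[{}]'.format(sentence, tgt_lang_code)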
To compute coverage, we provide the following script, which computes the coverage of each head on the Flores-200 dataset. Some variables must be defined: checkpoint, name, and model_dir.
bash ./attention_analysis/run_experiments.sh
This code will save the coverage matrices as .npy files, which are then visualized in the following Jupyter notebook: ./attention_analysis/Att_metrics_Plots.ipynb
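For orientation, per-head attention matrices can be extracted with Hugging Face transformers as sketched below, reusing the model, tokenizer and prompt from the inference example; the coverage metric itself is the one implemented in ./attention_analysis/run_experiments.sh, not the crude summary shown here:

import torch

inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
# out.attentions is a tuple with one tensor per layer, each of shape (batch, heads, seq, seq)
attn = torch.stack(out.attentions)[:, 0]   # (layers, heads, seq, seq)
# Attention distribution of the last prompt position over all previous positions, per head;
# summing it over a chosen span (e.g. the source sentence) gives a coverage-style statistic
last_pos_attn = attn[:, :, -1, :]
print(last_pos_attn.shape)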
We also provide the following script to plot the attention matrices for the first sentence of the Flores-200 dataset:
bash ./attention_analysis/get_att_matrix.sh
The resulting plots will be saved in the corresponding folder inside ./results.
We provide the code to run the heads-masking experiments from section 4.2 of the paper. Note that to mask heads we must first compute the coverage metrics as detailed in the previous section, and the variable ATT_ANALYSIS_FULL_PATH must be set accordingly (./attention_analysis/results).
bash ./heads_masking/experiments.sh
This code will save the corresponding plots and translation examples in the ./heads_masking/results folder. We provide a Jupyter notebook to visualize the results: ./heads_masking/Heads_Masking_Plots.ipynb
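As an illustration of how individual heads can be masked in a Hugging Face causal LM, the sketch below zeroes the selected heads' contribution with a forward pre-hook. This is a generic sketch, not the repository's exact implementation, and it assumes the Gemma-style module layout model.model.layers[i].self_attn.o_proj:

import torch

def mask_heads(model, layer_idx, head_ids, head_dim):
    # Zero the selected heads' slice of the tensor that feeds the attention output
    # projection (o_proj), so their contribution to the residual stream is removed.
    o_proj = model.model.layers[layer_idx].self_attn.o_proj

    def pre_hook(module, args):
        hidden = args[0].clone()
        for h in head_ids:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)

    return o_proj.register_forward_pre_hook(pre_hook)

# Example: mask heads 0 and 3 in layer 5; head_dim is read from the config if present
head_dim = getattr(model.config, 'head_dim',
                   model.config.hidden_size // model.config.num_attention_heads)
handle = mask_heads(model, layer_idx=5, head_ids=[0, 3], head_dim=head_dim)
# ... run generation / evaluation with the masked heads ...
handle.remove()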
We provide the scripts to compute the distances between layers. First, we extract the model's representations using the following script. Note that some variables must be pre-defined: model_dir and name_model.
bash ./representation_space/extract_representations.sh
This will save the extracted token representations for each language as numpy files in the ./representation_space/results folder. Then, to compute distances, we provide the following script:
bash ./representation_space/compute_distances.sh
Pairwise distances will be saved in the corresponding folder inside ./representation_space/results. We provide a Jupyter notebook to visualize the computed distances: ./representation_space/Distances_Plots.ipynb
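As an illustration of the kind of computation involved (the exact metric is the one in compute_distances.sh, and the file names below are placeholders), pairwise distances between the saved per-language representations can be obtained along these lines:

import numpy as np

# Representations saved by extract_representations.sh (placeholder file names)
reps_a = np.load('./representation_space/results/spa_Latn.npy')  # (n_sentences, hidden_size)
reps_b = np.load('./representation_space/results/cat_Latn.npy')

def mean_cosine_distance(a, b):
    # Average cosine distance between aligned rows of the two matrices
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

print(mean_cosine_distance(reps_a, reps_b))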
To visualize token embeddings as done in section 4.3 of the paper, we provide the following script, which computes 2D UMAP projections and Spherical Voronoi Diagrams for the token representations:
bash ./representation_space/compute_umap.sh
The results will be saved in the corresponding folder inside ./representation_space/results. We provide a Jupyter notebook to visualize the computed UMAP latent variables, ./representation_space/UMAP_Plots.ipynb, and another Jupyter notebook to create the Spherical Voronoi Diagrams, ./representation_space/Voronoi_Plots.ipynb.
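For orientation, a 2D UMAP projection of saved token representations can be computed with the umap-learn package as sketched below; the file names are placeholders and the repository's exact UMAP settings live in compute_umap.sh:

import numpy as np
import umap  # pip install umap-learn

# Token representations saved earlier (placeholder file name)
embeddings = np.load('./representation_space/results/token_embeddings.npy')

# Project to two dimensions with cosine distance (settings are illustrative)
projection = umap.UMAP(n_components=2, metric='cosine').fit_transform(embeddings)
np.save('./representation_space/results/umap_2d.npy', projection)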
In addition, we provide a zip file with some Spherical Voronoi Diagrams that have already been computed for the 32k model variant: ./representation_space/voronoi_plots_32k.zip
We provide the script to compute the vocabulary overlap between pairs of languages:
bash ./voc_overlap/compute_overlapping.sh
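As a rough illustration of the metric (the exact computation is in compute_overlapping.sh, and the file names are placeholders), the overlap between the subword tokens observed in two languages can be measured as a Jaccard index:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/Plume32k")

def observed_vocab(path):
    # Set of subword tokens the tokenizer produces on one language's text file
    with open(path) as f:
        return {tok for line in f for tok in tokenizer.tokenize(line)}

vocab_a = observed_vocab('./tokenizer/samplings/spa_Latn.txt')
vocab_b = observed_vocab('./tokenizer/samplings/cat_Latn.txt')
print(len(vocab_a & vocab_b) / len(vocab_a | vocab_b))  # Jaccard overlap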
@misc{gilabert2024investigating,
title={Investigating the translation capabilities of Large Language Models trained on parallel data only},
author={Javier García Gilabert and Carlos Escolano and Aleix Sant Savall and Francesca De Luca Fornaciari and Audrey Mash and Xixian Liao and Maite Melero},
year={2024},
eprint={2406.09140},
archivePrefix={arXiv}
}
This work has been promoted and financed by the Government of Catalonia through the Aina project, and by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia, funded by the EU – NextGenerationEU, within the framework of the ILENIA project with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334, as well as by DeepR3 (TED2021-130295B-C32), funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR.
Please feel free to write to us with any questions at {javier.garcia1, carlos.escolano, aleix.santsavall, francesca.delucafornaciari, audrey.mash, xixian.liao, maite.melero}@bsc.es.