Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.

This is joint work by Carolin Holtermann, Paul Röttger, Timm Dill, and Anne Lauscher. For further details, see our paper.

MultiQ on Hugging Face

Our silver standard benchmark is available on Hugging Face: see here

Getting Started

We conducted all our experiments with Python 3.10. Before getting started, make sure you install the requirements listed in the requirements.txt file.

pip install -r requirements.txt

Using GlotLID

For our experiments, we used the model_v2.bin version of the publicly available GlotLID model (see here). The GlotLID authors continue to update the model and extend its language coverage, so check the GlotLID repository for the latest version.
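As a rough illustration of how a GlotLID model can be used to check language fidelity (whether a response is in the prompted language), here is a minimal sketch. It assumes the `fasttext` package is installed and that you have downloaded a GlotLID model file yourself. The helper names (`parse_glotlid_label`, `load_glotlid_detector`, `language_fidelity`) are hypothetical and not part of this repository, and the exact computation used in the paper may differ; see the src folder for the actual code.

```python
from typing import Callable, Iterable, Tuple


def parse_glotlid_label(label: str) -> str:
    """Strip fastText's '__label__' prefix, e.g. '__label__eng_Latn' -> 'eng_Latn'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def load_glotlid_detector(model_path: str) -> Callable[[str], str]:
    """Return a function that maps a text to its top predicted language label.

    Assumption: `model_path` points to a locally downloaded GlotLID file
    such as model_v2.bin, and the `fasttext` package is available.
    """
    import fasttext  # imported lazily so the other helpers work without it

    model = fasttext.load_model(model_path)

    def detect(text: str) -> str:
        # fastText predicts one line at a time, so flatten newlines first.
        labels, _probs = model.predict(text.replace("\n", " "))
        return parse_glotlid_label(labels[0])

    return detect


def language_fidelity(
    pairs: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> float:
    """Fraction of (expected_language, response) pairs where the detected
    language of the response equals the expected language."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for expected, response in pairs if detect(response) == expected)
    return hits / len(pairs)
```

With a downloaded model, usage might look like `detect = load_glotlid_detector("model_v2.bin")` followed by `language_fidelity([("deu_Latn", "Hallo Welt")], detect)`. Passing the detector in as a function keeps the metric easy to test with a stub.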

Repository Description

This repository contains all code and data needed to reproduce the experiments and results reported in our paper. All data files are in the data folder and all code files are in the src folder; each folder has its own README file.

License

This project is licensed under the CC-BY-4.0 License; see the LICENSE file for details.


paul-rottger/multiq
