CIRAL (Cross-Lingual Information Retrieval for African Languages) is a test collection that promotes the research and evaluation of Cross-Lingual Information Retrieval (CLIR) for African languages. The collection covers cross-lingual retrieval between English and four African languages, with queries in English and passages in the African languages. This repo provides details of the test collection, guidelines for system evaluation, and baselines.
Hosted as a track at the Forum for Information Retrieval Evaluation (FIRE) 2023, CIRAL aimed to promote participation and community evaluations in CLIR for African languages. More information regarding the track can be found on the website: Ciral@Fire2023.
The current languages in CIRAL are Hausa, Swahili, Somali, and Yoruba. The corpora consist of passages from news articles mined from indigenous websites in each language.
Link to Dataset: https://huggingface.co/datasets/CIRAL/ciral-corpus/
Statistics and details of the collection are found below.
| Language | News Sources | # of Passages | # of Articles | Link |
|---|---|---|---|---|
| Hausa (hau) | LegitNG, DailyTrust, VOA, Isyaku, etc. | 715,355 | 240,883 | 🤗 |
| Somali (som) | VOA, Risaala, Caasimada, etc. | 827,552 | 629,441 | 🤗 |
| Swahili (swa) | VOA, Tuko, UN Swahili, MTanzania, etc. | 949,013 | 146,669 | 🤗 |
| Yoruba (yor) | Alaroye, VON, BBC, Asejere, etc. | 82,095 | 27,985 | 🤗 |
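For convenience, the corpus can also be pulled with the Hugging Face `datasets` library. A minimal sketch, assuming a per-language configuration name and a `train` split (check the dataset card for the exact names):

```python
# A minimal sketch of loading the corpus with Hugging Face datasets; the
# configuration and split names are assumptions: check the dataset card.
from datasets import load_dataset

corpus = load_dataset("CIRAL/ciral-corpus", "hausa", split="train")  # hypothetical config/split
print(corpus[0])
```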
For each language, passages are stored in JSONL files where each line corresponds to a passage in JSON format. The fields provided for each passage are:
- `docid`: unique identifier of the passage
- `title`: title of the news article from which the passage was extracted
- `text`: text of the passage
- `url`: URL of the news article
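A minimal sketch for iterating over one of these files, assuming a hypothetical local filename:

```python
import json

# Stream passages from a corpus file; the filename is a hypothetical example.
with open("ciral-hausa-passages.jsonl", encoding="utf-8") as f:
    for line in f:
        passage = json.loads(line)
        print(passage["docid"], passage["title"], passage["url"])
        break  # peek at the first passage only
```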
CIRAL's queries and relevance judgements are provided for the four languages in three sets: a development set, test set A, and test set B. Judgements for test set A include pools curated during CIRAL's shared task. The queries and relevance judgement files can be accessed in the Hugging Face repo.
Statistics for the queries and relevance judgements are given below. The development set provides a small number of samples for analyzing relevance and for evaluating proposed systems against the provided judgements.
| Language | Dev #Q | Dev #J | Test A #Q | Test A #J | Test A Pool Size | Test B #Q | Test B #J | Link |
|---|---|---|---|---|---|---|---|---|
| Hausa (ha) | 10 | 165 | 80 | 1,447 | 7,288 | 312 | 5,930 | 🤗 |
| Somali (so) | 10 | 187 | 99 | 1,798 | 9,094 | 239 | 4,324 | 🤗 |
| Swahili (sw) | 10 | 196 | 85 | 1,656 | 8,079 | 113 | 2,175 | 🤗 |
| Yoruba (yo) | 10 | 185 | 100 | 1,921 | 8,311 | 554 | 10,569 | 🤗 |
Both the query and relevance judgement files are in `.tsv` format. Each line in a query file is represented as `qid\tquery`, while the judgements follow the standard TREC qrels format: `qid Q0 docid relevance`.
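A minimal parsing sketch for both file types, assuming hypothetical filenames:

```python
from collections import defaultdict

# Load queries: one qid<TAB>query pair per line; the filename is hypothetical.
queries = {}
with open("queries.dev.tsv", encoding="utf-8") as f:
    for line in f:
        qid, query = line.rstrip("\n").split("\t")
        queries[qid] = query

# Load TREC-format judgements (qid Q0 docid relevance); filename is hypothetical.
qrels = defaultdict(dict)
with open("qrels.dev.tsv", encoding="utf-8") as f:
    for line in f:
        qid, _q0, docid, relevance = line.split()
        qrels[qid][docid] = int(relevance)
```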
This task entails queries formulated as natural-language questions in English, with retrieval carried out at the passage level in the different African languages. Information retrieval systems developed for the task receive the collection of passages and a set of queries for each African language. For each query in the test set, proposed systems are to return a ranked list of passages ordered by likelihood of binary relevance to the query. Up to 1000 passages per query can be submitted; results with more than 1000 passages are truncated.
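Runs of this form are conventionally written in the standard TREC run format (`qid Q0 docid rank score tag`); that format is an assumption here, so confirm it against the track's submission instructions. A minimal sketch that enforces the 1000-passage cap, with `search` as a hypothetical stand-in for a retrieval system:

```python
# Write a run file in standard TREC format, truncating each ranked list to
# 1000 passages; `search` is a hypothetical function returning (docid, score)
# pairs sorted best-first.
def write_run(queries, search, path, tag="my-clir-system"):
    with open(path, "w", encoding="utf-8") as out:
        for qid, query in queries.items():
            for rank, (docid, score) in enumerate(search(query)[:1000], start=1):
                out.write(f"{qid} Q0 {docid} {rank} {score} {tag}\n")
```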
Details regarding participation can be found in this section of the website.
For more details on getting started with IR and understanding the task, please check the provided Quick Start.
Baselines and reproduction guides are provided in this section. Please note that this only covers searching, as the indexes have already been built.
The baselines can be reproduced using Pyserini as follows:
- Install the development version of Pyserini by following this guide.
- Follow the commands in the two-click reproduction (2CR) guide; a minimal search sketch is given below.
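As an illustration of the search step, here is a minimal sketch using Pyserini's Python API; the prebuilt index name is an assumption, so check the 2CR guide for the exact identifiers:

```python
# A minimal BM25 search sketch with Pyserini; the prebuilt index name below
# is an assumption: consult the 2CR guide for the exact identifiers.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("ciral-v1.0-ha")  # hypothetical name
hits = searcher.search("query text here", k=1000)
for rank, hit in enumerate(hits[:10], start=1):
    print(f"{rank:2} {hit.docid} {hit.score:.4f}")
```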
Our reranking baseline models are also available on Hugging Face: mT5, AfrimT5.
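For a sense of how such rerankers are applied, below is a minimal monoT5-style scoring sketch with Hugging Face transformers. The checkpoint name, the prompt template, and the "yes"/"no" relevance tokens are all assumptions based on the common monoT5 recipe, not confirmed details of the CIRAL baselines; substitute the actual mT5/AfrimT5 checkpoints linked above and their documented settings.

```python
# A minimal monoT5-style reranking sketch; checkpoint, prompt, and relevance
# tokens are assumptions: use the settings of the actual baseline models.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "unicamp-dl/mt5-base-mmarco-v2"  # hypothetical stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()


def rerank(query, passages):
    """Score each passage against the query and return passages best-first."""
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    scored = []
    for passage in passages:
        prompt = f"Query: {query} Document: {passage} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            out = model.generate(
                **inputs, max_new_tokens=1,
                output_scores=True, return_dict_in_generate=True,
            )
        logits = out.scores[0][0]  # vocabulary logits for the first decoded token
        # Relevance score: log-probability of "yes" among {"yes", "no"}.
        score = torch.log_softmax(logits[[yes_id, no_id]], dim=0)[0].item()
        scored.append((passage, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```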
@inproceedings{10.1145/3626772.3657884,
author = {Adeyemi, Mofetoluwa and Oladipo, Akintunde and Zhang, Xinyu and Alfonso-Hermelo, David and Rezagholizadeh, Mehdi and Chen, Boxing and Omotayo, Abdul-Hakeem and Abdulmumin, Idris and Etori, Naome A. and Musa, Toyib Babatunde and Fanijo, Samuel and Awoyomi, Oluwabusayo Olufunke and Salahudeen, Saheed Abdullahi and Mohammed, Labaran Adamu and Abolade, Daud Olamide and Lawan, Falalu Ibrahim and Sabo Abubakar, Maryam and Nasir Iro, Ruqayya and Imam Abubakar, Amina and Mohamed, Shafie Abdi and Mohamed, Hanad Mohamud and Ajayi, Tunde Oluwaseyi and Lin, Jimmy},
title = {CIRAL: A Test Collection for CLIR Evaluations in African Languages},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657884},
doi = {10.1145/3626772.3657884},
pages = {293–302},
numpages = {10},
keywords = {african languages, cross-lingual information retrieval},
location = {Washington DC, USA},
series = {SIGIR '24}
}