MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.
The website for the event can be found at miracl.ai.
This repo provides pointers to access the actual dataset.
For more details, check out our arXiv paper: Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages.
Connect with us!
- 💬 Mailing list
- 💬 Slack Workspace
- 📣 Twitter
The Wikipedia corpora used in MIRACL are available as a HuggingFace Dataset. So far, we have released corpora for the 16 "known languages"; the remaining 2 "surprise languages" will be revealed later!
- 🤗 = direct link to HuggingFace Dataset
- 📚 = link to raw wiki dumps

| Language | # of Passages | # of Articles | Links |
|---|---|---|---|
| Arabic (ar) | 2,061,414 | 656,982 | 🤗 📚 |
| Bengali (bn) | 297,265 | 63,762 | 🤗 📚 |
| English (en) | 32,893,221 | 5,758,285 | 🤗 📚 |
| Spanish (es) | 10,373,953 | 1,669,181 | 🤗 📚 |
| Persian (fa) | 2,207,172 | 857,827 | 🤗 📚 |
| Finnish (fi) | 1,883,509 | 447,815 | 🤗 📚 |
| French (fr) | 14,636,953 | 2,325,608 | 🤗 📚 |
| Hindi (hi) | 506,264 | 148,107 | 🤗 📚 |
| Indonesian (id) | 1,446,315 | 446,330 | 🤗 📚 |
| Japanese (ja) | 6,953,614 | 1,133,444 | 🤗 📚 |
| Korean (ko) | 1,486,752 | 437,373 | 🤗 📚 |
| Russian (ru) | 9,543,918 | 1,476,045 | 🤗 📚 |
| Swahili (sw) | 131,924 | 47,793 | 🤗 📚 |
| Telugu (te) | 518,079 | 66,353 | 🤗 📚 |
| Thai (th) | 542,166 | 128,179 | 🤗 📚 |
| Chinese (zh) | 4,934,368 | 1,246,389 | 🤗 📚 |

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc.
Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., `\n\n` in the wiki markup).
Each of these passages comprises a "document" or unit of retrieval.
We preserve the Wikipedia article title of each passage.
The corpus data files are in JSON lines format, compressed with `gzip`.
Each line in the file corresponds to a passage.
Consider an example from the English corpus:
{
"docid": "39#0",
"title": "Albedo",
"text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}
The `docid` has the schema `X#Y`, where all passages with the same `X` come from the same Wikipedia article, whereas `Y` denotes the passage within that article, numbered sequentially.
The `text` field contains the text of the passage.
The `title` field contains the name of the article the passage comes from.
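
As a minimal sketch of how such a file can be consumed, the Python snippet below reads a gzip-compressed JSON lines file and splits each `docid` into its article and passage parts. The helper names (`read_passages`, `split_docid`), the file name, and the second sample passage are hypothetical and exist only so the example runs without downloading the corpus.

```python
import gzip
import json
import os
import tempfile


def read_passages(path):
    """Yield one passage dict per non-empty line of a gzip-compressed JSON lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_docid(docid):
    """Split a docid of the form 'X#Y' into (article id X, passage index Y)."""
    article_id, passage_idx = docid.rsplit("#", 1)
    return article_id, int(passage_idx)


# Stand-in data: the first record abbreviates the example above,
# the second record is a made-up placeholder passage.
sample = [
    {"docid": "39#0", "title": "Albedo", "text": "Albedo (meaning 'whiteness') ..."},
    {"docid": "39#1", "title": "Albedo", "text": "(placeholder second passage)"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "docs-sample.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for row in sample:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Group passage indices by the article they come from.
by_article = {}
for passage in read_passages(tmp_path):
    article_id, passage_idx = split_docid(passage["docid"])
    by_article.setdefault(article_id, []).append(passage_idx)

print(by_article)
```

Reading line by line keeps memory use flat, which matters for the larger corpora such as English.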
Topics (= queries) and relevance judgments (= relevance labels) of the MIRACL training sets and development sets for each of the 16 known languages are available on HuggingFace Dataset!
🤗 = direct link to HuggingFace Dataset

| Language | Train #Q | Train #J | Dev #Q | Dev #J | Links |
|---|---|---|---|---|---|
| Arabic (ar) | 3,495 | 25,382 | 2,896 | 29,197 | 🤗 |
| Bengali (bn) | 1,631 | 16,754 | 411 | 4,206 | 🤗 |
| English (en) | 2,863 | 29,416 | 799 | 8,350 | 🤗 |
| Spanish (es) | 2,162 | 21,531 | 648 | 6,443 | 🤗 |
| Persian (fa) | 2,107 | 21,844 | 632 | 6,571 | 🤗 |
| Finnish (fi) | 2,897 | 20,350 | 1,271 | 12,008 | 🤗 |
| French (fr) | 1,143 | 11,426 | 343 | 3,429 | 🤗 |
| Hindi (hi) | 1,169 | 11,668 | 350 | 3,494 | 🤗 |
| Indonesian (id) | 4,071 | 41,358 | 960 | 9,668 | 🤗 |
| Japanese (ja) | 3,477 | 34,387 | 860 | 8,354 | 🤗 |
| Korean (ko) | 868 | 12,767 | 213 | 3,057 | 🤗 |
| Russian (ru) | 4,683 | 33,921 | 1,252 | 13,100 | 🤗 |
| Swahili (sw) | 1,901 | 9,359 | 482 | 5,092 | 🤗 |
| Telugu (te) | 3,452 | 18,608 | 828 | 1,606 | 🤗 |
| Thai (th) | 2,972 | 21,293 | 733 | 7,573 | 🤗 |
| Chinese (zh) | 1,312 | 13,113 | 393 | 3,928 | 🤗 |
| Total | 40,203 | 343,177 | 13,071 | 126,076 | |

The above table shows the number of queries (`#Q`) and the number of judgments (`#J`) in each (language, split) combination, where the judgments include both positive and negative labels.
The topics are formatted in TSV, with each line organized as follows:
qid\tquery
The relevance judgments are formatted in standard TREC qrels format, as follows:
qid Q0 docid relevance
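
The two formats above can be parsed with a few lines of Python. The sketch below is illustrative only: the function names and the sample query IDs, queries, and labels are made up for the example, and it assumes qrels fields are whitespace-separated as in standard TREC qrels files.

```python
from collections import defaultdict


def load_topics(lines):
    """Parse topics, one 'qid<TAB>query' pair per line, into {qid: query}."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        qid, query = line.split("\t", 1)
        topics[qid] = query
    return topics


def load_qrels(lines):
    """Parse TREC qrels lines 'qid Q0 docid relevance' into {qid: {docid: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _q0, docid, relevance = parts
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


# Hypothetical sample data in the two formats.
topic_lines = ["q1\twhat is albedo\n", "q2\twho discovered penicillin\n"]
qrel_lines = ["q1 Q0 39#0 1\n", "q1 Q0 39#1 0\n", "q2 Q0 123#4 1\n"]

topics = load_topics(topic_lines)
qrels = load_qrels(qrel_lines)

# Counts analogous to #Q and #J: judgments include positive and negative labels.
num_queries = len(qrels)
num_judgments = sum(len(docs) for docs in qrels.values())
print(num_queries, num_judgments)
```

In practice the same functions accept an open file handle, since iterating over a text file yields its lines.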
We have released baselines using BM25, mDPR, and a hybrid of the two, as described in our arXiv paper. Results of BM25 and mDPR can be reproduced using Pyserini.
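
To give a rough idea of what a sparse-dense hybrid can look like, the sketch below fuses two score lists by min-max normalizing each and interpolating them linearly. This is a generic illustration, not necessarily the exact procedure behind the released baseline: the function names, the weight `alpha=0.5`, the choice of normalization, and the sample scores are all assumptions, and the paper remains the reference for the actual setup.

```python
def min_max_normalize(scores):
    """Rescale a {docid: score} dict to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {docid: 0.0 for docid in scores}
    return {docid: (s - lo) / (hi - lo) for docid, s in scores.items()}


def hybrid(sparse, dense, alpha=0.5):
    """Interpolate normalized sparse and dense scores; missing docs contribute 0."""
    sparse_n = min_max_normalize(sparse) if sparse else {}
    dense_n = min_max_normalize(dense) if dense else {}
    fused = {
        docid: alpha * sparse_n.get(docid, 0.0) + (1 - alpha) * dense_n.get(docid, 0.0)
        for docid in set(sparse_n) | set(dense_n)
    }
    # Sort by descending score, breaking ties by docid for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores for a single query from a sparse and a dense retriever.
bm25_scores = {"39#0": 12.0, "39#1": 8.0, "40#0": 4.0}
mdpr_scores = {"39#1": 0.9, "39#0": 0.8, "41#0": 0.5}

ranked = hybrid(bm25_scores, mdpr_scores)
for docid, score in ranked:
    print(docid, round(score, 3))
```

Normalizing first matters because BM25 and dense similarity scores live on different scales, so raw interpolation would let one retriever dominate.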
To reproduce our baselines:
- Install the development version of Pyserini following these instructions. (To run baselines on surprise languages, you'll need to re-build both Anserini and Pyserini.)
- Manually place all topics and qrels files under `tools/topics-and-qrels`. The topics and qrels files can be found under `miracl-v1.0-${lang}/topics` and `miracl-v1.0-${lang}/qrels` in the HuggingFace dataset. For example, run `git clone https://huggingface.co/datasets/miracl/miracl` followed by `mv miracl/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/`.
- Follow the commands in our 2-click-reproduction (2CR) website.
Note that the 2CR above is only for reproducing the search stage, where the indexes are pre-computed and loaded automatically by Pyserini. If you are interested in reproducing the indexing stage, please refer to this documentation:
- mDPR (w/o fine-tuning on MIRACL): `castorini/mdpr-tied-pft-msmarco`
- mContriever (w/o fine-tuning on MIRACL): `facebook/mcontriever-msmarco`
- mDPR (fine-tuned on MIRACL): `castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}`, where `{lang}` is the two-letter ISO code (e.g., `ar`, `bn`, ...)
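
The `{lang}` placeholder can be filled in programmatically. The short sketch below expands the template for the 16 known language codes listed in the tables above; the helper name `checkpoint_name` is hypothetical, and the snippet does not check that a checkpoint is actually published for every code.

```python
# Two-letter codes of the 16 known languages, as listed in the corpus table.
KNOWN_LANGS = [
    "ar", "bn", "en", "es", "fa", "fi", "fr", "hi",
    "id", "ja", "ko", "ru", "sw", "te", "th", "zh",
]

TEMPLATE = "castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}"


def checkpoint_name(lang):
    """Return the fine-tuned mDPR checkpoint name for a known language code."""
    if lang not in KNOWN_LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    return TEMPLATE.format(lang=lang)


names = [checkpoint_name(lang) for lang in KNOWN_LANGS]
print(names[0])
```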
If you find this dataset and repository helpful, please cite MIRACL as follows:
@article{10.1162/tacl_a_00595,
author = {Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy},
title = "{MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {11},
pages = {1114-1131},
year = {2023},
month = {09},
issn = {2307-387X},
doi = {10.1162/tacl_a_00595},
url = {https://doi.org/10.1162/tacl\_a\_00595},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00595/2157340/tacl\_a\_00595.pdf},
}
If you have any questions, feel free to email us (project.miracl [at] gmail.com) or open a GitHub issue in this repository.