Anserini: BM25 Baselines for MS MARCO Document Ranking

This page contains instructions for running BM25 baselines on the MS MARCO document ranking task. Note that there is a separate MS MARCO passage ranking task.

Setup Note: If you're instantiating an Ubuntu VM on your own machine or in the cloud (e.g., AWS or GCP), provision enough resources up front, since tasks such as building the index can take a while: RAM > 8 GB and storage > 100 GB (preferably SSD) are recommended. This will save you from having to go back and reconfigure the machine repeatedly.
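For instance, on a Linux VM you can quickly check whether the machine meets these requirements before starting:

free -h    # total and available RAM; we want more than 8 GB
df -h .    # free disk space on the current filesystem; we want more than 100 GB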

If you're a Waterloo undergraduate going through this guide as the screening exercise for joining my research group, make sure you do the passage ranking exercise first. Similarly, try to understand what you're actually doing, instead of simply cargo culting (i.e., blindly copying and pasting commands into a shell).

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:

mkdir collections/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc

# Alternative mirror:
# wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-docs.trec.gz -P collections/msmarco-doc

To confirm, msmarco-docs.trec.gz should have MD5 checksum of d4863e4f342982b51b9a8fc668b2d0c0.
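For example, you can verify the download with md5sum (on macOS, use md5 instead):

md5sum collections/msmarco-doc/msmarco-docs.trec.gz
# Expected: d4863e4f342982b51b9a8fc668b2d0c0  collections/msmarco-doc/msmarco-docs.trec.gz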

Indexing

There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:

sh target/appassembler/bin/IndexCollection -threads 1 -collection CleanTrecCollection \
 -generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc \
 -index indexes/msmarco-doc/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw

On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed.
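As an optional sanity check, you can confirm that count directly against the collection by counting <DOC> markers in the gzipped TREC file (this streams through the entire file, so it takes a few minutes; it assumes one <DOC> entry per document):

gunzip -c collections/msmarco-doc/msmarco-docs.trec.gz | grep -c '<DOC>'
# Should print 3213835, matching the number of indexed documents.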

Retrieval

After indexing finishes, we can do a retrieval run. The dev queries are already stored in our repo:

target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
 -index indexes/msmarco-doc/lucene-index-msmarco \
 -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
 -output runs/run.msmarco-doc.dev.bm25.txt -bm25

Retrieval speed will vary by machine: on a reasonably modern desktop with an SSD, the run takes less than five minutes with four threads (as specified above). You can adjust the degree of parallelism via the -parallelism argument.

After the run completes, we can evaluate with trec_eval:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2310
recall_1000           	all	0.8856
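In case you're wondering what trec_eval just consumed, each line of the run file follows the standard TREC run format (exact document ids and scores will vary from machine to machine):

# Each line is: qid Q0 docid rank score run_tag (MS MARCO docids look like D1234567)
head -3 runs/run.msmarco-doc.dev.bm25.txt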

Let's compare to the baselines provided by Microsoft. First, download:

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz

Then, run trec_eval to compare. For a fair comparison, we restrict evaluation to the top 100 hits per topic (which is what Microsoft provides):

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map                   	all	0.2219

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2303

We see that "out of the box" Anserini is already better!

This dataset is part of the MS MARCO Document Ranking Leaderboard. Let's try to reproduce some of the runs on it!

A few minor details to pay attention to: the official metric is MRR@100, so we only want to return the top 100 hits, and the submission files to the leaderboard have a slightly different format.

target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
 -index indexes/msmarco-doc/lucene-index-msmarco \
 -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
 -output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -format msmarco \
 -bm25 -bm25.k1 0.9 -bm25.b 0.4

The command above uses the default BM25 parameters (k1=0.9, b=0.4); note that we set -hits 100. Command for evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt 
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
#####################

The above run corresponds to "Anserini's BM25, default parameters (k1=0.9, b=0.4)" on the leaderboard.
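If you're curious how the leaderboard submission format differs from the TREC runs above, you can compare the two files side by side (the exact ids and scores will vary; the point is the column layout, and the description of the msmarco layout below is my reading of the output, so verify against the leaderboard instructions):

# Standard TREC run format: qid Q0 docid rank score run_tag
head -2 runs/run.msmarco-doc.dev.bm25.txt
# Leaderboard format from -format msmarco: stripped-down, tab-separated (no Q0/score/tag columns)
head -2 runs/run.msmarco-doc.leaderboard-dev.bm25base.txt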

Here's the invocation for BM25 with parameters optimized for recall@100 (k1=4.46, b=0.82):

target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
 -index indexes/msmarco-doc/lucene-index-msmarco \
 -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
 -output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -format msmarco \
 -bm25 -bm25.k1 4.46 -bm25.b 0.82

Command for evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt 
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

More details on tuning BM25 parameters below...

BM25 Tuning

It is well known that BM25 parameter tuning is important. The setting of k1=0.9, b=0.4 is often used as a default.

Let's try to do better! We tuned BM25 using the queries found here: these are five different sets of 10k samples from the training queries (using the shuf command). The basic approach is grid search of parameter values in tenth increments. We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization). In separate trials, we optimized for:

  • recall@100, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
  • MRR@100, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).

It turns out that optimizing for MRR@100 and MAP yields the same settings.
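If you want to replicate the tuning procedure yourself, here's a minimal sketch of one grid-search pass over a single sample of training queries. The topics and qrels paths for the sampled queries are placeholders (substitute your own 10k sample drawn with shuf), the metric reported is just illustrative, and keep in mind that this launches hundreds of retrieval runs:

for k1 in $(seq 0.1 0.1 5.0); do
  for b in $(seq 0.1 0.1 1.0); do
    # Retrieval with this (k1, b) combination on the sampled training queries
    target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
      -index indexes/msmarco-doc/lucene-index-msmarco \
      -topicreader TsvInt -topics path/to/queries.train.sample.tsv \
      -output runs/run.train.k1_${k1}.b_${b}.txt \
      -bm25 -bm25.k1 ${k1} -bm25.b ${b}
    # Score the run against the qrels for the sampled queries
    echo "k1=${k1} b=${b}"
    tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
      path/to/qrels.train.sample.txt runs/run.train.k1_${k1}.b_${b}.txt
  done
done

Pick the setting with the best value of the metric you care about, repeat for each of the five samples, and average the selected k1 and b values across samples.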

Here's the comparison between different parameter settings:

| Setting                                    | MRR@100 | MAP    | Recall@1000 |
|:-------------------------------------------|--------:|-------:|------------:|
| Default (k1=0.9, b=0.4)                    | 0.2301  | 0.2310 | 0.8856      |
| Optimized for MRR@100/MAP (k1=3.8, b=0.87) | 0.2784  | 0.2789 | 0.9326      |
| Optimized for recall@100 (k1=4.46, b=0.82) | 0.2770  | 0.2775 | 0.9357      |

As expected, BM25 tuning makes a big difference!

Note that MRR@100 is computed with the leaderboard eval script (with 100 hits per query), while the other two metrics are computed with trec_eval (with 1000 hits per query). So, we need to generate separate runs and use different evaluation tools, for example:

$ target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
   -index indexes/msmarco-doc/lucene-index-msmarco \
   -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
   -output runs/run.msmarco-doc.dev.opt-mrr.txt \
   -bm25 -bm25.k1 3.8 -bm25.b 0.87

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
map                   	all	0.2789
recall_1000           	all	0.9326

$ target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
   -index indexes/msmarco-doc/lucene-index-msmarco \
   -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
   -output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -format msmarco \
   -bm25 -bm25.k1 3.8 -bm25.b 0.87

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
#####################
MRR @100: 0.27836767424339787
QueriesRanked: 5193
#####################

That's it!

Reproduction Log*