Fixed MS MARCO docs to latest version of pyserini==0.9.3.0 (#1238)

lintool authored May 28, 2020
1 parent f3bf7d2 commit 2b8453c
Showing 2 changed files with 41 additions and 47 deletions.
40 changes: 19 additions & 21 deletions docs/experiments-msmarco-passage.md
@@ -9,7 +9,7 @@ We also have a [separate page](experiments-doc2query.md) describing document expansion experiments.
We're going to use `msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO passage dataset:

```
```bash
mkdir collections/msmarco-passage
mkdir indexes/msmarco-passage

@@ -21,17 +21,17 @@ To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b
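One way to check the archive programmatically is with Python's `hashlib`; a small sketch (the archive path assumes the file was downloaded into `collections/msmarco-passage/`):

```python
import hashlib

# Sketch: compute the MD5 of the downloaded archive and compare it
# against the checksum quoted above.
md5 = hashlib.md5()
with open('collections/msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):  # read in 1 MB chunks
        md5.update(chunk)
print(md5.hexdigest())
```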

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

```
python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
```bash
python src/main/python/msmarco/convert_collection_to_jsonl.py \
--collection_path collections/msmarco-passage/collection.tsv --output_folder collections/msmarco-passage/collection_jsonl
```

The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
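For reference, each line in these files is a single JSON object with an `id` and a `contents` field. A simplified sketch of what the conversion amounts to (writing a single output file here for illustration, whereas the real script splits the collection into the 9 files above):

```python
import json
import os

# Sketch: MS MARCO collection.tsv (docid \t passage) -> Anserini jsonl,
# one JSON object per line with "id" and "contents" fields.
os.makedirs('collections/msmarco-passage/collection_jsonl', exist_ok=True)
with open('collections/msmarco-passage/collection.tsv', encoding='utf8') as fin, \
     open('collections/msmarco-passage/collection_jsonl/docs.json', 'w', encoding='utf8') as fout:
    for line in fin:
        docid, text = line.rstrip('\n').split('\t', 1)
        fout.write(json.dumps({'id': docid, 'contents': text}) + '\n')
```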

We can now index these docs as a `JsonCollection` using Anserini:

```
sh ./target/appassembler/bin/IndexCollection -collection JsonCollection \
```bash
sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 9 -input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
```
@@ -43,17 +43,17 @@ The indexing speed may vary... on a modern desktop with an SSD, indexing takes l

Since the dev set contains a large number of queries (over 100k), retrieving all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:

```
python ./src/main/python/msmarco/filter_queries.py --qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries msmarco-passage/queries.dev.tsv --output_queries collections/msmarco-passage/queries.dev.small.tsv
```bash
python src/main/python/msmarco/filter_queries.py --qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv --output_queries collections/msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
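For intuition, the filtering step boils down to keeping the queries whose qid appears in the qrels file; a minimal sketch, assuming the standard tsv layouts with the qid in the first column of both files:

```python
# Sketch: keep only the dev queries whose qid appears in the small qrels file.
with open('collections/msmarco-passage/qrels.dev.small.tsv', encoding='utf8') as f:
    qrels_qids = {line.split('\t')[0] for line in f}

with open('collections/msmarco-passage/queries.dev.tsv', encoding='utf8') as fin, \
     open('collections/msmarco-passage/queries.dev.small.tsv', 'w', encoding='utf8') as fout:
    for line in fin:
        if line.split('\t')[0] in qrels_qids:
            fout.write(line)
```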

We can now retrieve this smaller set of queries:

```
python ./src/main/python/msmarco/retrieve.py --hits 1000 --threads 1 \
```bash
python src/main/python/msmarco/retrieve.py --hits 1000 --threads 1 \
--index indexes/msmarco-passage/lucene-index-msmarco --qid_queries collections/msmarco-passage/queries.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.tsv
```
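Under the hood, `retrieve.py` uses pyserini's `SimpleSearcher` (see the second changed file below). A minimal sketch of the per-query logic, where the qid, query, and output path are illustrative and we assume the MS MARCO run format of `qid \t docid \t rank`:

```python
from pyserini.search import pysearch

# Sketch of the core of retrieve.py for a single query.
searcher = pysearch.SimpleSearcher('indexes/msmarco-passage/lucene-index-msmarco')
searcher.set_bm25(0.82, 0.68)  # the script's default BM25 parameters

qid, query = '0', 'example query text'  # illustrative qid/query pair
hits = searcher.search(query, 1000)

with open('runs/run.sketch.tsv', 'w') as fout:
    for rank, hit in enumerate(hits, start=1):
        # MS MARCO run format: qid \t docid \t rank
        fout.write('{}\t{}\t{}\n'.format(qid, hit.docid, rank))
```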
@@ -67,8 +67,8 @@ On a modern desktop with an SSD, we can get ~0.06 s/query (taking about seven minutes)

Alternatively, we can run the same script implemented in Java, which is a bit faster:

```
./target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
```bash
sh target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
-index indexes/msmarco-passage/lucene-index-msmarco -qid_queries collections/msmarco-passage/queries.dev.small.tsv \
-output runs/run.msmarco-passage.dev.small.tsv
```
@@ -77,8 +77,8 @@ Similarly, we can perform multithreaded retrieval by changing the `-threads` argument.

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

```
python ./src/main/python/msmarco/msmarco_eval.py \
```bash
python src/main/python/msmarco/msmarco_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv
```
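For reference, MRR@10 — the number the script reports — is the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears), averaged over the ranked queries. A minimal sketch:

```python
# Sketch of MRR@10, assuming qrels maps qid -> set of relevant docids
# and run maps qid -> list of docids in rank order.
def mrr_at_10(qrels, run):
    total = 0.0
    for qid, ranked_docids in run.items():
        for rank, docid in enumerate(ranked_docids[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)
```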

@@ -94,18 +94,18 @@ QueriesRanked: 6980
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
For that we first need to convert runs and qrels files to the TREC format:

```
python ./src/main/python/msmarco/convert_msmarco_to_trec_run.py \
```bash
python src/main/python/msmarco/convert_msmarco_to_trec_run.py \
--input_run runs/run.msmarco-passage.dev.small.tsv --output_run runs/run.msmarco-passage.dev.small.trec

python ./src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
python src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
--input_qrels collections/msmarco-passage/qrels.dev.small.tsv --output_qrels collections/msmarco-passage/qrels.dev.small.trec
```
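The conversion is mostly a change of format: the MS MARCO run file has one `qid \t docid \t rank` triple per line, while `trec_eval` expects `qid Q0 docid rank score tag`. A simplified sketch of the run conversion (the actual script may derive the score and run tag differently):

```python
# Simplified sketch: MS MARCO run format (qid \t docid \t rank)
# -> TREC run format (qid Q0 docid rank score tag).
with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf8') as fin, \
     open('runs/run.msmarco-passage.dev.small.trec', 'w', encoding='utf8') as fout:
    for line in fin:
        qid, docid, rank = line.rstrip('\n').split('\t')
        # trec_eval needs a score that decreases with rank; 1/rank is used here as a stand-in.
        fout.write('{} Q0 {} {} {} Anserini\n'.format(qid, docid, rank, 1.0 / int(rank)))
```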

And run the `trec_eval` tool:

```
./eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
```bash
eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
```

@@ -145,8 +145,6 @@ Setting | MRR@10 | MAP | Recall@1000 |
Default (`k1=0.9`, `b=0.4`) | 0.1839 | 0.1925 | 0.8526
Tuned (`k1=0.82`, `b=0.72`) | 0.1875 | 0.1956 | 0.8578
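To use the tuned parameters with the Python retrieval script, pass `--k1 0.82 --b 0.72`; equivalently, they can be set directly on the searcher. A minimal sketch:

```python
from pyserini.search import pysearch

# Sketch: apply the tuned BM25 parameters from the table above.
searcher = pysearch.SimpleSearcher('indexes/msmarco-passage/lucene-index-msmarco')
searcher.set_bm25(0.82, 0.72)
```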



## Replication Log

+ Results replicated by [@ronakice](https://github.com/ronakice) on 2019-08-12 (commit [`5b29d16`](https://github.com/castorini/anserini/commit/5b29d1654abc5e8a014c2230da990ab2f91fb340))
48 changes: 22 additions & 26 deletions src/main/python/msmarco/retrieve.py
@@ -1,36 +1,32 @@
# -*- coding: utf-8 -*-
'''
Anserini: A Lucene toolkit for replicable information retrieval research
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
'''
#
# Pyserini: Python interface to the Anserini IR toolkit built on Lucene
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import time

# Pyserini setup
import os, sys
sys.path += ['src/main/python']

from pyserini.search import pysearch

if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Retrieve MS MARCO Passages.')
parser.add_argument('--qid_queries', required=True, default='', help='query id - query mapping file')
parser.add_argument('--output', required=True, default='', help='output file')
parser.add_argument('--index', required=True, default='', help='index path')
parser.add_argument('--hits', default=10, help='number of hits to retrieve')
parser.add_argument('--k1', default=0.82, help='BM25 k1 parameter')
parser.add_argument('--b', default=0.68, help='BM25 b parameter')
parser.add_argument('--hits', default=10, type=int, help='number of hits to retrieve')
parser.add_argument('--k1', default=0.82, type=float, help='BM25 k1 parameter')
parser.add_argument('--b', default=0.68, type=float, help='BM25 b parameter')
# See our MS MARCO documentation to understand how these parameter values were tuned.
parser.add_argument('--rm3', action='store_true', default=False, help='use RM3')
parser.add_argument('--fbTerms', default=10, type=int, help='RM3 parameter: number of expansion terms')
@@ -43,10 +39,10 @@
total_start_time = time.time()

searcher = pysearch.SimpleSearcher(args.index)
searcher.set_bm25_similarity(float(args.k1), float(args.b))
searcher.set_bm25(args.k1, args.b)
print('Initializing BM25, setting k1={} and b={}'.format(args.k1, args.b), flush=True)
if args.rm3:
searcher.set_rm3_reranker(args.fbTerms, args.fbDocs, args.originalQueryWeight)
searcher.set_rm3(args.fbTerms, args.fbDocs, args.originalQueryWeight)
print('Initializing RM3, setting fbTerms={}, fbDocs={} and originalQueryWeight={}'.format(args.fbTerms, args.fbDocs, args.originalQueryWeight), flush=True)

if args.threads == 1:
@@ -55,7 +51,7 @@
start_time = time.time()
for line_number, line in enumerate(open(args.qid_queries, 'r', encoding='utf8')):
qid, query = line.strip().split('\t')
hits = searcher.search(query.encode('utf8'), int(args.hits))
hits = searcher.search(query, args.hits)
if line_number % 100 == 0:
time_per_query = (time.time() - start_time) / (line_number + 1)
print('Retrieving query {} ({:0.3f} s/query)'.format(line_number, time_per_query), flush=True)
@@ -72,7 +68,7 @@
qids.append(qid)
queries.append(query)

results = searcher.batch_search(queries, qids, args.hits, -1, args.threads)
results = searcher.batch_search(queries, qids, args.hits, args.threads)

with open(args.output, 'w') as fout:
for qid in qids:
