JPQ

The official repo for our CIKM'21 Full paper, Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance (poster, presentation record).



Quick Tour

JPQ greatly improves the efficiency of Dense Retrieval. It compresses the index by 30x with negligible performance loss, and it speeds up query processing by 10x on CPU and 2x on GPU.

Here is the effectiveness vs. index size (log scale) trade-off on MS MARCO Passage Ranking. Rather than trading index size for ranking performance, JPQ achieves high ranking effectiveness with a tiny index.

Results at different trade-off settings are shown below.

*Figures: MS MARCO Passage Ranking (left) and MS MARCO Document Ranking (right).*

JPQ remains very effective even when the compression ratio exceeds 100x, and it outperforms the baselines at all compression ratio settings. For more details, please refer to our paper.

Models and Indexes

You can download trained models and indexes from our dropbox link. After opening this link in your browser, you will see two folders, doc and passage, corresponding to MS MARCO document ranking and passage ranking. Each of them contains two folders, trained_models and indexes: trained_models holds the trained query encoders, and indexes holds the trained PQ indexes. Note: the pid in the index is actually the row number of a passage in the collection.tsv file, not the official pid provided by MS MARCO. Different query encoders and indexes correspond to different compression ratios. For example, the query encoder named m32.tar.gz or the index named OPQ32,IVF1,PQ32x8.index uses 32 bytes per doc, i.e., a 768*4/32=96x compression ratio.
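The compression-ratio arithmetic above can be written out explicitly. This is a minimal sketch (the helper name is ours): 768 is the embedding dimension and 4 is the size of a float32 in bytes, so an uncompressed index costs 768*4 = 3072 bytes per doc, versus M bytes per doc for a PQ index.

```python
def compression_ratio(bytes_per_doc: int, dim: int = 768, float_bytes: int = 4) -> float:
    """Size ratio between an uncompressed float32 index and a PQ index
    that spends `bytes_per_doc` (M) bytes per document."""
    return dim * float_bytes / bytes_per_doc

# m32.tar.gz / OPQ32,IVF1,PQ32x8.index -> 32 bytes per doc
print(compression_ratio(32))  # 96.0, i.e., 96x compression
print(compression_ratio(96))  # 32.0, i.e., 32x compression
```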

You can easily download the files using download_query_encoder.sh and download_index.sh. Just run:

```shell
sh ./cmds/download_query_encoder.sh
sh ./cmds/download_index.sh
```

Ranking Results

We open-source the ranking results in our dropbox links: passage rank link, document rank link. The msmarco-dev and trec19 folders correspond to MS MARCO development queries and TREC 2019 DL queries, respectively. In either folder, for each m value, we provide two ranking files corresponding to different text-id mappings. The one prefixed with 'official' uses the official MS MARCO / TREC 2019 text-id mapping, so you can directly use the official qrel files to evaluate the ranking. The other uses the mapping generated by our preprocessing, where the line offset serves as the id. Both files give the same metric numbers. The files are generated by run_retrieve.sh; please see the retrieval section for how to produce these ranking results.
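The two output formats mentioned above can be sketched as follows (a sketch with toy ids; the run tag is an assumption): MS MARCO dev files are tab-separated `qid`, `pid`, `rank`, while TREC files follow the standard six-column run format.

```python
def marco_line(qid: int, pid: int, rank: int) -> str:
    """MS MARCO dev ranking line: qid <TAB> pid <TAB> rank."""
    return f"{qid}\t{pid}\t{rank}"

def trec_line(qid: int, docid: str, rank: int, score: float, tag: str = "JPQ") -> str:
    """Standard TREC run line: qid Q0 docid rank score run_tag."""
    return f"{qid} Q0 {docid} {rank} {score:.4f} {tag}"

print(marco_line(1048585, 7187158, 1))        # tab-separated triple
print(trec_line(19335, "D1035833", 1, 212.5))  # 19335 Q0 D1035833 1 212.5000 JPQ
```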

UPDATE 2021/11/9: We additionally released the ranking results for queries from the TREC 2020 Deep Learning Track. They are available via the same dropbox links provided above. When M is set to 96, i.e., a 32x compression ratio, JPQ achieves 0.580 and 0.671 NDCG@10 for document and passage ranking, respectively.

Requirements

This repo needs the following libraries (Python 3.x):

```
torch >= 1.9.0
transformers >= 4.3.3
faiss-gpu == 1.7.1
tensorboard >= 2.5.0
boto3
```

Preprocess

Here are the commands for preprocessing/tokenization.

If you do not have the MS MARCO dataset, run the following command:

```shell
sh ./cmds/download_marco.sh
```

Preprocessing (tokenizing) only requires a simple command:

```shell
python -m jpq.preprocess --data_type 0
python -m jpq.preprocess --data_type 1
```

It will create two directories, ./data/passage/preprocess and ./data/doc/preprocess. We map the original qid/pid to new ids, namely their row numbers in the file. The mapping is saved to pid2offset.pickle and qid2offset.pickle, and new qrel files (train/dev/test-qrel.tsv) are generated. The passages and queries are tokenized and saved in numpy memmap files.
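The id remapping described above can be sketched in a few lines: the new pid of a passage is simply its row number in collection.tsv. The toy data below is invented; the file name mirrors the one used in the text.

```python
import io

# Stand-in for collection.tsv: official pid, then the passage text.
collection = io.StringIO(
    "8841823\tsome passage text\n"
    "2912791\tanother passage\n"
)

# Map each official pid to its row number (the new id used by the index).
pid2offset = {}
for offset, line in enumerate(collection):
    official_pid = int(line.split("\t", 1)[0])
    pid2offset[official_pid] = offset

print(pid2offset)  # {8841823: 0, 2912791: 1}
```

In the repo this dictionary is what ends up in pid2offset.pickle.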

Note: JPQ, like our SIGIR'21 models, uses the Transformers 2.x version to tokenize text. However, from Transformers 3.x/4.x onwards, RobertaTokenizer behaves differently. To support reproducibility, we copied the RobertaTokenizer source code from the 2.x version into star_tokenizer.py. During preprocessing, we use from star_tokenizer import RobertaTokenizer instead of from transformers import RobertaTokenizer. You should do the same if you use our JPQ model on other datasets.

Evaluate Open-sourced Checkpoints

TREC 2019 Retrieval

Our paper uses datasets from the TREC 2019 Deep Learning Track. This section shows how to reproduce the reported results using our open-sourced models and indexes. Since we use the TREC_EVAL toolkit for evaluation, please download and compile it:

```shell
sh ./cmds/download_trec_eval.sh
```

We show how to retrieve candidates and evaluate the results in run_retrieve.sh. Just run:

```shell
sh cmds/run_retrieve.sh
```

Then you are expected to get the results reported in our paper.

run_retrieve.sh calls run_retrieval.py. The arguments of this evaluation script are as follows:

- `--preprocess_dir`: Preprocess directory.
  - `./data/passage/preprocess`: default dir for passage preprocessing.
  - `./data/doc/preprocess`: default dir for document preprocessing.
- `--mode`: Evaluation mode.
  - `dev`: run retrieval for MS MARCO development queries.
  - `test`: run retrieval for TREC 2019 DL Track queries.
- `--index_path`: Index path.
- `--query_encoder_dir`: Query encoder dir, which contains config.json and pytorch_model.bin.
- `--output_path`: Output ranking file path, formatted following the MS MARCO guideline (`qid\tpid\trank`) for the dev set or the TREC guideline for the test set.
- `--max_query_length`: Max query length, default: 32.
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve top-k passages/documents.
- `--gpu_search`: Whether to use GPU for embedding search.
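Putting the arguments together, a hypothetical invocation for TREC 2019 passage retrieval might look like this. Every path below is an assumption based on the download layout described earlier, and we assume the script is invoked as a module like `jpq.preprocess` above; adjust to where you extracted the files.

```shell
python -m jpq.run_retrieval \
    --preprocess_dir ./data/passage/preprocess \
    --mode test \
    --index_path ./data/passage/indexes/OPQ96,IVF1,PQ96x8.index \
    --query_encoder_dir ./data/passage/trained_models/m96 \
    --output_path ./data/passage/trec19_m96.rank.tsv \
    --max_query_length 32 \
    --batch_size 32 \
    --topk 100 \
    --gpu_search
```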

TREC 2020 Retrieval

Here we provide instructions on how to retrieve candidates for TREC 2020 queries, which is not included in the paper. We use this retrieval script, which supports on-the-fly query tokenization.

Please download the TREC 2020 queries:

```shell
sh ./cmds/download_trec20.sh
```

Run this shell script for retrieval and evaluation:

```shell
sh ./cmds/run_tokenize_retrieve.sh
```

It calls tokenize_retrieve. The arguments of this evaluation script are as follows:

- `--query_file_path`: Query file in TREC format.
- `--index_path`: Index path.
- `--query_encoder_dir`: Query encoder dir, which contains config.json and pytorch_model.bin.
- `--output_path`: Output ranking file path.
- `--pid2offset_path`: Used only for converting offset pids to official pids.
- `--dataset`: "doc" or "passage". Used when converting offset pids to official pids, because MS MARCO document ids carry a 'D' prefix.
- `--max_query_length`: Max query length, default: 32.
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve top-k passages/documents.
- `--gpu_search`: Whether to use GPU for embedding search.
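The offset-to-official id conversion mentioned for `--pid2offset_path` and `--dataset` can be sketched as follows. The helper name, the inverted mapping, and the toy values are ours; only the 'D' prefix rule comes from the text.

```python
def offset_to_official(offset: int, offset2pid: dict, dataset: str) -> str:
    """Convert a row-number id back to an official MS MARCO id.

    MS MARCO document ids carry a 'D' prefix; passage ids are bare integers.
    """
    official = offset2pid[offset]
    return f"D{official}" if dataset == "doc" else str(official)

# Toy mapping; in practice this comes from inverting pid2offset.pickle.
offset2pid = {0: 1555982, 1: 301595}
print(offset_to_official(0, offset2pid, "doc"))      # D1555982
print(offset_to_official(1, offset2pid, "passage"))  # 301595
```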

Zero-shot Retrieval

This section shows how to use JPQ on other datasets in a zero-shot fashion. Please download the JPQ dual-encoders by running:

```shell
sh ./cmds/download_jpq_encoders.sh
```

In fact, these query encoders are equivalent to those in Models and Indexes, and the document encoder is equivalent to the STAR model. The difference is that they are objects of the JPQTower class, which bundles the encoding parameters and the PQ index parameters. Thanks to this, we can easily adapt JPQ to other datasets in a zero-shot fashion.

Note: the downloaded dual-encoders are trained on the MS MARCO passage ranking task. We do not use the ones trained on the document ranking task because they are trained with URLs, which are often not available in other datasets.

We use BEIR as an example because it covers a wide range of datasets. For your own dataset, you only need to format it in the same way as BEIR and you are good to go. Now, we show how to use JPQ on the TREC-COVID dataset. Run:

```shell
sh ./cmds/run_eval_beir.sh trec-covid
```

You can also replace trec-covid with other datasets, such as nq. The script calls eval_beir.py. The arguments are as follows:

- `--dataset`: Dataset name in BEIR.
- `--beir_data_root`: Where to save the BEIR dataset.
- `--query_encoder`: Path to the JPQ query encoder.
- `--doc_encoder`: Path to the JPQ document encoder.
- `--split`: test/dev/train.
- `--encode_batch_size`: Batch size, default: 64.
- `--output_index_path`: Optional; where to save the compact index. If the file already exists, it is loaded to save the corpus-encoding time.
- `--output_ranking_path`: Optional; where to save the retrieval results.

Here are the NDCG@10 results on several datasets when M=96, i.e., a 32x compression ratio:

| Model | TREC-COVID | NFCorpus | NQ | HotpotQA | FiQA-2018 | ArguAna | Touche-2020 | Quora | DBPedia | SCIDOCS | FEVER | Climate-FEVER | SciFact |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ANCE (Uncompressed) | 0.654 | 0.237 | 0.446 | 0.456 | 0.295 | 0.415 | 0.284 | n.a. | 0.281 | 0.122 | 0.669 | 0.198 | 0.507 |
| JPQ (32x Compression) | 0.636 | 0.272 | 0.449 | 0.450 | 0.286 | 0.429 | 0.200 | 0.853 | 0.304 | 0.120 | 0.636 | 0.194 | 0.531 |

Even though JPQ compresses the index by 32x, it achieves ranking performance on par with or even better than ANCE, a competitive uncompressed Dense Retrieval model.

Training

JPQ is initialized from STAR. STAR trained on passage ranking is available here; STAR trained on document ranking is available here.

First, use STAR to encode the corpus and run OPQ to initialize the index. For example, on the document ranking task, run:

```shell
python -m jpq.run_init \
  --preprocess_dir ./data/doc/preprocess/ \
  --model_dir ./data/doc/star \
  --max_doc_length 512 \
  --output_dir ./data/doc/init \
  --subvector_num 96
```

On passage ranking task, you can set the max_doc_length to 256 for faster inference.

Now you can train the query encoder and the PQ index. For example, on the document ranking task, the command is:

```shell
python -m jpq.run_train \
    --preprocess_dir ./data/doc/preprocess \
    --model_save_dir ./data/doc/train/m96/models \
    --log_dir ./data/doc/train/m96/log \
    --init_index_path ./data/doc/init/OPQ96,IVF1,PQ96x8.index \
    --init_model_path ./data/doc/star \
    --lambda_cut 10 \
    --centroid_lr 1e-4 \
    --train_batch_size 32
```

--gpu_search is optional for fast GPU search during training. lambda_cut should be set to 200 for the passage ranking task. centroid_lr differs across compression ratios: let M be the number of subvectors; centroid_lr is 5e-6 for M = 16/24, 2e-5 for M = 32, and 1e-4 for M = 48/64/96. The number of training epochs is set to 6. In fact, the performance is already quite satisfying after 1 or 2 epochs. Each epoch costs less than 2 hours on our machine.
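The hyper-parameter schedule above, written out as code (the values are exactly the ones stated in the text; the helper names are ours):

```python
def centroid_lr(m: int) -> float:
    """Centroid learning rate as a function of the number of subvectors M."""
    if m in (16, 24):
        return 5e-6
    if m == 32:
        return 2e-5
    if m in (48, 64, 96):
        return 1e-4
    raise ValueError(f"no published setting for M={m}")

def lambda_cut(task: str) -> int:
    """lambda_cut: 200 for passage ranking, 10 for document ranking."""
    return 200 if task == "passage" else 10

print(centroid_lr(96), lambda_cut("doc"))  # 0.0001 10
```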

Citation

If you find this repo useful, please consider citing our work:

```bibtex
@inproceedings{zhan2021jointly,
  author = {Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
  title = {Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance},
  year = {2021},
  isbn = {9781450384469},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3459637.3482358},
  doi = {10.1145/3459637.3482358},
  pages = {2487--2496},
  numpages = {10},
  location = {Virtual Event, Queensland, Australia},
  series = {CIKM '21}
}
```

Related Work