Given a collection of unpaired queries and documents, this repository offers a simple, cost-efficient method for constructing an evaluation dataset for retrieval quality: it automatically labels the gold document for each query using GPT-4. Instead of having GPT-4 review the entire corpus for each query, which incurs a cost linear in the corpus size and is often prohibitive, we use a variety of embedding models to pre-filter the corpus into a small set of candidate documents, within which GPT-4 identifies the gold documents.
For a corpus of size N, this reduces the number of GPT-4 judgments per query from N to the number of pre-filtered candidates (at most the top-k from each embedding model); with --topk 20, that is on the order of tens of calls per query rather than the full corpus size.
We recommend using Conda for installation.
conda create -n auto_retrieval_eval_env python=3.10
conda activate auto_retrieval_eval_env
pip install -r requirements.txt
Store the queries and documents in the data/{task_name} directory:
queries.jsonl should include the query id and query text:
{"id": "00000", "text": "This is the text of the first query."}
{"id": "00001", "text": "This is the text of the second query."}
corpus.jsonl should contain the document id and document text:
{"id":"000000000", "text": "This is the text of the first document."}
{"id":"000000001", "text": "This is the text of the second document."}
We need to invoke APIs for embedding models and generative language models. To configure the API keys via environment variables, store them in a .env file located in the root directory of the project.
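For the models used below, a .env file might look like this (the exact variable names depend on the clients the scripts use; OPENAI_API_KEY and VOYAGE_API_KEY are the defaults read by the official OpenAI and Voyage AI Python clients):

OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=pa-...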
For each query, use a set of embedding models to create a pool of pre-filtered candidate documents. The generated candidates are saved under ./data/{task_name}/meta_data.
python prefilter_pairs.py --task-name example_task --embedding-models voyage-large-2,text-embedding-3-large --topk 20
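Conceptually, the pre-filtering for a single embedding model reduces to a top-k nearest-neighbor search. The sketch below illustrates the idea; it assumes pre-computed, L2-normalized embeddings and is not the repository's actual code:

import numpy as np

def top_k_candidates(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 20) -> np.ndarray:
    # query_emb has shape (d,); doc_embs has shape (n, d). With L2-normalized
    # embeddings, the dot product equals cosine similarity.
    scores = doc_embs @ query_emb      # similarity of every document to the query
    return np.argsort(-scores)[:k]     # indices of the k highest-scoring documents

# With several embedding models, the candidate set per query can be built as the
# union of each model's top-k list, so a document surfaced by any model is kept.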
For each pre-filtered pair, use GPT-4 to determine whether the query and document constitute a relevant match. Each pair is assessed against criteria divided into four levels: reject (label 1), borderline reject (label 2), borderline accept (label 3), and accept (label 4). We select the valid pairs (those labeled 3 or 4) to build query-document datasets for text retrieval evaluation. The final pairs are saved in ./data/{task_name}/relevance.json, a single dictionary indexed as relevance[query_id][document_id].
python label_pairs.py --task-name example_task --topk 20 --generative-model gpt-4o
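Once relevance.json exists, downstream evaluation code can consume the dictionary directly. A sketch, assuming labels are stored as integers (the exact schema may differ):

import json

with open("data/example_task/relevance.json") as f:
    relevance = json.load(f)

# relevance[query_id][document_id] holds the judged label; labels 3 and 4
# mark the gold documents for each query.
for query_id, docs in relevance.items():
    gold = [doc_id for doc_id, label in docs.items() if int(label) >= 3]
    print(query_id, gold)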