This code is based on OpenMatch.
git clone [email protected]:Hanpx20/GroupDRO_Dense_Retrieval.git
cd GroupDRO_Dense_Retrieval   # enter the cloned repository
python setup.py install
cd src/openMatch
pip install -e .              # install the bundled OpenMatch in editable mode
Besides the requirements listed above, you also need to install a modified version of Transformers that is adapted to our model.
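As a quick sanity check that both packages are importable (a one-line sketch; it does not verify that the installed versions are compatible):
python -c "import openmatch, transformers; print(transformers.__version__)"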
Step 1: train the embedding model.
prerequisites:
corpus.tsv (format: d_id title content url)
queries.train.tsv (format: q_id query)
qrels.train.tsv (format: q_id _ d_id 1)
a tokenizer for our model
According to the paper, the training data should be links extracted from the Internet. Illustrative rows for these files are shown below.
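For concreteness, single rows of the three files might look like this (fields are tab-separated; every id, string, and URL here is a made-up example):
corpus.tsv:
D100	Example Page Title	Example document body text	https://example.com/page
queries.train.tsv:
Q1	example query text
qrels.train.tsv:
Q1	_	D100	1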
Run: experiments/embedding_model.sh
After this, you should have an embedding model.
Step 2: cluster the training data.
prerequisites:
corpus.tsv (format: d_id title content)
queries.train.tsv (format: q_id query)
qrels.train.tsv (format: q_id _ d_id 1)
According to the paper, the training data should be anchor-document pairs.
Run: experiments/cluster.sh
After this, you should have a qrel file with cluster ids.
Step 3: train the retrieval model with GroupDRO.
prerequisites:
corpus.tsv (format: d_id title content)
queries.train.tsv (format: q_id query)
[cluster_name]/qrels.train.tsv (format: q_id _ d_id 1 cluster_id)
[cluster_name]/counter.pt (a file recording the size of each cluster)
An illustrative clustered qrel row is shown below.
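For example, a clustered qrel row could look like this (made-up ids; the final column is the cluster id produced in Step 2):
Q1	_	D100	1	3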
Run: experiments/DRO_dense_retrieval.sh
After this, you should have a retrieval model trained with GroupDRO.
Evaluation.
prerequisites: a model to be evaluated
Run: experiments/eval_marco.sh for MS MARCO
Run: experiments/eval_beir.sh [model_name] for BEIR
You need to set certain variables inside these files MANUALLY.
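Once those variables are set, a typical evaluation run is simply the following (the BEIR model name below is a made-up placeholder):
bash experiments/eval_marco.sh
bash experiments/eval_beir.sh my_dro_model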
To better organize files, we recommend using the following conventions:
export BASE_DIR=...
export COLLECTION_DIR=$BASE_DIR/data
export PROCESSED_DIR=$BASE_DIR/processed_data
export PLM_DIR=$BASE_DIR/models
export CHECKPOINT_DIR=$BASE_DIR/ckpts
export LOG_DIR=$BASE_DIR/log
export EMBEDDING_DIR=$BASE_DIR/embedding
export RESULT_DIR=$BASE_DIR/res
You can refer to the OpenMatch documentation for more information.
openmatch.driver.build_index and openmatch.driver.retrieve can also be accelerated with distributed execution across multiple GPUs.
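For example, index building and retrieval could each be launched on four GPUs like this (a sketch only: torchrun works with any python -m entry point, but the scripts' own flags are omitted here, so take the exact arguments from the OpenMatch documentation):
torchrun --nproc_per_node=4 -m openmatch.driver.build_index [args...]
torchrun --nproc_per_node=4 -m openmatch.driver.retrieve [args...]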
You can download our model through the Hugging Face Transformers library.
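For instance (a minimal sketch; the model id below is a placeholder rather than the actual released checkpoint name, so substitute the id from our Hugging Face page):
python - <<'EOF'
from transformers import AutoTokenizer, AutoModel
name = "ORG/MODEL_NAME"  # placeholder model id, not the real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
EOF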