You can't pick your neighbors, or can you? When and how to rely on retrieval in the kNN-LM

This repository is an optimized version of urvashik/knnlm and includes scripts to reproduce the experiments from our EMNLP 2022 Findings paper.

@inproceedings{drozdov2022knnlm,
    title = "You can't pick your neighbors, or can you? {W}hen and how to rely on retrieval in the {kNN-LM}",
    author = "Andrew Drozdov and Shufan Wang and Razieh Rahimi and Andrew McCallum and Hamed Zamani and Mohit Iyyer",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    year = "2022"
}

This code is based on the original kNN-LM repo: https://github.com/urvashik/knnlm

NOTE: Please review the documentation from the original repo before proceeding.

@inproceedings{khandelwal20generalization,
  title={{Generalization through Memorization: Nearest Neighbor Language Models}},
  author={Khandelwal, Urvashi and Levy, Omer and Jurafsky, Dan and Zettlemoyer, Luke and Lewis, Mike},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}

Contact Andrew Drozdov ([email protected]) with any questions.

Install Dependencies

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge -y
pip install --editable .
pip install faiss-cpu
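
After installing, it can help to confirm that the packages are importable before starting a long run. The snippet below is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and the module names `torch`, `faiss`, and `fairseq` are assumed to be the import names provided by the install commands above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the dependencies installed above.
required = ["torch", "faiss", "fairseq"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only checks that the modules can be located; it does not verify versions or CUDA support.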

Fast Evaluation

First run these steps from the original kNN-LM repo:

1. Prepare your data.
2. Train your model (or download a checkpoint).
3. Save the keys and values to a datastore, but use our code instead: it caches some additional properties (i.e. the next-token probabilities).
4. Build the faiss index.
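
To make the datastore idea concrete, here is a toy sketch under stated assumptions. It is not the repository's on-disk format: the `Datastore` class, its field names, and the brute-force `search` function are hypothetical stand-ins for the real datastore and the faiss index. Each entry pairs a context vector (key) with the token that followed it (value), plus the cached LM probability of that token.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Datastore:
    """Toy in-memory datastore (hypothetical layout, for illustration only)."""
    keys: List[Sequence[float]]  # context vectors produced by the LM
    values: List[int]            # id of the token that followed each context
    probs: List[float]           # cached LM probability of that token


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(dstore: Datastore, query: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Brute-force k-nearest-neighbor search, standing in for an index lookup.

    Returns (entry index, squared distance) pairs, closest first.
    """
    scored = [(squared_l2(key, query), i) for i, key in enumerate(dstore.keys)]
    scored.sort()
    return [(i, d) for d, i in scored[:k]]


# Three made-up entries with two-dimensional keys.
dstore = Datastore(
    keys=[[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
    values=[7, 7, 3],
    probs=[0.5, 0.25, 0.125],
)
neighbors = search(dstore, query=[0.9, 0.1], k=2)
for index, distance in neighbors:
    print(index, dstore.values[index], dstore.probs[index], distance)
```

A real datastore holds far too many entries for an exhaustive scan, which is why an approximate index is built in the last step.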

Then cache the neighbors and vector distances, and finally evaluate the model.

# We use the wiki_valid preset for convenience, but please double-check the file paths and replace them with your own.

python rq/fast_evaluate.py --preset wiki_valid --save_knns # Save the neighbors.
python rq/fast_evaluate.py --preset wiki_valid --save_exact # Save the exact vector distances.
python rq/fast_evaluate.py --preset wiki_valid --exact # Compute perplexity using exact vector distance.

# Note: The first two steps can be time-consuming, but the last step should run very fast.
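
The final step is fast because, once neighbors and exact distances are cached, evaluation reduces to simple arithmetic. The sketch below follows the standard kNN-LM formulation from Khandelwal et al. (2020): a softmax over negative distances gives the retrieval distribution, which is interpolated with the LM probability. It is an illustration only, not the code in `rq/fast_evaluate.py`; the function names, the `temperature` parameter, and all numbers are assumptions.

```python
import math
from typing import Dict, Sequence


def knn_distribution(distances: Sequence[float], values: Sequence[int],
                     temperature: float = 1.0) -> Dict[int, float]:
    """Softmax over negative distances, with mass summed per retrieved token id."""
    logits = [-d / temperature for d in distances]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    dist: Dict[int, float] = {}
    for weight, token in zip(weights, values):
        dist[token] = dist.get(token, 0.0) + weight / total
    return dist


def interpolate(p_lm: float, p_knn: float, lam: float) -> float:
    """Mix the retrieval and LM probabilities: lam * p_knn + (1 - lam) * p_lm."""
    return lam * p_knn + (1.0 - lam) * p_lm


def perplexity(token_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-probability of the target tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up cached neighbors for one position whose target token id is 7.
p_knn = knn_distribution(distances=[0.0, math.log(3.0)], values=[7, 3])
p_final = interpolate(p_lm=0.2, p_knn=p_knn.get(7, 0.0), lam=0.25)
print(p_final, perplexity([p_final, 0.5]))
```

Because the distances and neighbor ids are read from the cache, changing `lam` or `temperature` in a sketch like this requires no new retrieval.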

iesl/knnlm-retrieval-quality