
Insidious fastutil, FeatureVector, and RM3 bug: massive regression impact! #840

Closed
lintool opened this issue Oct 25, 2019 · 10 comments
lintool commented Oct 25, 2019

I was trying to upgrade fastutil from version 6.5.6 (an ancient version from Jun 14, 2013) to the latest, version 8.3.0, when I came across a really insidious multi-part bug. The tl;dr is that there's a bug in RM3, which will affect all regressions. Here's the full story:

The class FeatureVector is built around the fastutil Object2FloatOpenHashMap class, which is used by the RM3 implementation to estimate relevance models. In the current implementation, when estimating the relevance model for the feedback docs, we truncate each individual feedback document:

docVector.pruneToSize(fbTerms);

This is the first part of the bug. Just because we ultimately want to select fbTerms terms for feedback doesn't mean that we should consider only fbTerms terms from each document. This was probably done for performance reasons, but query latency really isn't affected. I checked: on my iMac Pro, latency doesn't increase with that line removed.

Now this leads to the second part of the bug: the method pruneToSize sorts the features by weight, but it doesn't consistently perform tie breaking. This means tie breaking is implementation specific, which means that the fastutil upgrade changed the tie-breaking behavior, which means that different terms are selected from documents, which changes the results.

Insert face palm here.

So to fix this, we need to:

  1. Not prune selection from individual docs.
  2. To prevent future issues along these lines, implement consistent tie-breaking behavior in the FeatureVector implementation.
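Both fixes can be illustrated with a minimal sketch (hypothetical names, not the actual FeatureVector/RM3 code): top-k selection breaks weight ties by lexicographic term order, so the result no longer depends on the hash map's iteration order, and the relevance model is accumulated over full, untruncated document vectors with pruning applied only once at the end.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual Anserini code.
public class Rm3Sketch {
  // Fix 2: deterministic top-k selection. Ties on weight are broken by the
  // term's lexicographic order, so the outcome no longer depends on the hash
  // map's iteration order (which is what changed across fastutil versions).
  public static Map<String, Float> pruneToSize(Map<String, Float> weights, int k) {
    Map<String, Float> pruned = new LinkedHashMap<>();
    weights.entrySet().stream()
        .sorted(Comparator.comparing((Map.Entry<String, Float> e) -> e.getValue())
            .reversed()
            .thenComparing(Map.Entry::getKey))
        .limit(k)
        .forEach(e -> pruned.put(e.getKey(), e.getValue()));
    return pruned;
  }

  // Fix 1: accumulate weights from *untruncated* document vectors and prune
  // only once, on the combined model -- no per-document pruneToSize(fbTerms).
  public static Map<String, Float> estimateModel(List<Map<String, Float>> docVectors,
                                                 int fbTerms) {
    Map<String, Float> combined = new LinkedHashMap<>();
    for (Map<String, Float> docVector : docVectors) {
      docVector.forEach((term, w) -> combined.merge(term, w, Float::sum));
    }
    return pruneToSize(combined, fbTerms);
  }
}
```

With pruning deferred to the combined model, a term that ranks low in every individual document can still accumulate enough mass across the feedback set to be selected.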
@lintool lintool self-assigned this Oct 25, 2019
@lintool lintool changed the title Insidious fastutil, FeatureVector, and RM3: massive regression impact! Insidious fastutil, FeatureVector, and RM3 bug: massive regression impact! Oct 25, 2019

daltonj commented Dec 10, 2019

Does this mean there are corrected Anserini RM3 results? And is this 'bug' actually a feature?


lintool commented Dec 10, 2019

Nope, I am loath to fix this bug because all the regression numbers will change slightly. Punting for now.


daltonj commented Dec 10, 2019

It doesn't seem right not to fix a bug because it would change numbers. Isn't this the correct, desired outcome of a bug fix? Fix the bug, update the tests...? It doesn't seem right to use / cite an RM3 implementation that is incorrect...?


lintool commented Dec 10, 2019

I agree, this should be fixed, but it's a question of priorities...

An additional consideration is that this fix will make a bunch of papers already published - both by Waterloo and others that have started to depend on Anserini - not reproducible on master branch. This will lead to a proliferation of different numbers as "baselines" - which will all be correct, just on different versions. Yes, I understand that a proliferation of slightly different numbers is inevitable, but I'd like to hold on as long as I can...

@arjenpdevries

I would have expected the effect to be small, since only low-impact terms are ignored in each document, and the tie-breaking behaviour is not really a bug but merely an undefined property of the whole algorithm. But you write "massive impact", so maybe the effect is not small?

I think the point by @daltonj is that what is now called RM3 appears not to implement RM3. I think I disagree with your last comment @lintool - I do not think that future papers should use a buggy implementation simply because previous papers did; future papers should get the algorithm they think they are using! It seems much more reasonable to have it fixed on master, and then have a branch for buggy-old-version-that-we-once-thought-implemented-RM3 for reproducibility purposes?


daltonj commented Dec 10, 2019

Can we also update this issue to quantify the impact on MAP and other standard metrics? How big is it? I expect the tie-breaking effect to be small. But what about the term selection issue?

I would be happy to do a code review as well as provide sample expansion term weights from the Galago implementation to compare against. 


lintool commented Dec 10, 2019

Sorry, to clarify - "massive regression impact" means that all the regression numbers for every collection will change (we now have 25 different collections that we have regressions for)... but the changes will be small. I will quantify.


lintool commented Dec 10, 2019

Okay, here are the results, on Robust04:

AP                          Paper 1   Paper 2
BM25+RM3 (default)          0.2903    0.2903
BM25+RM3 (default): fixed   0.2920    0.2920
BM25+RM3 (tuned)            0.3043    0.3021
BM25+RM3 (tuned): fixed     0.3004    0.2989

Note that the tuned "fixed" results use the old parameter settings, without retuning.

cf: https://github.com/castorini/anserini/blob/master/docs/experiments-forum2018.md

For the record, these are the commands:

python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
 --index lucene-index.robust04.pos+docvectors+rawdocs \
 --folds src/main/resources/fine_tuning/robust04-paper1-folds.json \
 --params src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json \
 --output run.robust04.bm25+rm3.paper1.txt


python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
 --index lucene-index.robust04.pos+docvectors+rawdocs \
 --folds src/main/resources/fine_tuning/robust04-paper2-folds.json \
 --params src/main/resources/fine_tuning/params/params.map.robust04-paper2-folds.bm25+rm3.json \
 --output run.robust04.bm25+rm3.paper2.txt


eval/trec_eval.9.0.4/trec_eval src/main/resources/topics-and-qrels/qrels.robust04.txt run.robust04.bm25+rm3.paper1.txt

eval/trec_eval.9.0.4/trec_eval src/main/resources/topics-and-qrels/qrels.robust04.txt run.robust04.bm25+rm3.paper2.txt

@arjenpdevries

So marginal differences, phew.


daltonj commented Dec 11, 2019

Thanks. I appreciate the fast turnaround.

Is there a corresponding pull request / diff to review the RM3 changes? Maybe I could take a stab at reviewing the RM3 implementation.

Beyond that, I would also like to try and sync other implementations to make sure they are consistent -- e.g. the QL + RM3 for Galago vs Anserini. The terms selected and weights should be "similar".
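A cross-implementation check like that could start from something as simple as the overlap between the expansion term sets the two systems produce for the same query. A hypothetical sketch (names are illustrative, not from Galago or Anserini):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: Jaccard overlap between the expansion term sets
// produced by two RM3 implementations (e.g. Galago vs. Anserini).
// A low overlap for the same query would flag a divergence worth inspecting.
public class ExpansionOverlap {
  public static double jaccard(Map<String, Float> a, Map<String, Float> b) {
    Set<String> inter = new HashSet<>(a.keySet());
    inter.retainAll(b.keySet());
    Set<String> union = new HashSet<>(a.keySet());
    union.addAll(b.keySet());
    // Two empty expansion models are treated as identical.
    return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
  }
}
```

Comparing the weights of the shared terms (e.g. by maximum absolute difference after normalization) would be the natural second step.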

lintool added a commit that referenced this issue Sep 18, 2022
Good time to fix this bug, given that Lucene 8->9 transition is already disruptive.
@lintool lintool closed this as completed Sep 18, 2022
crystina-z pushed a commit to crystina-z/anserini that referenced this issue Oct 28, 2022