-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter #620
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me. I am not sure if the rewriting should be the default option since minimum interval semantic is the basis of the paper and I have the feeling that we try to hide this from users. However I agree that missing matches can be problematic since users may never notice it so I am fine either way ;).
List<IntervalsSource> singletons = new ArrayList<>(); | ||
List<IntervalsSource> nonSingletons = new ArrayList<>(); | ||
for (IntervalsSource disj : source.pullUpDisjunctions()) { | ||
if (disj.minExtent() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment explaining why we separate singleton source ?
…tter (#620) We have a number of IntervalsSource implementations where automatic minimization of disjunctions can lead to surprising results: * PHRASE queries can miss matches because a longer matching sub-source is minimized away, leaving a gap * MAXGAPS queries can miss matches for the same reason * CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries can miss matches if the 'big' interval gets minimized The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting for each source that it would apply to. Because these rewritten queries will in general be less efficient due to the duplication of effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the original), we also add an option to Intervals.or() that will prevent this happening, so that consumers can choose speed over accuracy if it suits their usecase.
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches. This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:
away, leaving a gap
can miss matches if the 'big' interval gets minimized
The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that
PHRASE("a", OR(PHRASE("b", "c"), "c"))
is rewritten toOR(PHRASE("a", "b", "c"), PHRASE("a", "c"))
. To be able to do this generally, we need toadd a new
pullUpDisjunctions()
method toIntervalsSource
that performs this rewritingfor each source that it would apply to.
Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to
Intervals.or()
that will prevent this happening, so thatconsumers can choose speed over accuracy if it suits their usecase.