Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter #620

Merged
merged 7 commits into from
Mar 27, 2019

Conversation

romseygeek
Copy link
Contributor

We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:

  • PHRASE queries can miss matches because a longer matching sub-source is minimized
    away, leaving a gap
  • MAXGAPS queries can miss matches for the same reason
  • CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
    can miss matches if the 'big' interval gets minimized

The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.

Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.

@romseygeek romseygeek changed the title Automatically rewrite disjunctions when internal gaps matter LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter Mar 25, 2019
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. I am not sure if the rewriting should be the default option since minimum interval semantic is the basis of the paper and I have the feeling that we try to hide this from users. However I agree that missing matches can be problematic since users may never notice it so I am fine either way ;).

List<IntervalsSource> singletons = new ArrayList<>();
List<IntervalsSource> nonSingletons = new ArrayList<>();
for (IntervalsSource disj : source.pullUpDisjunctions()) {
if (disj.minExtent() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment explaining why we separate singleton source ?

@romseygeek romseygeek merged commit f1782d0 into apache:master Mar 27, 2019
asfgit pushed a commit that referenced this pull request Mar 27, 2019
…tter (#620)

We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:

* PHRASE queries can miss matches because a longer matching sub-source is minimized
  away, leaving a gap
* MAXGAPS queries can miss matches for the same reason
* CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
  can miss matches if the 'big' interval gets minimized

The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.

Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 12, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
jimczi added a commit to elastic/elasticsearch that referenced this pull request Aug 13, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 13, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 13, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this pull request Aug 13, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this pull request Aug 13, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
Since apache/lucene-solr#620, intervals disjunctions are automatically rewritten to handle cases where minimizations can miss valid matches.
This change updates the documentation to take this behaviour into account (users don't need to manually pull intervals disjunctions to the top anymore).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants