Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wildcard field regex query fix. #78520

Closed
wants to merge 3 commits into from

Conversation

markharwood
Copy link
Contributor

Don’t revert to match_all when query only exists of required clauses that can’t be expressed as queries on ngram index.

Reverting to a match_all is an important optimisation e.g. if users type a* - we don't want to decompress all docs and run a regex on the contents when we can know this is literally a match all expression.

Closes #78391

@markharwood markharwood added >bug :Search/Search Search-related issues that do not fall into other categories labels Sep 30, 2021
@markharwood markharwood self-assigned this Sep 30, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Sep 30, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Don’t revert to match_all when query only exists of required clauses that can’t be expressed as queries on ngram index.

Closes elastic#78391
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the logic difficult to read to be honest. Is it really worth the complexity ?
I am afraid that we could shoot ourselves in the foot if we change the term queries and miss the usage of MatchAllButRequireVerificationQuery.

@markharwood
Copy link
Contributor Author

markharwood commented Oct 1, 2021

I find the logic difficult to read to be honest. Is it really worth the complexity ?

We already need this level of complexity to faithfully represent all of the logic in the regex when it comes to the construction of the ngram query. We cannot create an approximation query that under-matches or we get false negative bugs which often go undetected (unlike the false positive we spotted here).

Rolling back this optimisation won't reduce much of the complexity - we already need to know the difference between:

  1. Match all queries (.* etc)
  2. MatchAllButVerify (expressions in the regex we don't have an equivalent ngram query for) and
  3. All other "concrete" queries (expressions in the regex we can use as filters in the ngram approximation query)

If we find a 2) we must drop all other ORed 3) clauses to avoid false negatives.
There's no avoiding the complexity of this level of parsing.

The revert-to-match-all-with-no-verification optimisation is a comparatively small addition to the necessarily complex task of translating the regex to an ngram query.

@jimczi
Copy link
Contributor

jimczi commented Oct 1, 2021

Rolling back this optimisation won't reduce much of the complexity

I don't get that part. We can remove MatchAllButVerify if we decide to always run the verification. That's the simplification I am talking about.

If we find a 2) we must drop all other ORed 3) clauses to avoid false negatives.

The simplification here would be to add the match_all as an additional should clause and let the boolean query simplify this for us during rewrite ?

@markharwood
Copy link
Contributor Author

OK, I see what you mean.
However, it would be good to retain something that optimises for a* being a match all without verification - I imagine that's a pretty common mistake for those familiar with wildcard syntax.
@cbuescher suggested a simple string based filter (a regex that looks for simple match-all regexes). That might work but we already have logic with a deep understanding of the parsed expressions so there's also an argument for keeping the logic where it is.
I wonder if, as you suggest, we could let BooleanQuery do the rewrite and use MatchAllDocsQuery instead of MatchAllButVerify but do something sneaky. We can't subclass MatchAllDocsQuery because it is final but we could set a marker on it like a boost:12345 or something for the regex clauses that are genuine match alls like a* (as opposed to the clauses like a+ where we just don't have an equivalent ngram query).
We then let BooleanQuery rewrite all the logic and if the surviving query is a MatchAll with this special boost 12345 we assume it is safe to run without verification. Would that work?

@markharwood
Copy link
Contributor Author

Closed in favour of new approach in #78839

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Matching difference between keyword and wildcard field types
5 participants