
Add regex query support to wildcard field (approach 2) #55548

Merged
merged 14 commits on May 26, 2020

Conversation

markharwood
Contributor

@markharwood markharwood commented Apr 21, 2020

This is the second cut at extracting ngrams from automata for use in the approximation query.

Unlike #54946 this PR uses FiniteStringsIterator on a simplified form of automaton. The automaton simplification is achieved by two means:

  1. No expansions: rather than generating complex automata, a single null character is used in place of `*`-type expansions. This makes the job of extracting runs of concrete characters simple via FiniteStringsIterator, and was achieved by forking Lucene's RegExp and WildcardQuery logic for creating automatons (see the sketch after this list).
  2. Unified case: to avoid overly complex paths and ngram queries for mixed-case searches like [Ee][Nn][Cc][Oo][Dd], I have opted to lower-case the search inputs and the ngram index.
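As a rough illustration of the null-character trick, here is a minimal sketch with hypothetical names — not the PR's actual code, which forks Lucene's RegExp/WildcardQuery parsers rather than doing plain string surgery:

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardSimplification {
    static final char PLACEHOLDER = 0; // stands in for any '*'/'?' expansion

    // Collapse wildcard syntax to a placeholder so that runs of concrete
    // characters are trivial to extract, then lower-case to match the index.
    // Treating '?' as a placeholder loses length information but only adds
    // false positives, which the verification phase filters out.
    static List<String> concreteRuns(String wildcardPattern) {
        StringBuilder simplified = new StringBuilder();
        for (char c : wildcardPattern.toCharArray()) {
            simplified.append(c == '*' || c == '?' ? PLACEHOLDER : Character.toLowerCase(c));
        }
        List<String> runs = new ArrayList<>();
        for (String run : simplified.toString().split(String.valueOf(PLACEHOLDER))) {
            if (run.isEmpty() == false) {
                runs.add(run); // each run becomes a set of MUST ngram clauses
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(concreteRuns("Foo*Bar")); // prints [foo, bar]
    }
}
```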

Fuzzy queries remain a problem: they generate automatons that are too complex to map to an efficient BooleanQuery on the ngram index, so they are not supported.

Closes #54725

@markharwood markharwood added :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 v7.8.0 labels Apr 21, 2020
@markharwood markharwood self-assigned this Apr 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

Contributor

@jimczi jimczi left a comment


I left some comments but the approach looks good to me.

String[] patterns = { "*foobar", "foobar*", "foo*bar", "foo?bar", "?foo*bar?", "*c" };
for (String pattern : patterns) {
    Query wildcardFieldQuery = wildcardFieldType.fieldType().wildcardQuery(pattern, null, MOCK_QSC);
    assertTrue(wildcardFieldQuery instanceof BooleanQuery);
}
Contributor

I think it would help to have explicit checks on the clauses that we create, not just the fact that it's a boolean query. I know we have random tests that check the validity of these queries, but it's generally useful to also have simple tests that exercise the logic more carefully.
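For illustration, an explicit clause check could look like the following — a hypothetical sketch inside the existing test method, reusing its MOCK_QSC and field setup; the expected clause shape (one MUST clause per concrete run) is an assumption about the query built for "foo*bar":

```java
// Hypothetical: assert the shape of the approximation, not just its type.
BooleanQuery bq = (BooleanQuery) wildcardFieldType.fieldType().wildcardQuery("foo*bar", null, MOCK_QSC);
assertEquals(2, bq.clauses().size()); // assumed: one clause each for "foo" and "bar"
for (BooleanClause clause : bq.clauses()) {
    assertEquals(BooleanClause.Occur.MUST, clause.getOccur());
}
```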

Contributor

Thanks for adding the tests, can you do the same to test the logic of the fuzzy query?


@Override
public Query regexpQuery(String value, int flags, int maxDeterminizedStates, RewriteMethod method, QueryShardContext context) {
    if (context.allowExpensiveQueries() == false) {
        throw new ElasticsearchException("[regexp] queries cannot be executed when 'search.allow_expensive_queries' is set to false.");
    }
Contributor

I am not sure we should differentiate wildcard and regex; I think we should just ignore this setting for the wildcard field.

Contributor Author

Is it worth reserving it for patterns that don't produce an approximation query?

failIfNotIndexed();
RegExp regex = new RegExp(value, flags);
Automaton automaton = regex.toAutomaton(maxDeterminizedStates);
ApproximateRegExp ngramRegex = new ApproximateRegExp(toLowerCase(value), flags);
Contributor

I don't think we should lowercase eagerly here. Lowercasing can happen in the ApproximateRegExp directly, but I wonder if this should be tackled in a follow-up?

// determine a common section that all the paths share in order to simplify.
// So, we simplify up-front by:
// 1) replacing all repetitions e.g. (foo)* with a single invalid char in the regex string used to build the automaton
// 2) lowercasing all concrete values so searches like [Ee][Nn][Cc][Oo][Dd][Ee][Dd] don't fork myriad paths
Contributor

I'd like to see the solution without lowercasing the ngram index. We can try to optimize lowercased search in a follow-up, but for now we should ensure that we handle all types of automaton without relying on normalization?

Contributor Author

I think I need to understand better how an FSI approach can be made to work with one that tries to find articulation points in the automaton. The former relies on FSI to chase down all the possible paths, while the latter takes direct control of exploring the graph paths to understand the branching (like Nik's code).
If we try to keep life simple by using FSI only, then some up-front case normalisation is useful to avoid searches like the [Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll] example from this blog blowing the BooleanQuery limits.



private Query createApproximationQueryFromAutomaton(Automaton simplifiedAutomaton) {
FiniteStringsIterator iterator = new FiniteStringsIterator(simplifiedAutomaton);
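A minimal sketch of how the iterator's output might be folded into an approximation query — a hypothetical continuation of the method shown above, where `ngrams(...)` and `NGRAM_FIELD` are assumed helpers and the real PR adds match-all and verification handling on top:

```java
// Each finite path through the simplified automaton becomes one conjunction
// of ngram terms; the paths are then OR'ed together.
BooleanQuery.Builder approximation = new BooleanQuery.Builder();
for (IntsRef string = iterator.next(); string != null; string = iterator.next()) {
    String path = UnicodeUtil.newString(string.ints, string.offset, string.length);
    BooleanQuery.Builder pathQuery = new BooleanQuery.Builder();
    // assumed helper: splits on the placeholder char and ngrams each concrete run
    for (String ngram : ngrams(path)) {
        pathQuery.add(new TermQuery(new Term(NGRAM_FIELD, ngram)), BooleanClause.Occur.MUST);
    }
    approximation.add(pathQuery.build(), BooleanClause.Occur.SHOULD);
}
return approximation.build();
```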
Contributor

I think that's ok for a first iteration, but we should look at dividing the boolean query around the automaton's articulation points. That could significantly optimize the boolean query we produce for complex automata.

Contributor Author

Doesn't that take us back to Nik's code and exploring the transition points in the automaton to understand the graph? I'm not sure how FiniteStringsIterator works in conjunction with that sort of logic.

Contributor

We have an example in Lucene in QueryBuilder#analyzeGraphBoolean. I don't think we should do that in the first iteration though, so let's assume that we have to visit all the paths for now.
However, there is one thing we could try to optimize in this PR: the case where a few transitions share the same target. For instance, [aA][bB][cD] would create 8 paths; we could instead allow each path to contain a few code points per position (2, maybe 3 max) and fork FiniteStringsIterator to return an optimized IntsRef[][]? See the sketch below.
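To make the arithmetic concrete, here is an illustrative toy (not a proposed implementation — the `IntsRef[][]` idea is modelled with a plain `int[][]`):

```java
// [aA][bB][cD] as a compact per-position structure: FiniteStringsIterator
// would enumerate every concrete path, while the proposed fork would return
// the alternatives per position and leave expansion to the consumer.
public class PositionAlternatives {
    public static void main(String[] args) {
        int[][] positions = { {'a', 'A'}, {'b', 'B'}, {'c', 'D'} };
        long paths = 1;
        for (int[] alternatives : positions) {
            paths *= alternatives.length; // 2 alternatives at each position
        }
        System.out.println(paths + " concrete paths"); // 2 * 2 * 2 = 8
    }
}
```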

Contributor Author

I'm not sure why we wouldn't normalize case, given:

  • case is likely to be the biggest cause of logic-branching (based on the existing rules we've seen from the security space)
  • normalising reduces our code complexity and increases the number of expressions we can hope to accelerate
  • the ngram search just has to be fast, not 100% correct. Normalising will:
    • dramatically improve the speed of searches (each 3-gram of cased letters has 2³ = 8 case variants, so up to 8x fewer unique terms)
    • not increase false positives massively for most cases (gut feel on this)

Contributor

Sure, but I still think we should consider this change as a follow-up. This PR should ensure that we can handle all types of automaton correctly whether we lowercase or not.
Then we can discuss how to handle case-insensitive search, but that's a different discussion imo.

Contributor Author

Then we can discuss how to handle case-insensitive search, but that's a different discussion imo.

The idea behind normalisation in this context isn't about enabling case insensitivity for end users - it's about optimising the search performance and minimising the complexity of automatons.
There will inevitably be limits on the number of permutations we can consider and working with a lower-cased vocabulary will help reduce the numbers.

Contributor

That's assuming that lots of queries will contain [aA][bB] variations, but this could be avoided if we provide true case-insensitive search. Again, I am not saying we shouldn't do this "optimization", but to me this change deserves a separate discussion where we'd also think about exposing case-insensitive search on this field.

@markharwood markharwood requested a review from jimczi April 22, 2020 10:43
@markharwood
Contributor Author

I don't understand why we cannot extract @&~ in the first regexp and aaa AND bbb in the second one?

In the first example the regex is for matching everything except terms beginning with 'abc'. Currently we ignore any negatives and focus on creating MUST clauses for the positive elements of a regex.
In the second example I'm not sure what's going on. The RegExp fork I made currently produces an empty automaton, and the standard RegExp class produces an automaton with cycles, which means we can't use FSI on it directly.

@jimczi
Contributor

jimczi commented Apr 22, 2020

In the first example the regex is for matching everything except terms beginning with 'abc'.

I see it now, thanks.

In the second example I'm not sure what's going on.

I also missed the intersection (&) that is used, so it's ok that we rewrite to an empty automaton. However, I wonder if we should have special treatment for intersections that explicitly skips them (replacing the right and left sides with the special char) in the approximate regexp?

@markharwood markharwood force-pushed the fix/54725v2 branch 2 times, most recently from 6a4b49e to 52a908c Compare April 28, 2020 10:33
@markharwood
Contributor Author

markharwood commented Apr 28, 2020

@jimczi I've had a rethink on the approach to extracting queries from automatons. The original motivation was the hope that a single implementation could work for all query types (wildcard, regex, fuzzy). Life didn't turn out that simple for a number of reasons:

  1. Fuzzy queries were too complex to derive from automaton (too many permutations to represent as query terms). We opted to create ngram queries directly from the search string.
  2. We had to fork RegExp and WildcardQuery parsing logic to simplify the automatons we dealt with for regex and wildcard searches. The parsing logic for each class was modified to introduce null characters in place of too-complex expressions like negations or repetitions.

So we either side-stepped automatons completely (fuzzy) or, using forks of the regex and wildcard parsing logic, created alternatives to the automatons used in verification.

What I have in this PR is, I think, much simpler.

  1. Wildcard and prefix queries create simple ngram runs directly from the input string
  2. Fuzzy queries create ngram queries directly from the input string, with min-should-match settings that reflect the allowed edit distances and MUST clauses that respect the prefix-length settings (see the sketch after this list)
  3. The ApproximateRegExp fork of RegExp uses the regex parser logic to pull out BooleanQuery and TermQuery objects rather than having an interim step of generating automata. This preserves the original logic of the expression without additional translation steps.
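As a rough illustration of point 2 — a hedged sketch, not the PR's code: `ngrams(...)`, `NGRAM_FIELD` and `NGRAM_SIZE` are assumed helpers/constants, and the prefix-length MUST clauses are omitted:

```java
// Approximate a fuzzy query: a single edit can disturb at most NGRAM_SIZE
// overlapping ngrams, so we can tolerate that many missing grams per edit
// without risking false negatives.
class FuzzyApprox {
    static Query fuzzyApproximation(String lowercasedValue, int editDistance) {
        List<String> grams = ngrams(lowercasedValue); // assumed helper
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (String gram : grams) {
            bq.add(new TermQuery(new Term(NGRAM_FIELD, gram)), BooleanClause.Occur.SHOULD);
        }
        bq.setMinimumNumberShouldMatch(Math.max(1, grams.size() - editDistance * NGRAM_SIZE));
        return bq.build();
    }
}
```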

Regex complexity

The [Cc][Aa][Ss][Ee] type regex queries we know are out there waiting to hit us will be responsible for producing complex ngram queries. These can be translated in any of these ways:
a) Character-wise pairings e.g. +(C* OR c*) +(A* OR a*) +(S* OR s*) +(E* OR e*)
b) Sequence expansion e.g. cas OR Cas OR cAs OR caS OR CAs OR CAS OR CaS...
c) Ngram-size regexes eg /[Cc][Aa][Ss]/ AND /[Ss][Ee]_/

Obviously option b) has limits based on input-string length because of the possible permutations. Trimming is not an option: with OR lists, dropping any one clause would introduce false negatives, which is not allowed. It's an all-or-nothing approach to capturing the logic; you can't take a sample of tokens from across the range of the input sequence.
I have implemented option a). This is not great because of all the x* wildcard queries required, which are expensive and not very selective (see the sketch below).
Option c) is selective but could be more expensive to run than if we normalized the ngram index (see below).
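A sketch of what option a) amounts to, as statements inside a hypothetical query-building method (`NGRAM_FIELD` is an assumed constant):

```java
// Option (a) for [Cc][Aa][Ss][Ee]: one MUST clause per character class, each a
// disjunction of single-character prefix queries, i.e. +(C* c*) +(A* a*) ...
BooleanQuery.Builder query = new BooleanQuery.Builder();
for (char[] alternatives : new char[][] { {'C', 'c'}, {'A', 'a'}, {'S', 's'}, {'E', 'e'} }) {
    BooleanQuery.Builder position = new BooleanQuery.Builder();
    for (char alternative : alternatives) {
        // a one-character prefix query on an ngram index matches a huge number
        // of terms, which is why this shape is expensive and unselective
        position.add(new PrefixQuery(new Term(NGRAM_FIELD, String.valueOf(alternative))), BooleanClause.Occur.SHOULD);
    }
    query.add(position.build(), BooleanClause.Occur.MUST);
}
```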

Regex normalization

Given the above, I have opted to lowercase the ngram index, and the ApproximateRegExp class will optimise these character sequences so that the [Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll] killer query becomes the much cheaper and more selective powershell, which is ngrammed to +_po +owe +ers +she +ell_ +l__

@markharwood
Contributor Author

markharwood commented Apr 29, 2020

Updating this feature comparison table as of this PR.

| Feature | keyword | wildcard |
| --- | --- | --- |
| Sort speeds | Fast | Not quite as fast (see *1) |
| Aggregation speeds | Fast | Not quite as fast (see *1) |
| Prefix query speeds (foo*) | Fast | Not quite as fast (see *2) |
| Leading wildcard query speeds on low-cardinality fields (*foo) | Fast | Slower (see *3) |
| Leading wildcard query speeds on high-cardinality fields (*foo) | Terrible | Much faster |
| Term query, full value match (foo) | Fast | Not quite as fast (see *2) |
| Fuzzy query | Y (if allow expensive queries enabled) | Y |
| Regexp query | Y (if allow expensive queries enabled) | Y |
| Range query | Y (if allow expensive queries enabled) | N |
| Supports highlighting | Y | N |
| Searched by "all field" queries | Y | Y |
| Disk costs for mostly unique values | high | lower |
| Disk costs for mostly identical values | low | medium |
| Max character size for a field value | 256 for default JSON string mappings, 32,766 Lucene max | unlimited |
| Supports normalizers in mappings | Y | N (but mixed-case regex queries e.g. [Ff][Oo][Oo] should be quick) |

  • *1: somewhat slower, as doc values are retrieved from compressed blocks of 32
  • *2: somewhat slower, because approximate matches with ngrams need verification
  • *3: keyword field visits every unique value only once, but wildcard field assesses every occurrence of values

@markharwood
Contributor Author

markharwood commented Apr 30, 2020

@jpountz @jimczi
This is something that can be reviewed now as I have no major changes planned.
I've added acceleration for .* type queries so we skip the verification phase and revert to a plain match-all. A new MatchAllButRequireVerificationQuery class was introduced for the parser to signal where there's nothing we can do on the ngram index (e.g. for regex ..) but a verification query is still required.

Contributor

@jimczi jimczi left a comment

I reviewed the logic for the regex, wildcard and fuzzy queries.
I am not convinced that the approximate regexp should use Lucene queries as intermediate states, but you don't like visiting the Automaton either, so I left some comments on the approach.

// TODO match all was a nice assumption here for optimising .* but breaks
// for (a){0,3} which isn't a logical match all but empty string or up to 3 a's.
// result = new MatchAllDocsQuery();
result = new MatchAllButRequireVerificationQuery();
Contributor

Can you use an exists query instead of adding a new query?

new BooleanQuery.Builder()
  .add(fieldType.existsQuery(), Occur.SHOULD)
  .add(result)

?

Contributor Author

@markharwood markharwood May 4, 2020

I thought this class made the intention much clearer. Without it we'd have to interpret any FieldExistsQuery to mean "DO run a verification query" (e.g. for regex ..) and a MatchAllDocsQuery to mean "don't run a verification query" (e.g. for regex .*).
There are certain queries like .* where we know we can satisfy all criteria using the ngram index only. However, I thought today that we could extend this support by making MatchAllButRequireVerificationQuery a base class or marker interface. The regex a, for example, could also be satisfied using the ngram index only, so returning a form of term query that tests true for instanceof MatchAllButRequireVerificationQuery would help the WildcardField avoid running pointless verification for more than one regex type. A sketch of the idea follows.
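One possible shape of that marker idea — all names hypothetical, imports from org.apache.lucene.search assumed; this matches the later framing of "query clauses that require no verification step":

```java
// Hypothetical marker: approximation clauses whose ngram-index matches are
// already exact, so no automaton verification query needs to be run.
interface NoVerificationRequired {}

class QueryCombiner {
    static Query combine(Query approximation, Query verification) {
        if (approximation instanceof NoVerificationRequired) {
            // e.g. /.*/ or, with the proposed extension, the single-term regex /a/
            return approximation;
        }
        return new BooleanQuery.Builder()
            .add(approximation, BooleanClause.Occur.MUST) // cheap ngram candidate filter
            .add(verification, BooleanClause.Occur.MUST)  // exact automaton check
            .build();
    }
}
```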

Contributor

I see, but is it still the case if the ngram index is lowercased? The search is case-sensitive, so unless we differentiate characters eligible for case folding, we have to check every match?

Contributor Author

@markharwood markharwood May 4, 2020

We can see if the search value changes after normalisation and only optimise the query if there's no change. Perhaps a rare example of a user regex, but it illustrates the need to communicate where we think we can handle queries with the ngram index only.
Another example might be if we later index field lengths and can accelerate things like ??

Contributor

We can see if the search value changes after normalisation and only optimise the query if there's no change.
Perhaps a rare example of a user regex, but it illustrates the need to communicate where we think we can handle queries with the ngram index only.

I don't think this works, since values in documents would be lowercased too, so you'd need verification for any character that can be upper- and lower-cased? We can think about optimizations in the future but, as you said in previous comments, we should aim for correctness first.

case REGEXP_INTERVAL:
case REGEXP_EMPTY:
case REGEXP_AUTOMATON:
result = new MatchAllButRequireVerificationQuery();
Contributor

Can we use a marker rather than a full Lucene query?

Contributor Author

See my earlier comment re introducing a base class or marker interface for denoting query clauses that require no verification step. Thoughts?

/**
 * This class is a fork of Lucene's RegExp class and is used to create a simplified
 * Query for use in accelerating searches to find those documents likely to match the regex.
 */
Contributor

I don't know if it's a bad thing or not, but creating boolean queries upfront removes the simplification of the automaton.


@markharwood
Contributor Author

It's worth noting the logic that this PR (and any other approach) has to implement.
It is more important to be correct than fast with this ngram acceleration logic - we can't run the risk of introducing false negatives.
When it comes to converting regex clauses into ngram queries they fall into 3 categories:
A) Representable clauses we can search for e.g. foo
B) Verified match-all clauses e.g. .*
C) Unrepresentable clauses e.g. ...

These clauses can be surrounded by arbitrarily nested layers of AND/OR Boolean logic.
When it comes to avoiding false negatives, the category C clauses are the ones we have to pay special attention to. They require us to rewrite any Boolean logic that surrounds these expressions using these rules (in order of precedence):

  1. Any clause in an OR list which is a verified "match all" (e.g. .*) replaces all other clauses
  2. Any clause in an OR list which is unrepresentable will rewrite the entire list to be a single unrepresentable clause
  3. Any clause in an AND list which is unrepresentable can be dropped. An empty list is rewritten as an unrepresentable clause

I found implementing these rules easier to reason about using the hierarchy of RegExp/BooleanQuery objects rather than introducing an intermediate stage of an Automaton. The Automaton's graph of character-level transitions between states makes it harder to reason about the above Boolean rewriting logic.
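A compact sketch of those three rewrite rules — a toy model with hypothetical names; the PR itself applies them over its RegExp/BooleanQuery hierarchy rather than an enum:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: NGRAM is a representable clause (category A), MATCH_ALL is a
// verified match-all (category B), UNREPRESENTABLE is category C.
enum Kind { NGRAM, MATCH_ALL, UNREPRESENTABLE }

class RewriteRules {
    static Kind rewriteOr(List<Kind> clauses) {
        // Rule 1: a verified match-all swallows the whole OR list.
        if (clauses.contains(Kind.MATCH_ALL)) return Kind.MATCH_ALL;
        // Rule 2: one unrepresentable clause poisons the whole OR list;
        // dropping it instead would introduce false negatives.
        if (clauses.contains(Kind.UNREPRESENTABLE)) return Kind.UNREPRESENTABLE;
        return Kind.NGRAM;
    }

    static Kind rewriteAnd(List<Kind> clauses) {
        // Rule 3: unrepresentable clauses can simply be dropped from an AND
        // list; the surviving MUST clauses still match a superset of the true
        // results. An emptied list degrades to an unrepresentable clause.
        List<Kind> kept = new ArrayList<>(clauses);
        kept.removeIf(k -> k == Kind.UNREPRESENTABLE);
        if (kept.isEmpty()) return Kind.UNREPRESENTABLE;
        return kept.contains(Kind.NGRAM) ? Kind.NGRAM : Kind.MATCH_ALL;
    }
}
```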

Contributor

@jimczi jimczi left a comment

I left more comments


Contributor

@jimczi jimczi left a comment

I left minor comments but this looks ready to me. Thanks for iterating on this @markharwood!

… fuzzy, wildcard and prefix queries are all supported.

All queries use an approximation query backed by an automaton-based verification query.

Closes elastic#54275
…n wildcard.

Updated tests with simpler syntax and documented regexes that we’d like to improve on, showing current suboptimal queries and the future form we’d like to see.
* Unlimited prefix length.
* Delayed Automaton creation
* FuzzyQuery tests
…per class. All query simplification logic is now consolidated in the WildcardFieldMapper class.
@markharwood markharwood merged commit e1fb29c into elastic:master May 26, 2020
markharwood added a commit to markharwood/elasticsearch that referenced this pull request May 26, 2020
Adds equivalence for keyword field to the wildcard field. Regex, fuzzy, wildcard and prefix queries are all supported.
All queries use an approximation query backed by an automaton-based verification query.

Closes elastic#54275
markharwood added a commit that referenced this pull request May 26, 2020
Backport of #55548

Adds equivalence for keyword field to the wildcard field. Regex, fuzzy, wildcard and prefix queries are all supported.
All queries use an approximation query backed by an automaton-based verification query.

Closes #54275
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.9.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Add support for regex queries on new wildcard field
6 participants