Add regex query support to wildcard field (approach 2) #55548
Conversation
Pinging @elastic/es-search (:Search/Mapping)
I left some comments but the approach looks good to me.
String patterns[] = { "*foobar", "foobar*", "foo*bar", "foo?bar", "?foo*bar?", "*c"};
for (String pattern : patterns) {
    Query wildcardFieldQuery = wildcardFieldType.fieldType().wildcardQuery(pattern, null, MOCK_QSC);
    assertTrue(wildcardFieldQuery instanceof BooleanQuery);
I think it would help to have explicit checks on the clauses that we create, not just the fact that it's a boolean query. I know we have random tests that check the validity of these queries but that's generally useful to also have simple tests that test the logic more carefully.
Thanks for adding the tests, can you do the same to test the logic of the fuzzy query?
@Override
public Query regexpQuery(String value, int flags, int maxDeterminizedStates, RewriteMethod method, QueryShardContext context) {
    if (context.allowExpensiveQueries() == false) {
I am not sure we should differentiate wildcard and regex; I think we should just ignore this setting for the wildcard field.
Is it worth reserving it for patterns that don't produce an approximation query?
failIfNotIndexed();
RegExp regex = new RegExp(value, flags);
Automaton automaton = regex.toAutomaton(maxDeterminizedStates);
ApproximateRegExp ngramRegex = new ApproximateRegExp(toLowerCase(value), flags);
I don't think we should lowercase eagerly here. Lowercasing can happen in the ApproximateRegExp directly, but I wonder if this should be tackled in a follow-up?
// determine a common section all the paths shared in order to simplify.
// So, we simplify up-front by:
// 1) replacing all repetitions eg (foo)* with a single invalid char in the regex string used to build the automaton
// 2) lowercasing all concrete values so searches like [Ee][Nn][Cc][Oo][Dd][Ee][Dd] don't fork myriad paths
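As a rough illustration of step 2 only (the class name and scope here are hypothetical, not the PR's actual code), a two-character class like [Ee] whose members differ only by case can be collapsed to a single lowercase literal before the automaton is built:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSimplifier {
    // Hypothetical helper: collapse a two-character class whose members differ
    // only by case, e.g. [Ee], into one lowercase literal, so that
    // [Ee][Nn][Cc][Oo][Dd][Ee][Dd] simplifies to the single path "encoded".
    private static final Pattern CASE_PAIR = Pattern.compile("\\[(\\p{Alpha})(\\p{Alpha})\\]");

    public static String lowercaseCasePairs(String regex) {
        Matcher m = CASE_PAIR.matcher(regex);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            char a = m.group(1).charAt(0);
            char b = m.group(2).charAt(0);
            // Only collapse when the two alternatives are case variants of
            // the same character; leave genuine alternations like [xy] alone.
            if (Character.toLowerCase(a) == Character.toLowerCase(b)) {
                sb.append(regex, last, m.start()).append(Character.toLowerCase(a));
                last = m.end();
            }
        }
        return sb.append(regex.substring(last)).toString();
    }
}
```

After this rewrite the automaton for the simplified pattern has a single path per collapsed class instead of two, which is what keeps the downstream path enumeration tractable.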
I'd like to see the solution without lowercasing the ngram index. We can try to optimize lowercased search in a follow-up, but for now we should ensure that we handle all types of automaton without relying on normalization?
I think I need to understand better how an FSI (FiniteStringsIterator) approach can be made to work with an approach that tries to find articulation points in the automaton. The former relies on FSI to chase down all the possible paths, while the latter takes direct control of exploring the graph paths to understand the branching (like Nik's code).
If we try to keep life simple by using FSI only, then some up-front case normalisation is useful to avoid searches like the [Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll] example from this blog blowing the BooleanQuery limits.
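To see why that example blows up: each two-character class doubles the number of concrete paths the iterator has to enumerate, so [Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll] alone expands to 2^10 = 1024 strings, which already hits Lucene's default BooleanQuery clause limit of 1024. A toy path counter for patterns built from simple character classes (illustrative only, not the PR's code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathCount {
    // Each [..] character class multiplies the number of concrete strings a
    // FiniteStringsIterator-style enumeration of the automaton would produce.
    // Literal characters outside classes contribute a factor of 1.
    public static long countPaths(String regex) {
        long paths = 1;
        Matcher m = Pattern.compile("\\[([^\\]]+)\\]").matcher(regex);
        while (m.find()) {
            paths *= m.group(1).length();
        }
        return paths;
    }
}
```

Lowercasing up front collapses every such case pair to one alternative, reducing this count to a single path.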
private Query createApproximationQueryFromAutomaton(Automaton simplifiedAutomaton) {
    FiniteStringsIterator iterator = new FiniteStringsIterator(simplifiedAutomaton);
I think that's OK for a first iteration, but we should look at dividing the boolean query around the automaton's articulation points. That could significantly optimize the boolean query we produce for complex automata.
Doesn't that take us back to Nik's code and exploring the transition points in the automaton to understand the graph? I'm not sure how FiniteStringsIterator works in conjunction with that sort of logic.
We have an example in Lucene in QueryBuilder#analyzeGraphBoolean. I don't think we should do that in the first iteration though, so let's assume that we have to visit all the paths for now.
However, there is one thing we could try to optimize in this PR: the case where a few transitions have the same target. For instance [aA][bB][cD] would create 8 paths; we could instead allow each path to contain a few unicode code points per position (2, maybe 3 max) and fork FiniteStringsIterator to return an optimized IntsRef[][]?
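The difference being proposed can be sketched with hypothetical stand-in types (not the actual Lucene fork): the grouped form keeps one entry of a few code points per position, like an IntsRef[][], while the full expansion multiplies them out into one concrete string per combination.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupedPaths {
    // Full expansion: one concrete string per combination of per-position
    // alternatives, so {a,A}{b,B}{c,D} yields 2*2*2 = 8 paths. The grouped
    // representation is simply the int[][] input itself: 3 positions, each
    // holding a small set of code points.
    public static List<String> expand(int[][] positions) {
        List<String> out = new ArrayList<>();
        expand(positions, 0, new StringBuilder(), out);
        return out;
    }

    private static void expand(int[][] positions, int pos, StringBuilder prefix, List<String> out) {
        if (pos == positions.length) {
            out.add(prefix.toString());
            return;
        }
        for (int cp : positions[pos]) {
            prefix.appendCodePoint(cp);
            expand(positions, pos + 1, prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack (BMP code points)
        }
    }
}
```

With the grouped form, the downstream query could match any of the few code points per position instead of enumerating every combination as a separate clause.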
I'm not sure why we don't normalize case, given:
- case is likely to be the biggest cause of logic-branching (based on the existing rules we've seen from the security space)
- normalising reduces our code complexity and increases the number of expressions we can hope to accelerate
- the ngram search just has to be fast, not 100% correct. Normalising will:
  - dramatically increase the speed of searches (up to 8x fewer unique terms)
  - not increase false positives massively for most cases (gut feel on this)
Sure, but I still think we should consider this change as a follow-up. This PR should ensure that we can handle all types of automaton correctly, no matter whether we lowercase or not.
Then we can discuss how to handle case-insensitive search, but that's a different discussion imo.
Then we can discuss how to handle case-insensitive search, but that's a different discussion imo.

The idea behind normalisation in this context isn't about enabling case insensitivity for end users; it's about optimising search performance and minimising the complexity of automatons. There will inevitably be limits on the number of permutations we can consider, and working with a lower-cased vocabulary helps reduce the numbers.
That's assuming that lots of queries will contain [aA][bB] variations, but this could be avoided if we provide a true case-insensitive search. Again, I am not saying we shouldn't do this "optimization", but to me this change deserves a separate discussion where we'd also think about exposing case-insensitive search on this field.
In the first example the regex is for matching everything except terms beginning with 'abc'. Currently we ignore any negatives and focus on creating MUST clauses for the positive elements of a regex.
I see it now, thanks.
I also missed the intersection that is used.
@jimczi I've had a rethink on the approach to extracting queries from automatons. The original motivation was the hope that a single implementation could work for all query types (wildcard, regex, fuzzy). Life didn't turn out that simple for a number of reasons:
So we either side-stepped automatons completely (fuzzy) or had to create alternative ones to the automatons used in verification by using forks of the regex and wildcard parsing logic. What I have in this PR I think is much simplified.
Regex complexity

Obviously option b) has limits based on input string length because of the possible permutations. However, trimming is not an option: with OR lists, dropping any one clause will introduce false negatives, which is not allowed. It's an all-or-nothing approach to capturing the logic; you can't take a sample of tokens from across the range of the input sequence.

Regex normalization

Given the above, I have opted to use lowercasing on the ngram index, and the ApproximateRegExp class will optimise these character sequences.
Updating this feature comparison table as of this PR.
@jpountz @jimczi
I reviewed the logic for the regex, wildcard and fuzzy queries.
I am not convinced that the approximate regexp should use Lucene queries as intermediate states, but you don't like visiting the Automaton either, so I left some comments on the approach.
// TODO match all was a nice assumption here for optimising .* but breaks
// for (a){0,3} which isn't a logical match all but empty string or up to 3 a's.
// result = new MatchAllDocsQuery();
result = new MatchAllButRequireVerificationQuery();
Can you use an exists query instead of adding a new query? Something like:

new BooleanQuery.Builder()
    .add(fieldType.existsQuery(), Occur.SHOULD)
    .add(result)

?
I thought this class made the intention much clearer. Without it we'd have to interpret any FieldExistsQuery to mean "DO run a verification query" (e.g. for regex ..) and a MatchAllDocsQuery to mean "don't run a verification query" (e.g. for regex .*).
There are certain queries like .* where we know we can satisfy all criteria using the ngram index only. However, I thought today that we could extend this support by making MatchAllButRequireVerificationQuery a base class or marker interface. The regex a, for example, could also be satisfied using the ngram index only, so returning a form of term query that tested true for instanceof MatchAllButRequireVerificationQuery would help the WildcardField avoid running pointless verification for more than one regex type.
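The marker idea described above can be sketched with stand-in types (these are hypothetical, not the actual Lucene/Elasticsearch classes, and the sketch inverts the check to a "no verification needed" marker for clarity):

```java
public class VerificationMarker {
    // Stand-in for org.apache.lucene.search.Query; hypothetical type only.
    interface Query {}

    // Marker interface for approximation clauses that are fully satisfied by
    // the ngram index alone, so the verification automaton can be skipped.
    interface NoVerificationNeeded {}

    // Matches everything the approximation can say yes to, but each hit must
    // still be verified against the original pattern.
    static class MatchAllButRequireVerificationQuery implements Query {}

    // e.g. the regex "a": a plain term lookup on the ngram index suffices.
    static class ExactNgramQuery implements Query, NoVerificationNeeded {}

    public static boolean needsVerification(Query approximation) {
        return !(approximation instanceof NoVerificationNeeded);
    }
}
```

The advantage of a marker interface over instanceof checks on one concrete class is that any number of query types (term, prefix, length-based) could opt in to skipping verification.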
I see, but is it still the case if the ngram index is lowercased? The search is case-sensitive, so unless we differentiate characters eligible for case folding, we have to check every match?
We can see if the search value changes after normalisation and only optimise the query if there's no change. Perhaps a rare example of a user regex, but it illustrates the need to communicate where we think we can handle queries with the ngram index only.
Another example might be if we later index field lengths and can accelerate things like ??
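The eligibility check being described might look like this (a minimal sketch, assuming a simple string-level comparison; the class and method names are hypothetical):

```java
import java.util.Locale;

public class NormalizationCheck {
    // Hypothetical eligibility check: the ngram-only shortcut is only safe
    // when lowercasing the pattern is a no-op, i.e. the index's case folding
    // cannot change whether the pattern matches a stored value.
    public static boolean unchangedByNormalisation(String pattern) {
        return pattern.equals(pattern.toLowerCase(Locale.ROOT));
    }
}
```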
We can see if the search value changes after normalisation and only optimise the query if there's no change.
Perhaps a rare example of a user regex but illustrates the need to communicate where we think we can handle queries with the ngram index only.
I don't think this works, since values in documents would be lowercased too, so you'd need verification for any character that can be upper- and lower-cased. We can think about optimizations in the future, but as you said in previous comments we should aim for correctness first.
case REGEXP_INTERVAL:
case REGEXP_EMPTY:
case REGEXP_AUTOMATON:
    result = new MatchAllButRequireVerificationQuery();
Can we use a marker rather than a full Lucene query ?
See my earlier comment re introducing a base class or marker interface for denoting query clauses that require no verification step. Thoughts?
* This class is a fork of Lucene's RegExp class and is used to create a simplified
* Query for use in accelerating searches to find those documents likely to match the regex.
*
*/
I don't know if it's a bad thing or not but creating boolean queries upfront removes the simplification of the automaton.
It's worth noting the logic that this PR (and any other approach) has to implement. These clauses can be surrounded with arbitrary nesting of AND/OR Boolean logic.
I found implementing these rules easier to reason about using the hierarchy of RegExp/BooleanQuery objects rather than introducing an intermediate Automaton stage. The Automaton's graph of character-level transitions between states makes it harder to reason about the Boolean rewriting logic above.
I left more comments
I left minor comments but this looks ready to me. Thanks for iterating on this @markharwood!
… fuzzy, wildcard and prefix queries are all supported. All queries use an approximation query backed by automaton-based verification queries. Closes elastic#54275
…n wildcard. Updated tests with simpler syntax and documented regexes that we’d like to improve on, showing current suboptimal queries and the future form we’d like to see.
* Unlimited prefix length. * Delayed Automaton creation * FuzzyQuery tests
…per class. All query simplification logic is now consolidated in the WildcardFieldMapper class.
…til this PR is backported to 7x
Adds equivalence for the keyword field to the wildcard field. Regex, fuzzy, wildcard and prefix queries are all supported. All queries use an approximation query backed by automaton-based verification queries. Closes elastic#54275
This is the second cut at extracting ngrams from automata for use in the approximation query.
Unlike #54946, this PR uses FiniteStringsIterator on a simplified form of automaton. The automaton simplification is achieved by two means:
1) replacing all repetitions e.g. (foo)* with a single invalid char in the regex string used to build the automaton
2) lowercasing all concrete values so searches like [Ee][Nn][Cc][Oo][Dd] don't fork myriad paths
I have opted to lower-case the search inputs and the ngram index. Fuzzy queries remain a problem because they generate automatons that are too complex to map to an efficient BooleanQuery on the ngram index, so they are not supported.
Closes #54725
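For the simpler wildcard case, the approximation idea discussed in this thread can be put into a toy sketch: the concrete fragments between wildcards yield the ngrams an approximation query could require as MUST clauses, with the original automaton kept only for verifying candidate matches. The 3-gram size and all names here are assumptions for illustration, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardNgrams {
    // Extract the concrete fragments of a wildcard pattern (splitting on
    // * and ?) and return the ngrams an approximation query could require.
    // Fragments shorter than n contribute no ngrams and would need the
    // verification query to do all the work.
    public static List<String> requiredNgrams(String pattern, int n) {
        List<String> ngrams = new ArrayList<>();
        for (String fragment : pattern.split("[*?]+")) {
            for (int i = 0; i + n <= fragment.length(); i++) {
                ngrams.add(fragment.substring(i, i + n));
            }
        }
        return ngrams;
    }
}
```

For a pattern like foo*bar this yields the two required trigrams, while a pattern such as *c produces none, which is consistent with the thread's point that some patterns cannot be accelerated and must fall back to verification alone.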