Add support for regex queries on new wildcard field #54725

markharwood · 2020-04-03T15:29:25Z

This feature is largely about building good approximation queries on the ngram index to limit the number of documents that need verification using an automaton built from the regex.

Lucene's RegExp.toStringTree() method gives a good template for walking a parsed regex query's logic. Rather than building a string, we can walk the tree in the same way to build an approximation BooleanQuery on the 3gram index. This logic will have to walk a line between:

Being selective enough to efficiently narrow the set of documents considered, and
Not being so restrictive that it introduces false negatives (ignoring docs that should match).
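To make the trade-off concrete, here is a minimal sketch of the two-phase idea: a cheap approximation on 3-grams, followed by full verification. It is not the wildcard field's implementation. The class and method names are hypothetical, `String.contains` stands in for an ngram index lookup, and `java.util.regex` stands in for the Lucene automaton.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Lucene or Elasticsearch.
public class NgramApproximationSketch {

    // Split a literal into overlapping 3-character sequences, e.g. "abcd" -> [abc, bcd].
    public static List<String> trigrams(String literal) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= literal.length(); i++) {
            grams.add(literal.substring(i, i + 3));
        }
        return grams;
    }

    // Approximation phase: a document is a candidate if it contains every required 3-gram.
    // This may over-match, but it never rejects a document that contains the whole literal.
    public static boolean isCandidate(String doc, List<String> requiredGrams) {
        for (String gram : requiredGrams) {
            if (!doc.contains(gram)) {
                return false;
            }
        }
        return true;
    }

    // Verification phase: only candidates are checked against the full regex.
    public static List<String> search(List<String> docs, String requiredLiteral, Pattern regex) {
        List<String> grams = trigrams(requiredLiteral);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (isCandidate(doc, grams) && regex.matcher(doc).matches()) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foobar.txt", "oobfoo", "readme.md");
        // "oobfoo" contains both "foo" and "oob", so it survives the approximation
        // (a false positive) and is only rejected by the verification step.
        System.out.println(search(docs, "foob", Pattern.compile(".*foob.*"))); // prints [foobar.txt]
    }
}
```

The false positive on "oobfoo" is harmless because verification removes it. A false negative in the approximation could never be recovered, which is why the approximation has to err on the side of over-matching.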

elasticmachine · 2020-04-03T15:29:27Z

Pinging @elastic/es-search (:Search/Search)

markharwood · 2020-04-03T17:47:52Z

Hmm, I had something working in Eclipse, but the build process packages jars in a way that means I don't have the privileged access to RegExp internals :(
Will need to take another look next week.

failed to access class org.apache.lucene.util.automaton.RegExp$Kind from class org.apache.lucene.util.automaton.RegexDecoder$1 (org.apache.lucene.util.automaton.RegExp$Kind is in unnamed module of loader 'app'; org.apache.lucene.util.automaton.RegexDecoder$1 is in unnamed module of loader java.net.FactoryURLClassLoader @6fe04f2a)

markharwood · 2020-04-06T09:10:57Z

One workaround is to change Lucene's RegExp class. It already has a "toStringTree" method, and I propose adding a similar method that creates a BooleanQuery rather than a string representation. This query can be used as an approximation to limit the set of documents visited with an automaton - something like this:

public Query toApproximation(ExpressionMatcher matcher) {
   ... // walks tree of expressions calling matcher.createQuery() for char sequences
}

/**
* Provides Query objects to accelerate regex matching by looking for sequences 
* using an index (typically ngram-based).
*/
public interface ExpressionMatcher {
  /**
   * Returns a Query object that is guaranteed to find all documents containing the provided character sequence,
   * or null if this guarantee cannot be met. Over-matching (false positives) is possible, but false negatives must be avoided.
   * @param value A character sequence which is part of the regexp
   * @return a query object or null.
   */
  public Query createQuery(String value);
}
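As an illustration of the contract, here is a hedged sketch of what a 3-gram-backed matcher could look like. To stay self-contained it uses a generic stand-in for the proposed interface and returns a plain string such as "+abc +bcd" instead of a Lucene Query. `TrigramMatcher` and the string format are hypothetical, and returning null for sequences shorter than three characters is an assumption of this toy example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these types are not part of Lucene.
public class NgramMatcherSketch {

    /** Generic stand-in for the proposed interface so the sketch needs no Lucene types. */
    public interface ExpressionMatcher<Q> {
        /** Returns a query for the sequence, or null if no safe approximation exists. */
        Q createQuery(String value);
    }

    /** Toy matcher that requires every 3-gram of the sequence, rendered as "+abc +bcd". */
    public static class TrigramMatcher implements ExpressionMatcher<String> {
        @Override
        public String createQuery(String value) {
            // Assumption: a sequence shorter than one 3-gram cannot be looked up,
            // so the matcher declines rather than risk a false negative.
            if (value == null || value.length() < 3) {
                return null;
            }
            List<String> clauses = new ArrayList<>();
            for (int i = 0; i + 3 <= value.length(); i++) {
                clauses.add("+" + value.substring(i, i + 3));
            }
            return String.join(" ", clauses);
        }
    }

    public static void main(String[] args) {
        ExpressionMatcher<String> matcher = new TrigramMatcher();
        System.out.println(matcher.createQuery("abcd")); // prints +abc +bcd
        System.out.println(matcher.createQuery("ab"));   // prints null
    }
}
```

A caller walking the regex tree would treat a null result as "match everything" for that branch and rely on the automaton verification step to do the real filtering.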

romseygeek · 2020-04-08T11:09:31Z

Another option would be to add a more generic Regexp walker method, and refactor both the toStringTree() and toAutomaton() methods to use it. Something like:

public <T> T walkTree(Function<RegExp, T> handleRegexp);
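A rough sketch of how such a generic walker might behave, using a toy expression tree rather than Lucene's RegExp. Everything here is hypothetical: the `Node` and `Kind` types are stand-ins, and the handler is a `BiFunction` that also receives the already-computed child results, which is a variation on the single-argument signature suggested above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical illustration only; Node and Kind are toy stand-ins for Lucene's RegExp.
public class RegexWalkerSketch {

    public enum Kind { STRING, CONCATENATION, UNION }

    public static final class Node {
        public final Kind kind;
        public final String text;          // set only for STRING nodes
        public final List<Node> children;  // empty for STRING nodes

        private Node(Kind kind, String text, List<Node> children) {
            this.kind = kind;
            this.text = text;
            this.children = children;
        }

        public static Node string(String s) {
            return new Node(Kind.STRING, s, new ArrayList<Node>());
        }

        public static Node concat(Node... parts) {
            return new Node(Kind.CONCATENATION, null, Arrays.asList(parts));
        }

        public static Node union(Node... parts) {
            return new Node(Kind.UNION, null, Arrays.asList(parts));
        }

        // Bottom-up walk: fold the children first, then let the handler combine
        // this node with the results computed for its children.
        public <T> T walkTree(BiFunction<Node, List<T>, T> handler) {
            List<T> childResults = new ArrayList<>();
            for (Node child : children) {
                childResults.add(child.walkTree(handler));
            }
            return handler.apply(this, childResults);
        }
    }

    // One possible handler: render the tree as a boolean-style description.
    public static String describe(Node root) {
        return root.<String>walkTree((node, kids) -> {
            switch (node.kind) {
                case STRING:
                    return "\"" + node.text + "\"";
                case UNION:
                    return "(" + String.join(" OR ", kids) + ")";
                default:
                    return "(" + String.join(" AND ", kids) + ")";
            }
        });
    }

    public static void main(String[] args) {
        Node re = Node.concat(Node.string("foo"), Node.union(Node.string("bar"), Node.string("baz")));
        System.out.println(describe(re)); // prints ("foo" AND ("bar" OR "baz"))
    }
}
```

The same walk could produce a string, an automaton, or an approximation query, depending only on the handler passed in.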

markharwood added the >enhancement and :Search/Search labels Apr 3, 2020

markharwood self-assigned this Apr 3, 2020

markharwood mentioned this issue Apr 29, 2020

Add regex query support to wildcard field (approach 2) #55548

Merged

rjernst added the Team:Search label May 4, 2020

markharwood mentioned this issue May 15, 2020

Kibana Regex query authoring tools elastic/kibana#66735

Closed

markharwood closed this as completed in #55548 May 26, 2020

russcam mentioned this issue Jul 23, 2020

7.9.0 Meta ticket elastic/elasticsearch-net#4872

Closed

29 tasks
