Add support for regex queries on new wildcard field #54725

Closed
markharwood opened this issue Apr 3, 2020 · 4 comments · Fixed by #55548
Labels: >enhancement, :Search/Search, Team:Search

Comments

@markharwood
Contributor

This feature is largely about building good approximation queries on the ngram index to limit the number of documents that need verification using an automaton built from the regex.

Lucene's RegExp.toStringTree() method gives a good template for walking the logic of a parsed regex. Rather than building a string, we can do something similar that builds an approximation BooleanQuery on the 3-gram index. This logic will have to walk a line between:

  1. Being selective enough to efficiently narrow the set of documents considered, and
  2. Not being so restrictive that it introduces false negatives (excluding docs that should match).
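
The trade-off above can be illustrated with a minimal, self-contained sketch. It does not use Lucene: a toy in-memory map stands in for the 3-gram index, and java.util.regex stands in for the automaton used in the verification step. All names here (TrigramApproximation, candidates, search) are hypothetical, and passing the required literal separately is an assumption made to keep the example short.

```java
import java.util.*;
import java.util.regex.Pattern;

class TrigramApproximation {
    // All overlapping 3-character substrings of s; empty when s is shorter than 3.
    static List<String> trigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    // Toy inverted index: trigram -> ids of the docs that contain it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String gram : trigrams(docs.get(id))) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Approximation step: keep docs that contain every trigram of the literal.
    // A literal shorter than 3 chars yields no trigrams, so every doc stays a
    // candidate. That is less selective but cannot cause a false negative.
    static Set<Integer> candidates(String literal, Map<String, Set<Integer>> index, int docCount) {
        Set<Integer> result = new TreeSet<>();
        for (int id = 0; id < docCount; id++) {
            result.add(id);
        }
        for (String gram : trigrams(literal)) {
            result.retainAll(index.getOrDefault(gram, Collections.<Integer>emptySet()));
        }
        return result;
    }

    // Verification step: run the full regex only on the surviving candidates.
    static Set<Integer> search(String regex, String requiredLiteral, List<String> docs) {
        Map<String, Set<Integer>> index = buildIndex(docs);
        Pattern pattern = Pattern.compile(regex);
        Set<Integer> hits = new TreeSet<>();
        for (int id : candidates(requiredLiteral, index, docs.size())) {
            if (pattern.matcher(docs.get(id)).matches()) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

The approximation may over-match (a doc can contain all the trigrams without containing the literal itself), which is why the verification pass is still needed.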
@markharwood markharwood added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Apr 3, 2020
@markharwood markharwood self-assigned this Apr 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@markharwood
Contributor Author

markharwood commented Apr 3, 2020

Hmm, I had something working in Eclipse, but the build process packages jars in a way that means I don't have the privileged access to RegExp internals :(
Will need to take another look next week.

failed to access class org.apache.lucene.util.automaton.RegExp$Kind from class org.apache.lucene.util.automaton.RegexDecoder$1 (org.apache.lucene.util.automaton.RegExp$Kind is in unnamed module of loader 'app'; org.apache.lucene.util.automaton.RegexDecoder$1 is in unnamed module of loader java.net.FactoryURLClassLoader @6fe04f2a)

@markharwood
Contributor Author

One workaround is to change Lucene's RegExp class. It already has a toStringTree() method; I propose adding a similar method that creates a BooleanQuery rather than a string representation. This query can be used as an approximation to limit the set of documents that must be visited with an automaton. Something like this:

public Query toApproximation(ExpressionMatcher matcher) {
   ... // walks tree of expressions calling matcher.createQuery() for char sequences
}

/**
 * Provides Query objects to accelerate regex matching by looking for sequences
 * using an index (typically ngram-based).
 */
public interface ExpressionMatcher {
  /**
   * Returns a Query that is guaranteed to find all documents containing the provided character sequence,
   * or null if this guarantee cannot be met. Over-matching (false positives) is possible, but false
   * negatives must be avoided.
   * @param value a character sequence which is part of the regexp
   * @return a query object, or null
   */
  public Query createQuery(String value);
}
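
To make the null contract concrete, here is a minimal sketch of one possible matcher. It compiles without Lucene: SketchQuery and SketchExpressionMatcher are hypothetical stand-ins for Lucene's Query and the proposed ExpressionMatcher, and NgramExpressionMatcher is an illustrative name. It assumes the index holds every overlapping n-gram of the field value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Query: just the n-gram terms that must all be present.
class SketchQuery {
    final List<String> requiredTerms;

    SketchQuery(List<String> requiredTerms) {
        this.requiredTerms = requiredTerms;
    }
}

// Hypothetical stand-in mirroring the shape of the proposed ExpressionMatcher.
interface SketchExpressionMatcher {
    SketchQuery createQuery(String value);
}

class NgramExpressionMatcher implements SketchExpressionMatcher {
    private final int n;

    NgramExpressionMatcher(int n) {
        this.n = n;
    }

    @Override
    public SketchQuery createQuery(String value) {
        // A sequence shorter than n has no complete n-gram, so the index cannot
        // guarantee recall for it. Return null, as the contract asks.
        if (value.length() < n) {
            return null;
        }
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + n <= value.length(); i++) {
            terms.add(value.substring(i, i + n));
        }
        return new SketchQuery(terms);
    }
}
```

A caller walking the regex tree would presumably treat a null result as "no constraint from this sequence" rather than as "no matches", so that short sequences never exclude documents.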

@romseygeek
Contributor

Another option would be to add a more generic RegExp walker method, and refactor both the toStringTree() and toAutomaton() methods to use it. Something like:

public <T> T walkTree(Function<RegExp, T> handleRegexp);
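
A rough sketch of how one generic walker could serve several consumers follows. It uses a tiny hypothetical AST (RegexNode) rather than Lucene's RegExp, and it assumes a variant signature in which the handler also receives the already-computed results for the child nodes; the proposal above does not specify how child results are combined. The second consumer computes literals that every match must contain, which is one conservative basis for an approximation query.

```java
import java.util.*;
import java.util.function.BiFunction;

class RegexNode {
    enum Kind { STRING, CONCAT, UNION, REPEAT }

    final Kind kind;
    final String text;              // only set for STRING nodes
    final List<RegexNode> children; // empty for STRING nodes

    private RegexNode(Kind kind, String text, List<RegexNode> children) {
        this.kind = kind;
        this.text = text;
        this.children = children;
    }

    static RegexNode string(String s) { return new RegexNode(Kind.STRING, s, Collections.<RegexNode>emptyList()); }
    static RegexNode concat(RegexNode... c) { return new RegexNode(Kind.CONCAT, null, Arrays.asList(c)); }
    static RegexNode union(RegexNode... c) { return new RegexNode(Kind.UNION, null, Arrays.asList(c)); }
    static RegexNode repeat(RegexNode c) { return new RegexNode(Kind.REPEAT, null, Collections.singletonList(c)); }

    // Post-order walk: children are handled first, then the handler combines their results.
    <T> T walkTree(BiFunction<RegexNode, List<T>, T> handler) {
        List<T> results = new ArrayList<>();
        for (RegexNode child : children) {
            results.add(child.walkTree(handler));
        }
        return handler.apply(this, results);
    }

    // Consumer 1: a string rendering of the tree.
    String toStringTree() {
        return this.<String>walkTree((node, kids) ->
            node.kind == Kind.STRING ? "STRING(" + node.text + ")" : node.kind + kids.toString());
    }

    // Consumer 2: literals that every matching string must contain.
    Set<String> requiredLiterals() {
        return this.<Set<String>>walkTree((node, kids) -> {
            Set<String> out = new TreeSet<>();
            switch (node.kind) {
                case STRING:
                    out.add(node.text);
                    break;
                case CONCAT: // every part is required
                    for (Set<String> k : kids) out.addAll(k);
                    break;
                case UNION:  // only literals required by all branches (assumes at least one branch)
                    out.addAll(kids.get(0));
                    for (Set<String> k : kids) out.retainAll(k);
                    break;
                case REPEAT: // zero repetitions are allowed, so nothing is required
                    break;
            }
            return out;
        });
    }
}
```

For the tree representing foo(bar|baz)x*qux, requiredLiterals() keeps only foo and qux: the union branches disagree and the repeated x may be absent, so including either would risk false negatives.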
