Phrases in span_near #16796

Closed
cknv opened this issue Feb 24, 2016 · 11 comments
Labels
discuss >feature high hanging fruit :Search/Search Search-related issues that do not fall into other categories

Comments

@cknv

cknv commented Feb 24, 2016

I am using span_near queries and while I can put in span_terms and even span_or, I am missing a span_phrase clause in the query dsl.

As I understand it, a regular phrase clause is very similar (at least conceptually) to a span_near where the slop is 0 and the terms are the phrase run through the same analyzer that the field uses.

However: while I could construct that span_near and its span_terms by hand, I would have to be very careful making the terms (getting it close is easy; getting it right is hard) and I would only be able to cover one analyzer at a time, which is hardly ideal.
Alternatively, I could ask Elasticsearch to analyze the phrase for me (according to the field's analyzer) and then use the result in the query, but that would cost me an extra round trip. Not to mention that the query construction actually lives in a library that is currently blissfully unaware of the actual Elasticsearch nodes; it just builds the query.

So having Elasticsearch take the phrase and construct a Lucene SpanNearQuery, or something akin to that, would be very nice and save me a lot of trouble.

Maybe the dsl could look something like:

{
    "span_phrase": {
        "<field>": "<phrase>"
    }
}

Which I could then embed in my span_near, like any other span clause.
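For comparison, here is a minimal Python sketch of what has to be built by hand today for `term NEAR/4 "some phrase"`. It assumes the terms have already been analyzed; the field name `body` and the helper names `span_phrase` and `span_near` are hypothetical, chosen only for illustration.

```python
def span_phrase(field, terms):
    """Build a span_near matching `terms` adjacent and in order.

    `terms` must already be analyzed (lowercased, stemmed, ...) exactly
    as the field's analyzer would produce them.
    """
    clauses = [{"span_term": {field: term}} for term in terms]
    return {"span_near": {"clauses": clauses, "slop": 0, "in_order": True}}


def span_near(clauses, slop, in_order=False):
    """Wrap arbitrary span clauses in a span_near."""
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


# term NEAR/4 "some phrase" on a hypothetical field named "body"
query = span_near(
    [
        {"span_term": {"body": "term"}},
        span_phrase("body", ["some", "phrase"]),
    ],
    slop=4,
)
```

The inner span_near is what a span_phrase clause would generate server-side; the hard part is producing the `terms` list correctly.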

@clintongormley
Contributor

Hi @cknv

All of the span queries are term-level queries, i.e. the query string needs to be analyzed before the resulting terms are used with span queries. A phrase query would behave completely differently, as it would include analysis. Given that you already need to deal directly with terms, I'm not sure what a span_phrase query would buy you?

@cknv
Author

cknv commented Feb 29, 2016

I know that it would include analysis of the submitted text. That is the whole point. The tl;dr of it is that I have a DSL where I want to add the ability to search for phrases in near clauses, something like:

  • [term "some phrase"]~4, somewhat similar to your own proximity searches - although I had to change the grammar a bit.
  • term NEAR/4 "some phrase", sometimes called a near group.

These two examples are not the hardest to break down, but it is not hard to imagine worse cases.

Now, the problem is that I am reluctant to implement the logic to break down the phrase into terms in my own (python) code, as I can currently spot two options, both of which are not very good:

  • Ask Elasticsearch what those terms are before making my query, costing me an extra roundtrip (and in my specific case rewrite how I translate from my DSL to yours).
  • Mimic what the analyzer of the specific field does. This would probably be brittle because of subtle differences between the two analyzers.
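The first option can be sketched as follows. The function takes the JSON body returned by the `_analyze` API and turns its tokens into a span clause. The sample response is hand-written in the shape that API returns, and the field name `body` and the function name are hypothetical. This sketch ignores position gaps (e.g. removed stop words) and multiple tokens at the same position (e.g. synonyms), both of which a real implementation would need to handle.

```python
def span_clause_from_analyze(field, analyze_response):
    """Turn an _analyze response into a span_term or an exact-order span_near."""
    tokens = sorted(analyze_response["tokens"], key=lambda tok: tok["position"])
    terms = [tok["token"] for tok in tokens]
    if not terms:
        raise ValueError("analysis produced no tokens")
    if len(terms) == 1:
        return {"span_term": {field: terms[0]}}
    clauses = [{"span_term": {field: term}} for term in terms]
    return {"span_near": {"clauses": clauses, "slop": 0, "in_order": True}}


# Hand-written sample in the shape of an _analyze response for "Some Phrase".
sample_response = {
    "tokens": [
        {"token": "some", "start_offset": 0, "end_offset": 4,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "phrase", "start_offset": 5, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

clause = span_clause_from_analyze("body", sample_response)
```

The extra round trip is the request that produces `analyze_response`; everything after that is plain dictionary construction.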

My problem boils down to this: I can build a DSL on top of yours that supports phrases, terms, wildcards, etc. in the normal clauses that do not care much about positioning, but it becomes difficult when I want to add phrases to clauses that translate into span clauses.

However, inside Elasticsearch, you know what analyzer to use for a given field and can just use that directly; you can perhaps even figure out what to do if that analyzer differs across multiple indices (or maybe that is something Lucene would do better).

@clintongormley
Contributor

Hi @cknv

The bit I'm missing is this: you're already using span clauses, which are term based, so you already need to do the analysis to convert text to terms. e.g. to take your example:

 term NEAR/4 "some phrase"

Imagine you're using the english analyzer and terms has been stemmed to term. It is quite possible that your user will enter terms NEAR/4 "some phrase" which will find nothing unless you analyze terms -> term.

So you already have to deal with analysis for the words outside the phrase. Why would the words inside the phrase be any different?

I wonder if you shouldn't be looking at creating a plugin based on the surround query parser available in Lucene: https://lucene.apache.org/core/5_4_0/queryparser/index.html?org/apache/lucene/queryparser/surround/parser/package-summary.html

@mcuelenaere

I have a similar request.
Given the query "Foo-BAR" and dataset ["bar foo bar", "foo bar", "foo bar foo"] I want "foo bar" to rank the highest.

This can be implemented with a span_first wrapping a span_near with two terms, however I need to perform the analysis part client-side as there is no ES clause AFAIK that does this (except for match_phrase, which cannot be wrapped in a span_first though).
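A minimal Python sketch of that construction, assuming "Foo-BAR" has already been analyzed client-side into the terms `foo` and `bar`. The field name `title` and the helper name are hypothetical, and setting `end` to the number of terms (so the phrase must start at the first position) is an assumption about the intent.

```python
def span_first_phrase(field, terms):
    """Match `terms` as an exact phrase anchored at the start of the field."""
    phrase = {
        "span_near": {
            "clauses": [{"span_term": {field: term}} for term in terms],
            "slop": 0,
            "in_order": True,
        }
    }
    # `end` bounds the end position of the matched span, so a phrase of
    # N terms can only match if it begins at the first position.
    return {"span_first": {"match": phrase, "end": len(terms)}}


# Terms as they would come out of analysis for the input "Foo-BAR".
request_body = {"query": span_first_phrase("title", ["foo", "bar"])}
```

This matches both "foo bar" and "foo bar foo" but not "bar foo bar"; how the two matches rank against each other is left to scoring and is not addressed by this sketch.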

http://grokbase.com/t/gg/elasticsearch/12bv1ee7ah/forcing-analysis-of-terms-and-span-terms describes pretty much the same.

@cknv
Author

cknv commented Mar 15, 2016

It's true that I can deal with terms, but now that I think of it, that will break down if I ever decide to use stemming or a similar modification of the words that go into the index.
Currently I am mostly saved by my DSL that enables me to restrict the text I accept as terms, thus my need to change them is minimal, but actually I would need to do something and I realize now that my DSL is actually incomplete in certain edge cases.

I still think that part of the problem is having to replicate exactly what the different analyzers are doing, not to mention custom ones. As @mcuelenaere pointed out, it would help a lot if Elasticsearch provided the correct analysis of text into tokens. I am not sure how hard this is in Elasticsearch, but I hope it could give the span queries a little more ease of use, allowing developers to focus on whatever product we build on Elasticsearch instead of having to figure out how to do text analysis.

@clintongormley
Contributor

The span queries are low-level, term-oriented queries. They are building blocks that can be used to implement a custom query syntax, similar to the query_string query syntax, but more position-aware. This isn't going to change.

Really, exposing them via the query DSL is a bit of an anomaly. Normally they'd be used by a query parser written in Java and living on the server. Analysis is a vital part of the construction of queries which use span queries.

I think the solution here is to look for (or write) a custom query parser that supports operators like NEAR/4 etc.

@speedplane
Contributor

@clintongormley I agree that this is likely the solution. I use ES and did exactly this, see here for an example.

@clintongormley
Contributor

Related to #11328

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@javanna
Member

javanna commented Mar 16, 2018

@elastic/es-search-aggs

@jimczi
Contributor

jimczi commented Jun 1, 2018

We don't have plans to add a query parser for the span queries at the moment.
However we're currently working on integrating a new position-based query called intervals:
#29636
They are very similar to span queries and we are going to handle analysis in the API as described in the issue.
For this reason I am going to close this issue; progress on interval queries can be tracked in #29636 directly.
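For readers arriving later, here is a rough Python sketch of how `term NEAR/4 "some phrase"` might be expressed with the intervals query as it was eventually released. It rests on the assumption that each `match` rule analyzes its `query` text with the field's analyzer, so no client-side term splitting is needed. The field name `body` and the helper name are hypothetical, and `max_gaps` may not correspond exactly to span_near's `slop`.

```python
def intervals_near(field, term_text, phrase_text, max_gaps):
    """Build an intervals query: `term_text` within `max_gaps` of an exact phrase."""
    return {
        "intervals": {
            field: {
                "all_of": {
                    "ordered": False,      # term may come before or after the phrase
                    "max_gaps": max_gaps,
                    "intervals": [
                        {"match": {"query": term_text}},
                        # ordered with no gaps: behaves like a phrase
                        {"match": {"query": phrase_text,
                                   "max_gaps": 0,
                                   "ordered": True}},
                    ],
                }
            }
        }
    }


request_body = {"query": intervals_near("body", "terms", "some phrase", 4)}
```

The raw user text ("terms", "some phrase") goes into the request unmodified; the caller still has to split the input into its near-group structure, but not into analyzed terms.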

@jimczi jimczi closed this as completed Jun 1, 2018
@gideon-grossman

How does the interval query handle the issue described here? Don't we still need to break up our search text into phrases before passing them into the intervals query?

7 participants