Phrases in span_near #16796

cknv · 2016-02-24T14:09:48Z

I am using span_near queries and while I can put in span_terms and even span_or, I am missing a span_phrase clause in the query dsl.

As I understand it, a regular phrase clause is very similar (at least conceptually) to a span_near where the slop is 0 and the terms are the phrase after it has been run through the same analyzer that the field uses.
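To make the comparison concrete, here is a minimal Python sketch of that hand-built equivalent. The helper name phrase_as_span_near and the field name "body" are made up for illustration, and the terms are assumed to have already been analyzed exactly as the field's analyzer would produce them. It also ignores position gaps (for example, those left by removed stop words).

```python
def phrase_as_span_near(field, terms):
    """Build a span clause that matches `terms` adjacent and in order.

    Assumption: `terms` are already analyzed (lowercased, stemmed, ...)
    the same way the field's analyzer would produce them.
    """
    if not terms:
        raise ValueError("a phrase needs at least one term")
    clauses = [{"span_term": {field: term}} for term in terms]
    if len(clauses) == 1:
        # A one-word phrase needs no span_near wrapper.
        return clauses[0]
    return {"span_near": {"clauses": clauses, "slop": 0, "in_order": True}}


# "body" is a hypothetical field name.
query = phrase_as_span_near("body", ["some", "phrase"])
```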

However, while I could construct that span_near and its span_terms by hand, I would have to be very careful when producing the terms (getting it close is easy; getting it right is hard), and I would only be able to cover one analyzer at a time, which is hardly ideal.
Alternatively, I could ask Elasticsearch to analyze the phrase for me (according to the field's analyzer) and then use the result in the query, but that would cost me an extra round trip. On top of that, the query construction lives in a library that is currently blissfully unaware of the actual Elasticsearch nodes; it just builds the query.
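For reference, the extra round trip would go through the _analyze API. The sketch below only builds the request body and reads the response; it makes no network call. The helper names are hypothetical, and the sample response is hand-written to show the response shape rather than captured from a real cluster.

```python
def analyze_request_body(field, text):
    """Body for `GET /<index>/_analyze`, asking for the field's own analyzer."""
    return {"field": field, "text": text}


def terms_from_analyze_response(response):
    """Return the token strings from an _analyze response, in position order."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# Hand-written sample of the response shape (not real cluster output).
sample_response = {
    "tokens": [
        {"token": "some", "start_offset": 0, "end_offset": 4,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "phrase", "start_offset": 5, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

body = analyze_request_body("body", "Some Phrase")
terms = terms_from_analyze_response(sample_response)
```

The resulting terms could then be turned into span_term clauses, at the cost of one request per phrase before the real query is sent.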

So having Elasticsearch take the phrase and construct a Lucene SpanNearQuery, or something akin to that, would be very nice and would save me a lot of trouble.

Maybe the dsl could look something like:

{
    "span_phrase": {
        "<field>": "<phrase>"
    }
}

Which I could then embed in my span_near, like any other span clause.

clintongormley · 2016-02-28T22:19:18Z

Hi @cknv

All of the span queries are term-level queries, ie the query string needs to be analyzed before the resulting terms are used with span queries. A phrase query would behave completely differently as it would include analysis. Given that you already need to deal directly with terms, I'm not sure what a span_phrase query would buy you?

cknv · 2016-02-29T13:53:15Z

I know that it would include analysis of the submitted text. That is the whole point. The tl;dr is that I have a DSL where I want to add the ability to search for phrases in near clauses, something like:

[term "some phrase"]~4, somewhat similar to your own proximity searches - although I had to change the grammar a bit.
term NEAR/4 "some phrase", sometimes called a near group.

These two examples are not the hardest to break down, but it is not hard to imagine worse cases.
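As a rough illustration, the second example could be broken down into nested span clauses like this. The sketch assumes the field is called "body", that the terms need no further analysis, and that NEAR/4 means "within 4 positions, in either order"; all three are assumptions for the sake of the example.

```python
def span_term(field, term):
    return {"span_term": {field: term}}


def span_near(clauses, slop, in_order):
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


field = "body"  # hypothetical field name

# "some phrase": adjacent terms, in order.
phrase = span_near(
    [span_term(field, "some"), span_term(field, "phrase")],
    slop=0,
    in_order=True,
)

# term NEAR/4 "some phrase": the phrase is just another clause of the outer span_near.
near_group = span_near([span_term(field, "term"), phrase], slop=4, in_order=False)
```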

Now, the problem is that I am reluctant to implement the logic to break down the phrase into terms in my own (python) code, as I can currently spot two options, both of which are not very good:

Ask Elasticsearch what those terms are before making my query, costing me an extra round trip (and, in my specific case, a rewrite of how I translate from my DSL to yours).
Mimic what the analyzer of the specific field does. This would probably be brittle because of subtle differences between the two analyzers.

My problem boils down to this: I can build a DSL on top of yours that supports phrases, terms, wildcards, etc. in the normal clauses that do not care much about positioning, but it becomes difficult when I want to add phrases to clauses that translate into span clauses.

However, inside Elasticsearch you know which analyzer to use for a given field and can use it directly. You could perhaps even figure out what to do if that analyzer differs across multiple indices (or maybe that is something Lucene would do better).

clintongormley · 2016-03-02T09:09:06Z

Hi @cknv

The bit I'm missing is this: you're already using span clauses, which are term based, so you already need to do the analysis to convert text to terms. e.g. to take your example:

 term NEAR/4 "some phrase"

Imagine you're using the english analyzer and terms has been stemmed to term. It is quite possible that your user will enter terms NEAR/4 "some phrase" which will find nothing unless you analyze terms -> term.

So you already have to deal with analysis for the words outside the phrase. Why would the words inside the phrase be any different?

I wonder if you shouldn't be looking at creating a plugin based on the surround query parser available in Lucene: https://lucene.apache.org/core/5_4_0/queryparser/index.html?org/apache/lucene/queryparser/surround/parser/package-summary.html

mcuelenaere · 2016-03-10T14:50:03Z

I have a similar request.
Given the query "Foo-BAR" and dataset ["bar foo bar", "foo bar", "foo bar foo"] I want "foo bar" to rank the highest.

This can be implemented with a span_first wrapping a span_near with two terms, however I need to perform the analysis part client-side as there is no ES clause AFAIK that does this (except for match_phrase, which cannot be wrapped in a span_first though).
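A minimal sketch of that query shape, assuming the client has already analyzed "Foo-BAR" into the terms foo and bar. The helper name and the field name "title" are hypothetical, and setting end to the number of terms is an assumption that requires the phrase to start at the very first position of the field.

```python
def starts_with_phrase(field, terms):
    """span_first around a span_near: the phrase must begin the field.

    Assumption: `terms` were analyzed client-side and there are at least two.
    """
    near = {
        "span_near": {
            "clauses": [{"span_term": {field: t}} for t in terms],
            "slop": 0,
            "in_order": True,
        }
    }
    # `end` bounds where the matched span may end; len(terms) leaves no
    # room for tokens before the phrase.
    return {"span_first": {"match": near, "end": len(terms)}}


query = starts_with_phrase("title", ["foo", "bar"])
```

This only restricts where the phrase may occur; how the matching documents rank against each other still depends on scoring.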

http://grokbase.com/t/gg/elasticsearch/12bv1ee7ah/forcing-analysis-of-terms-and-span-terms describes pretty much the same.

cknv · 2016-03-15T10:45:01Z

It's true that I can deal with terms, but now that I think of it, that will break down if I ever decide to use stemming or a similar modification of the words that go into the index.
Currently I am mostly saved by my DSL, which lets me restrict the text I accept as terms, so my need to change them is minimal. But I would still need to do something, and I realize now that my DSL is incomplete in certain edge cases.

I still think that part of the problem is having to replicate exactly what the different analyzers do, not to mention custom ones. As @mcuelenaere pointed out, it would help a lot if Elasticsearch handled the analysis of text into tokens itself. I am not sure how hard this is to do in Elasticsearch, but I hope it could make the span queries a little easier to use, allowing developers to focus on whatever product they build on Elasticsearch instead of having to figure out how to do text analysis.

clintongormley · 2016-03-15T17:46:35Z

The span queries are low level term-oriented queries. They are building blocks that can be used to implement a custom query syntax, similar to the query_string query syntax, but more position aware. This isn't going to change.

Really exposing them via the query DSL is a bit of an anomaly. Normally they'd be used by a query parser written in Java and living on the server. Analysis is a vital part of the construction of queries which use span queries.

I think the solution here is to look for (or write) a custom query parser that supports operators like NEAR/4 etc.

speedplane · 2017-03-30T07:05:03Z

@clintongormley I agree that this is likely the solution. I use ES and did exactly this, see here for an example.

clintongormley · 2017-06-01T08:03:02Z

Related to #11328

javanna · 2018-03-16T11:10:55Z

@elastic/es-search-aggs

jimczi · 2018-06-01T13:49:20Z

We don't have plans to add a query parser for the span queries at the moment.
However, we're currently working on integrating a new position-based query called intervals:
#29636
They are very similar to span queries, and we are going to handle analysis in the API as described in the issue.
For this reason I am going to close this issue; progress on interval queries can be tracked in #29636 directly.
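For illustration, here is a hedged sketch of how the earlier term NEAR/4 "some phrase" example might look as an intervals query. It uses the match and all_of rules of the intervals query as it later shipped in Elasticsearch. The helper name, the field name "body", and the mapping of NEAR/4 to an unordered max_gaps of 4 are assumptions. The point is that raw text is passed in and analyzed on the server with the field's analyzer.

```python
def intervals_near(field, term_text, phrase_text, max_gaps):
    """Sketch: `term_text` within `max_gaps` positions of `phrase_text`."""
    return {
        "intervals": {
            field: {
                "all_of": {
                    "ordered": False,
                    "max_gaps": max_gaps,
                    "intervals": [
                        # Raw, unanalyzed text: analysis happens server-side.
                        {"match": {"query": term_text}},
                        # max_gaps 0 and ordered make this behave like a phrase.
                        {"match": {"query": phrase_text, "max_gaps": 0, "ordered": True}},
                    ],
                }
            }
        }
    }


query = intervals_near("body", "terms", "Some Phrase", 4)
```

The client still has to split its own syntax into a term part and a phrase part, but it no longer has to reproduce the analyzer.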

gideon-grossman · 2020-05-19T08:19:09Z

How does the interval query handle the issue described here? Don't we still need to break up our search text into phrases before passing them into the intervals query?

clintongormley added the feedback_needed and :Search Relevance/Analysis labels Feb 28, 2016

clintongormley added the >feature, high hanging fruit, discuss, and :Query DSL labels and removed the :Search Relevance/Analysis and feedback_needed labels Mar 15, 2016

clintongormley added the :Search/Search label and removed the :Query DSL label Feb 14, 2018

jimczi closed this as completed Jun 1, 2018
