minimum_should_match is not working for query_string query while searching on attachment field #34142

Closed
ropal opened this issue Sep 28, 2018 · 11 comments
Labels
>docs General docs changes :Search/Search Search-related issues that do not fall into other categories

Comments

@ropal

ropal commented Sep 28, 2018

Elasticsearch version (bin/elasticsearch --version): 6.1.2

Plugins installed: [ingest-attachment, analysis-phonetic]

JVM version (java -version): 1.8.181

OS version (uname -a if on a Unix-like system): Windows 2012R2

Description of the problem including expected versus actual behavior:
When a query_string query lists multiple fields in its fields attribute and one of them is the attachment content field, minimum_should_match does not behave as expected. For example, with the search text "foo bar" and minimum_should_match set to 100%, the query returns documents containing either foo or bar, which is not expected. If I specify the operator in the search text, as in "foo OR bar", with minimum_should_match set to 100%, it looks for both foo and bar in the document.

Steps to reproduce:

Query used:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "English" } },
        {
          "query_string": {
            "fields": [ "address", "message", "attachment*.content" ],
            "query": "foo bar",
            "minimum_should_match": "100%"
          }
        }
      ]
    }
  }
}

All the properties are mapped with the custom analyzer shown below.

"my_custom_analyzer": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "lowercase", "standard" ]
}

This works as expected if I remove the field "attachment*.content" from the list of fields.

@jimczi jimczi added >bug :Search/Search Search-related issues that do not fall into other categories labels Sep 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@cbismuth
Contributor

I'll have a look at this one.

@cbismuth
Contributor

I can reproduce the issue on master (see cURL recreation script here and output there). I'll give it a debug shot soon.

@jimczi
Contributor

jimczi commented Nov 26, 2018

I don't think your recreation is correct. The foo AND bar and foo OR bar cases work as expected with and without "minimum_should_match":"100%". The case that works differently is when you omit the operator and rely on the default operator set on the query (foo bar). Since the query_string doesn't split on whitespace anymore, we analyze foo bar entirely, create one query for each field and finally we merge the field queries into a single query. minimum_should_match is always applied on the final query so if you have two fields and use all default options you end up with a single query that is a disjunction over the two fields present in the query (below is an example with text and title):

((title:foo title:bar) | (text:foo text:bar))

So that's not really a bug, just a side effect of the fact that we don't consider whitespaces as operators anymore. Here is the same query if you use an explicit OR between the terms (e.g.: foo OR bar):

((title:foo | text:foo) (title:bar | text:bar))

One way to retrieve the old behavior is to add "type": "cross_fields" in the query_string to indicate that fields that have the same analyzer should be grouped together when we analyze the input. However, if you use different analyzers, then you have no other choice than adding explicit operators in the query to ensure that the minimum_should_match setting is applied to each term (e.g. foo OR bar).
I don't think there's anything to fix here, unfortunately, but we could add a small section in the documentation to explain the behavior of minimum_should_match when no explicit operators are present?
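
For readers who want to try the two workarounds mentioned above, here is a minimal Python sketch that only builds the request bodies and prints them; it does not contact a cluster. The helper names (`with_explicit_or`, `query_string_body`) are hypothetical, the field list is copied from the original report, and the OR-joining is deliberately naive (it does not handle quoted phrases or other query_string syntax).

```python
import json

# Field list copied from the original report.
FIELDS = ["address", "message", "attachment*.content"]


def with_explicit_or(text):
    """Join whitespace-separated terms with an explicit OR operator.

    Naive on purpose: quoted phrases and existing operators are not handled.
    """
    return " OR ".join(text.split())


def query_string_body(query, minimum_should_match="100%", query_type=None):
    """Build a query_string request body; the "type" key is added only if given."""
    clause = {
        "fields": FIELDS,
        "query": query,
        "minimum_should_match": minimum_should_match,
    }
    if query_type is not None:
        clause["type"] = query_type
    return {"query": {"query_string": clause}}


# Workaround 1: group fields that share an analyzer.
cross_fields = query_string_body("foo bar", query_type="cross_fields")

# Workaround 2: turn every term into its own clause with explicit operators.
explicit_or = query_string_body(with_explicit_or("foo bar"))

print(json.dumps(cross_fields, indent=2))
print(json.dumps(explicit_or, indent=2))
```

The first body only helps when the fields share an analyzer, as noted above; the second is the fallback when they do not.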

@jimczi jimczi added >docs General docs changes and removed >bug labels Nov 26, 2018
@cbismuth
Contributor

Thanks a lot for the very detailed explanation and examples @jimczi! That's a tricky area 😅

I need to play a bit more with the minimum_should_match parameter when no explicit operator is present. I'll suggest a documentation improvement as soon as I am comfortable with it. Thanks again.

@cbismuth
Contributor

cbismuth commented Nov 28, 2018

Alright, I've played a tiny bit more with this query.

You've probably explained it before and I wasn't able to understand, but here is my own understanding of the issue: in the updated recreation script, the minimum_should_match parameter is taken into account only when there is a single field in the fields array (any one of them); otherwise the minimum_should_match parameter is ignored.

Is it this side effect you talked about?

In the updated recreation, the query is made of the terms foo bar baz azerty qwerty elastic lucene.

Explain output with only one field (attachment.content) in query, no hit due to the ~4 operator:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "index1",
      "valid" : true,
      "explanation" : "+title:title1 +((attachment.content:foo attachment.content:bar attachment.content:baz attachment.content:azerty attachment.content:qwerty attachment.content:elastic attachment.content:lucene)~4)"
    }
  ]
}

Explain output with all three fields in query, one hit because no ~4 operator:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "index1",
      "valid" : true,
      "explanation" : "+title:title1 +((attachment.content:foo attachment.content:bar attachment.content:baz attachment.content:azerty attachment.content:qwerty attachment.content:elastic attachment.content:lucene) | (message:foo message:bar message:baz message:azerty message:qwerty message:elastic message:lucene) | (address:foo address:bar address:baz address:azerty address:qwerty address:elastic address:lucene))"
    }
  ]
}
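
A quick way to compare outputs like the two above is to look for the `~N` suffix that appears after a boolean group when a minimum-should-match constraint was applied. The sketch below is an illustration only: the function name `minimum_should_match_values` is hypothetical, the regular expression assumes the marker always directly follows a closing parenthesis (as it does in the two outputs above), and the explanation strings are rebuilt from the terms and fields listed in this comment rather than fetched from a cluster.

```python
import re

# Assumption: the constraint is printed as "~N" right after a closing parenthesis.
MSM_MARKER = re.compile(r"\)~(\d+)")

TERMS = "foo bar baz azerty qwerty elastic lucene".split()


def group(field):
    """Rebuild one per-field group, e.g. "(message:foo message:bar ...)"."""
    return "(" + " ".join(f"{field}:{term}" for term in TERMS) + ")"


def minimum_should_match_values(response):
    """Collect every ~N value found in the explanations of a response dict."""
    found = []
    for item in response.get("explanations", []):
        text = item.get("explanation", "")
        found.extend(int(n) for n in MSM_MARKER.findall(text))
    return found


# Same explanation strings as in the two outputs above.
single_field = {"explanations": [{
    "index": "index1",
    "explanation": "+title:title1 +(" + group("attachment.content") + "~4)",
}]}
all_fields = {"explanations": [{
    "index": "index1",
    "explanation": "+title:title1 +("
    + " | ".join(group(f) for f in ["attachment.content", "message", "address"])
    + ")",
}]}

print(minimum_should_match_values(single_field))
print(minimum_should_match_values(all_fields))
```

An empty list for the three-field case corresponds to the missing `~4` in the second output.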

@jimczi
Contributor

jimczi commented Nov 28, 2018

the minimum_should_match parameter is taken into account only when there is a single field in the fields array (any one of them); otherwise the minimum_should_match parameter is ignored.

No, it doesn't work in your recreation because you don't use explicit operators (OR, AND). Since whitespace is not considered an operator, we treat the whole text as a single clause and build the query for each field using this input. The final boolean query has a single clause (the disjunction max query over the fields), so we don't apply the minimum should match. If you have a single field, there is no ambiguity and we can apply the minimum should match even if there are no explicit operators. With multiple fields, operators are a way to separate the input into clauses, so foo AND bar is considered as two clauses no matter how many fields you have, and foo bar is a single clause for each field.

@cbismuth
Contributor

I've got it, thanks for the (very) quick and detailed answer @jimczi, I've played with explicit operators and it worked as I expected.

I should now have everything I need to suggest a documentation improvement 👍

For future readers of this issue, here is a short summary of these experiments.

  • Query foo OR bar and minimum_should_match set to 1 explained as +(((message:foo | address:foo) (message:bar | address:bar))~1)
  • Query foo bar and minimum_should_match set to 1 explained as +((message:foo message:bar) | (address:foo address:bar))
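
The two shapes summarized above can be reproduced with a small toy model. This is not how Elasticsearch builds queries; it is a hypothetical Python sketch (the names `field_query`, `clause_query` and `toy_explain` are made up) that only understands whitespace and an explicit `OR`, and that follows the rule described earlier in this thread: explicit operators split the input into clauses, and the minimum should match is applied only when there are several clauses or a single field.

```python
def field_query(field, terms):
    """One field over one clause: "f:a" or "(f:a f:b)"."""
    parts = [f"{field}:{term}" for term in terms]
    return parts[0] if len(parts) == 1 else "(" + " ".join(parts) + ")"


def clause_query(fields, terms):
    """Disjunction of the same clause over every field."""
    return " | ".join(field_query(field, terms) for field in fields)


def toy_explain(text, fields, minimum_should_match):
    """Return an explain-like string for the toy model."""
    # Explicit OR separates clauses; whitespace alone does not.
    clauses = [chunk.split() for chunk in text.split(" OR ")]
    if len(clauses) > 1:
        # Several clauses: the constraint applies to the top-level boolean.
        wrapped = []
        for terms in clauses:
            inner = clause_query(fields, terms)
            wrapped.append(f"({inner})" if len(fields) > 1 else inner)
        body = "(" + " ".join(wrapped) + f")~{minimum_should_match}"
    elif len(fields) == 1 and len(clauses[0]) > 1:
        # Single field: no ambiguity, the constraint applies to the terms.
        body = field_query(fields[0], clauses[0]) + f"~{minimum_should_match}"
    else:
        # Single clause over several fields: the constraint is not applied.
        body = clause_query(fields, clauses[0])
    return f"+({body})"


print(toy_explain("foo OR bar", ["message", "address"], 1))
print(toy_explain("foo bar", ["message", "address"], 1))
```

The first call yields the string from the first bullet and the second call yields the string from the second bullet.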

@cbismuth
Contributor

Hi, here is a documentation improvement proposal in PR #36109.

@rutuls

rutuls commented Jan 13, 2022

I am using the following query with explicit OR and AND, but it does not return the expected result. Here it is ambiguous which field minimum_should_match will be applied to. What am I missing?
{
  "query": {
    "query_string": {
      "query": "(entities.id.keyword:entity-21 OR entities.id.keyword:entity-19 OR entities.id.keyword:entity-45 OR entities.id.keyword:entity-68 OR entities.id.keyword:entity-73 OR entities.id.keyword:entity-78) AND processed.keyword:Y",
      "minimum_should_match": 1
    }
  }
}

@imotov
Contributor

imotov commented Jan 13, 2022

What am I missing ?

@rutuls we are using GitHub for bug reports and feature requests. The best place to ask questions like this would be our discussion forum. To get a timely response, please add the output of the command and explain what result you would have expected and how it differs from what you actually get.
