Wildcard Query with Stop Words in Romanian Analyzer Generates MatchNoDocsQuery #114185

Closed
halilbulentorhon opened this issue Oct 5, 2024 · 6 comments · Fixed by #114264
Labels
:Search Relevance/Analysis (How text is split into tokens)
:Search Relevance/Ranking (Scoring, rescoring, rank evaluation.)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@halilbulentorhon
Contributor

Description

Hi,
With the Romanian analyzer, putting a wildcard on a stop word in a simple_query_string query generates a MatchNoDocsQuery, so no documents match. The expected behavior is for the stop word to be dropped from the query instead of producing a MatchNoDocsQuery. This seems similar to issue #1272.

Environment

{
  "name" : "7f11a5b18353",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "_XRGD9iLQP6BSlAcqjVV0A",
  "version" : {
    "number" : "8.1.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "39afaa3c0fe7db4869a161985e240bd7182d7a07",
    "build_date" : "2022-04-19T08:13:25.444693396Z",
    "build_snapshot" : false,
    "lucene_version" : "9.0.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

Steps to Reproduce:

  1. Create an index with the Romanian analyzer:

    PUT test_index
    {
      "settings": {
        "index": {
          "refresh_interval": "1s",
          "number_of_shards": "1",
          "analysis": {
            "analyzer": {
              "default": {
                "type": "romanian"
              }
            }
          },
          "number_of_replicas": "0"
        }
      },
      "mappings": {
        "properties": {
          "text": {
            "type": "text"
          }
        }
      }
    }
    
  2. Index a document containing the stop word "Pentru":

    PUT test_index/_doc/1
    {
      "text": "Textile Pentru Casă"
    }
    
  3. Search using simple_query_string with a wildcard:

    GET test_index/_search
    {
      "profile": true, 
      "query": {
        "simple_query_string": {
          "query": "(Textile*+Pentru*+Casă*)",
          "fields": ["text"],
          "analyze_wildcard": true
        }
      }
    }
    

Observed Behavior

The search returns no hits, and the query is rewritten to a MatchNoDocsQuery. The profile output shows:

{
  "type": "MatchNoDocsQuery",
  "description": "MatchNoDocsQuery('empty BooleanQuery')"
}

Expected Behavior

The stop word should be removed from the query, and the document should match as expected.

@elasticsearchmachine added the needs:triage label Oct 5, 2024
@benwtrent
Member

I replicated on latest with a simplified format.

GET test_index/_search
{
  "query": {
    "query_string": {
      "query": "Textile* Pentru* Casă*",
      "fields": ["text"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}

This query_string version works, so I think the bug is in how simple_query_string analyzes wildcards.

For example:

GET test_index/_search
{ 
  "query": {
    "simple_query_string": {
      "query": "Textile* Pentru* Casă*",
      "fields": ["text"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}

This is basically the same query and should return the same results, but it gets transformed into a match-none query.

FWIW, this is a general problem. Here is another failing example:

PUT test_index
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "english",
        "type": "text"
      }
    }
  }
}

POST test_index/_doc
{
  "text": "The Times"
}

GET test_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "the* time*",
      "fields": ["text"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}
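A minimal sketch of why the wildcard term vanishes, assuming a toy analysis chain (split on non-letters, lowercase, drop stop words). `StopAnalysisSketch`, `analyze`, and the short stop-word list are hypothetical names for illustration and not Elasticsearch or Lucene APIs; stemming is ignored, so "times" stays "times" here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopAnalysisSketch {
    // Hypothetical, tiny stop-word list; real analyzers ship longer ones.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");

    // Toy analysis chain: split on non-letters, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The text of "the*" and "time*" with the trailing '*' stripped.
        System.out.println(analyze("the"));   // prints []
        System.out.println(analyze("time"));  // prints [time]
    }
}
```

With `analyze_wildcard` enabled, the prefix text goes through analysis before the prefix query is built, so a term like `the*` can end up with no tokens at all to search for.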

@benwtrent added the :Search Relevance/Analysis and :Search Relevance/Ranking labels Oct 7, 2024
@elasticsearchmachine added the Team:Search Relevance label and removed the needs:triage label Oct 7, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@halilbulentorhon
Contributor Author

Hi,
Thank you for reviewing. I’m willing to fix this bug ASAP and will send a PR shortly.

@benwtrent
Member

The bug was introduced in https://issues.apache.org/jira/browse/LUCENE-10022

It was fixed in query_string, but never in simple_query_string: #35756

@benwtrent
Member

Test to replicate the bug:

public void testSimpleQueryStringWithAnalysisStopWords() throws Exception {
    // Map "body" as a text field analyzed with the built-in "stop" analyzer.
    String mapping = Strings.toString(
        XContentFactory.jsonBuilder()
            .startObject()
            .startObject("properties")
            .startObject("body")
            .field("type", "text")
            .field("analyzer", "stop")
            .endObject()
            .endObject()
            .endObject()
    );

    CreateIndexRequestBuilder mappingRequest = indicesAdmin().prepareCreate("test1").setMapping(mapping);
    mappingRequest.get();
    indexRandom(true, prepareIndex("test1").setId("1").setSource("body", "Some Text"));
    refresh();

    // "the*" is a stop word and should be dropped, leaving "text*" to match the document.
    assertHitCount(
        prepareSearch().setQuery(
            simpleQueryStringQuery("the* text*").analyzeWildcard(true).defaultOperator(Operator.AND).field("body")
        ),
        1
    );
}

This goes in SimpleQueryStringIT.

The fix should be about 3 lines of code.

@benwtrent
Member

@halilbulentorhon go ahead and open a PR :) The fix should be simple: add the test above, plus 3 lines in the simple query string parser to return null if the list of disjunctions is empty instead of creating an empty DisMax query.
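A rough, self-contained sketch of that idea. This is not the actual parser code: `EmptyDisjunctionSketch`, `analyzePrefix`, `prefixQuery`, and `parse` are hypothetical names, queries are modeled as plain strings, and the sketch assumes the caller skips null clauses when combining terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class EmptyDisjunctionSketch {
    // Hypothetical stop-word list for illustration only.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an");

    // Toy stand-in for analysis: a stop word yields no token at all.
    static List<String> analyzePrefix(String prefix) {
        String token = prefix.toLowerCase(Locale.ROOT);
        return STOP_WORDS.contains(token) ? List.of() : List.of(token);
    }

    // Builds one "field:token*" disjunct per field and analyzed token.
    static String prefixQuery(List<String> fields, String prefix, boolean returnNullWhenEmpty) {
        List<String> disjuncts = new ArrayList<>();
        for (String field : fields) {
            for (String token : analyzePrefix(prefix)) {
                disjuncts.add(field + ":" + token + "*");
            }
        }
        if (disjuncts.isEmpty() && returnNullWhenEmpty) {
            return null; // nothing left to search for, so let the caller drop the clause
        }
        return "dismax(" + String.join(" | ", disjuncts) + ")";
    }

    // AND-combines the clauses of a query whose terms all end in '*', skipping nulls.
    static String parse(String query, List<String> fields, boolean fixed) {
        List<String> must = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            String prefix = term.endsWith("*") ? term.substring(0, term.length() - 1) : term;
            String clause = prefixQuery(fields, prefix, fixed);
            if (clause != null) {
                must.add("+" + clause);
            }
        }
        return String.join(" ", must);
    }

    public static void main(String[] args) {
        List<String> fields = List.of("body");
        System.out.println(parse("the* text*", fields, false)); // prints +dismax() +dismax(body:text*)
        System.out.println(parse("the* text*", fields, true));  // prints +dismax(body:text*)
    }
}
```

In the first output the required clause `+dismax()` has no disjuncts and can never match, which is the toy equivalent of the whole query turning into match-none. Returning null removes that clause, so only `text*` constrains the result.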

benwtrent pushed a commit that referenced this issue Oct 8, 2024
…is empty (#114264)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes #114185
benwtrent pushed a commit to benwtrent/elasticsearch that referenced this issue Oct 8, 2024
…is empty (elastic#114264)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes elastic#114185

(cherry picked from commit 6955bc1)
elasticsearchmachine pushed a commit that referenced this issue Oct 8, 2024
…is empty (#114264) (#114355)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes #114185

(cherry picked from commit 6955bc1)

Co-authored-by: Halil Bülent Orhon <[email protected]>
elasticsearchmachine pushed a commit that referenced this issue Oct 9, 2024
…is empty (#114264) (#114354)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes #114185

(cherry picked from commit 6955bc1)

Co-authored-by: Halil Bülent Orhon <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this issue Oct 10, 2024
…is empty (elastic#114264)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes elastic#114185
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this issue Oct 13, 2024
…is empty (elastic#114264)

This change fixes analyzed wildcard query in simple_query_string when disjunctions is empty.

Closes elastic#114185
3 participants