Highlight issue with elision analyzer: prefix before apostrophe is highlighted instead of the query #52264

Closed
lsamper opened this issue Feb 12, 2020 · 4 comments
Labels: >bug, :Search Relevance/Highlighting, Team:Search Relevance

lsamper commented Feb 12, 2020

Description of the problem including expected versus actual behavior:

When documents are indexed with an analyzer that uses the (French) elision filter:

If the document contains: l’avocat

And the query is: avocat

The highlight returns <em>l’</em> instead of <em>avocat</em>

More generally, the problem occurs when a document indexed with the elision filter contains:

  • an article from the elision articles list
  • followed by an apostrophe written as 'RIGHT SINGLE QUOTATION MARK' (U+2019) (it works correctly when the apostrophe is the plain single quote: ' )
  • followed by a term (here avocat in my example)

and the query to highlight is that term (avocat).

In that case the article is highlighted instead of the term.
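
The two apostrophes look nearly identical on screen but are different code points, which is easy to miss when copying test text. A small sketch that dumps the UTF-8 bytes of both spellings so the difference is visible (the variable names `ascii` and `curly` are only illustrative):

```shell
# U+0027 APOSTROPHE encodes as one UTF-8 byte (27);
# U+2019 RIGHT SINGLE QUOTATION MARK encodes as three bytes (e2 80 99).
# Octal escapes are used in printf so this also works in plain POSIX sh.
ascii=$(printf "l'avocat" | od -An -tx1 | tr -d ' \n')
curly=$(printf 'l\342\200\231avocat' | od -An -tx1 | tr -d ' \n')

echo "$ascii"   # 6c2761766f636174
echo "$curly"   # 6ce2809961766f636174
```

Running the same `od` pipeline on the text actually sent to Elasticsearch shows which of the two characters a given document contains.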

Steps to reproduce the bug:

# delete the index
curl --request DELETE --url http://localhost:9200/bug_highlight

# create an index 
curl --request PUT \
  --url http://localhost:9200/bug_highlight \
  --header 'content-type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "light": {
            "tokenizer": "standard",
            "filter": ["french_elision"]
          }
        },
        "filter": {
          "french_elision": {
            "type": "elision",
            "articles_case": "true",
            "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu"]
          }
        }
      }
    },
    "mappings": {
      "dynamic": "false",
      "properties": {
        "full_text": {
          "type": "text",
          "analyzer": "light"
        }
      }
    }
  }'

# Index one very small document
curl --request POST \
  --url http://localhost:9200/bug_highlight/_doc/small_doc \
  --header 'content-type: application/json' \
  --data '{ "full_text": "l’avocat"}'

# Query this document
curl --request GET \
  --url http://localhost:9200/bug_highlight/_search \
  --header 'content-type: application/json' \
  --data '{"query": {"bool": {"filter": [{"term": {"_id": "small_doc"}}]}}, "highlight": {"fields": {"full_text": {"highlight_query": {"match": {"full_text": {"query": "avocat"}}}, "number_of_fragments": 1, "fragment_size": 1}}}}'

# Index a longer document
curl --request POST \
  --url http://localhost:9200/bug_highlight/_doc/doc_int \
  --header 'content-type: application/json' \
  --data '{"full_text": "voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat l’avocat Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto "}'

# Query this document
curl --request GET \
  --url http://localhost:9200/bug_highlight/_search \
  --header 'content-type: application/json' \
  --data '{"query": {"bool": {"filter": [{"term": {"_id": "doc_int"}}]}}, "size": 1, "highlight": {"order": "score", "require_field_match": "false", "fields": {"full_text": {"highlight_query": {"match": {"full_text": {"query": "avocat"}}}, "number_of_fragments": 1, "fragment_size": 155}}}}'
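
Because highlighting works from the tokens and character offsets the analyzer emits, it may help to look at them directly with the `_analyze` API. A minimal sketch, assuming the `bug_highlight` index created above and the same `localhost:9200` endpoint; the helper names `analyze_body` and `analyze_text` are hypothetical and not part of Elasticsearch:

```shell
# Hypothetical helper: build the _analyze request body for the "light" analyzer.
# The text is inserted verbatim, so it must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "light", "text": "%s"}' "$1"
}

# Hypothetical helper: send the body to the index-scoped _analyze endpoint.
# Assumes Elasticsearch is listening on localhost:9200 and bug_highlight exists.
analyze_text() {
  curl --silent --request GET \
    --url http://localhost:9200/bug_highlight/_analyze \
    --header 'content-type: application/json' \
    --data "$(analyze_body "$1")"
}

# Usage (needs a running cluster):
#   analyze_text "l’avocat"   # apostrophe is U+2019
#   analyze_text "l'avocat"   # apostrophe is U+0027
```

Comparing the `token`, `start_offset` and `end_offset` values returned for the two spellings should show whether the offsets differ between them.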

Responses to the queries:

1: The very small document containing only l’avocat

 curl --request GET   --url http://localhost:9200/bug_highlight/_search   --header 'content-type: application/json'   --data '{"query": {"bool": {"filter": [{"term": {"_id": "small_doc"}}]}}, "highlight": {"fields": {"full_text": {"highlight_query": {"match": {"full_text": {"query": "avocat"}}}, "number_of_fragments": 1, "fragment_size": 1}}}}'
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.0,"hits":[{"_index":"bug_highlight","_type":"_doc","_id":"small_doc","_score":0.0,"_source":{ "full_text": "l’avocat"},"highlight":{"full_text":["<em>l’</em>"]}}]}}
2: A longer document with a more realistic highlight fragment
curl --request GET   --url http://localhost:9200/bug_highlight/_search   --header 'content-type: application/json'   --data '{"query": {"bool": {"filter": [{"term": {"_id": "doc_int"}}]}}, "size": 1, "highlight": {"order": "score", "require_field_match": "false", "fields": {"full_text": {"highlight_query": {"match": {"full_text": {"query": "avocat"}}}, "number_of_fragments": 1, "fragment_size": 155}}}}' | python -m json.tool | grep -A 3 '"highlight"'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1142  100   862  100   280   120k  40000 --:--:-- --:--:-- --:--:--  159k
                "highlight": {
                    "full_text": [
                        "Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat <em>l\u2019</em>"

Elasticsearch version (bin/elasticsearch --version):
Version: 7.6.0, Build: default/deb/7f634e9f44834fbc12724506cc1da681b0c3b1e3/2020-02-06T00:09:00.449973Z, JVM: 13.0.2

Plugins installed: []

JVM version (java -version):
openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Linux user-desktop 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@martijnvg martijnvg added the :Search Relevance/Highlighting How a query matched a document label Feb 13, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Highlighting)

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@joshuacords

This issue is also affecting us, hoping to see it addressed at some point.

@javanna javanna added the >bug label May 3, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova
Contributor

Closing this, as the issue is no longer present in Elasticsearch (v8.12). The highlight for the above request returns:

    "hits": [
      {
        "_index": "index1",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "full_text": "l’avocat"
        },
        "highlight": {
          "full_text": [
            "<em>l’avocat</em>"
          ]
        }
      },
      {
        "_index": "index1",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "full_text": "voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat l’avocat Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto "
        },
        "highlight": {
          "full_text": [
            "<em>l’avocat</em>"
          ]
        }
      }
    ]
  }

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024