Highlighter breaks phrases #29561

Closed
jacool opened this issue Apr 17, 2018 · 42 comments · Fixed by #96068
Labels
>enhancement :Search Relevance/Highlighting How a query matched a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jacool

jacool commented Apr 17, 2018

Elasticsearch version (bin/elasticsearch --version):
6.2.3

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_161"

OS version (uname -a if on a Unix-like system):
Linux 5137c3a21142 4.9.87-linuxkit-aufs

Description of the problem including expected versus actual behavior:
The highlighter breaks a searched phrase into separate highlights, which makes the results quite annoying for users. In the example below, the expected highlight would look like this:
shuffled off <em>this mortal coil</em>, must give us
Note that while the unified highlighter has this issue, the FVH highlighter behaves as expected.

Steps to reproduce:

PUT /test
{
  "mappings": {
    "t": {
      "properties": {
        "message": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

POST /test/t/1
{
    "message": "What dreams may come, when we have shuffled off this mortal coil, must give us pause."
}

GET /test/_search
{
  "version": true,
  "query": {
    "match_phrase": {
      "message": {
        "query": "this mortal coil"
      }
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "unified",
        "fragment_size": 40
      }
    }
  }
}

This results in the following highlighting, which is practically unusable:

"highlight": {
  "message": [
     "dreams may come, when we have shuffled off <em>this</em>",
     "<em>mortal</em> <em>coil</em>, must give us pause."
   ]
}
@cbuescher cbuescher added the :Search Relevance/Highlighting How a query matched a document label Apr 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jimczi
Contributor

jimczi commented Apr 18, 2018

The unified highlighter uses fragment_size to build snippets, so if you set a size of 40 it will try to build snippets of that size. If you set a bigger size (or use the default value of 100), it will return the entire sentence. The unified highlighter detects sentences in the text to build snippets, but we added a mechanism to cut sentences that are bigger than fragment_size; otherwise you could end up with giant snippets when the sentence detection can't find a boundary in a big chunk of text.
I am going to close this issue because this is the expected behavior, and you can change it by specifying a bigger fragment_size. Also note that starting in 6.2, the unified highlighter can return more than one sentence if the whole fragment is smaller than the provided fragment_size:
#28132
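
To make the fragment_size behaviour above concrete, here is a minimal Python sketch that builds the search body from the original report with a configurable fragment size. The helper name `build_phrase_search` is hypothetical; the field name and query come from the report, and the default of 100 is the value mentioned above.

```python
import json


def build_phrase_search(phrase, field="message", fragment_size=100):
    """Build a match_phrase search body with unified highlighting.

    Hypothetical helper for illustration only; fragment_size=100 mirrors
    the default value mentioned in this comment.
    """
    return {
        "query": {"match_phrase": {field: {"query": phrase}}},
        "highlight": {
            "fields": {
                field: {"type": "unified", "fragment_size": fragment_size}
            }
        },
    }


if __name__ == "__main__":
    # The original report used fragment_size=40, which cut the sentence in two.
    print(json.dumps(build_phrase_search("this mortal coil"), indent=2))
```

A larger fragment size only avoids cutting the sentence; it does not change how individual terms inside the fragment are tagged.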

@jimczi jimczi closed this as completed Apr 18, 2018
@jacool
Author

jacool commented Apr 19, 2018

@jimczi I believe you've missed the nature of this report. It is not about the sentence nor its size. The complaint is pertinent to the query used: match_phrase. Once PHRASE search is performed customers reasonably expect the highlights not to break the phrase in the middle. In other words, there is only one such phrase in the text and it is not reasonable that two highlights would be returned. The fact is the FVH highlighter understands this and performs accordingly.

@jimczi
Contributor

jimczi commented Apr 19, 2018

Ok I understand better now, sorry @jacool I missed the match_phrase part.
I think this is legit and we should also have a way to highlight it as <em>this mortal coil</em> instead of annotating each word separately. I am reopening this issue and marking it as an enhancement. Thanks for the clarification!

@jacool
Author

jacool commented May 10, 2018

We really see this as a bug rather than an enhancement: when we search for a phrase and two highlights come back, we would erroneously assume the phrase appears twice in the text. Moreover, our customers would be disappointed if clicking on the second highlight brought them to the same point in the text they had visited a second ago by clicking on the first highlight.
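
The mismatch described here can be shown with a small Python sketch that compares the number of phrase occurrences in the source text with the number of tagged spans in the highlight output from the original report. The helper name `count_highlight_spans` is hypothetical; the text and fragments are copied from the report.

```python
import re


def count_highlight_spans(fragments, pre_tag="<em>", post_tag="</em>"):
    """Count tagged spans across all highlight fragments."""
    pattern = re.compile(
        re.escape(pre_tag) + r".*?" + re.escape(post_tag), re.DOTALL
    )
    return sum(len(pattern.findall(fragment)) for fragment in fragments)


source = (
    "What dreams may come, when we have shuffled off this mortal coil, "
    "must give us pause."
)
fragments = [
    "dreams may come, when we have shuffled off <em>this</em>",
    "<em>mortal</em> <em>coil</em>, must give us pause.",
]

phrase_occurrences = source.count("this mortal coil")
highlight_spans = count_highlight_spans(fragments)
print(phrase_occurrences, highlight_spans)  # prints "1 3"
```

One phrase occurrence yields three tagged spans spread over two fragments, which is why a UI that treats each span or fragment as a separate hit overcounts.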

@david-sitsky

I second that - this is a bug not an enhancement.

@byronvoorbach
Contributor

Any progress on this bug by any chance? :)

@jimczi
Contributor

jimczi commented Aug 1, 2018

The progress is in Lucene at the moment. We're iterating on a new API that is able to retrieve the positions and offsets of the query without introspection:
https://issues.apache.org/jira/browse/LUCENE-8229
This new API is able to preserve the blocks of positional queries (https://issues.apache.org/jira/browse/LUCENE-8306 and https://issues.apache.org/jira/browse/LUCENE-8404), so the main issue is now to integrate this work into the highlighters. The effort has started for the unified highlighter (https://issues.apache.org/jira/browse/LUCENE-8286), but a lot more work is needed.

@byronvoorbach
Contributor

Thanks for the update! @jimczi

@sergii-sakharov

All of the Lucene issues mentioned above are resolved as of Lucene 7.5. Does this imply that this issue will be resolved in e.g. Elastic 6.5?

@jimczi
Contributor

jimczi commented Sep 20, 2018

We're still discussing how we can introduce the new capabilities of the Lucene Matches API in Elasticsearch. The issues mentioned above are part of a bigger change that aims to make highlighters more accurate. I opened #33578 to discuss how we can introduce this new mode in the unified highlighter, but we're also exploring other possibilities. One of them is to introduce a new highlighter that would always use the Lucene Matches API. We'll update #33578 with the final decision, but I can't make any promises regarding the 6.5 release; ideally it will be supported in 6.x, which is the best I can say right now ;).

@sergii-sakharov

sergii-sakharov commented Sep 21, 2018

So for the time being FVH is the only option for phrase highlighting?

Coincidentally, I reproduced a bug where, even with FVH, the phrase is split into words and all word occurrences are highlighted. This happens with a "match_phrase_prefix" search when "max_expansions" is set high enough.
For us, even the default value of 50 causes the highlighter to fall back to this behaviour. With max_expansions set to 15 or lower, FVH works as expected and highlights whole phrases.

There is a support case in progress at the moment and that has more details for now...

@jimczi
Contributor

jimczi commented Sep 24, 2018

So for the time being FVH is the only option for phrase highlighting?

Well, as you already noticed, the FVH has other issues with positional queries. The unified highlighter made some progress regarding the handling of these queries, but it still splits phrase queries into individual terms. The real fix for this issue is the new Matches API, which we need to use in Elasticsearch to provide accurate highlighting. We discussed this internally and we think that a new highlighter is the better option toward this goal. I opened #34015 to track the progress and will work on a PR in the coming weeks.

@ORYLY

ORYLY commented Oct 26, 2018

Hi, I'd like to chime in with another comment about the sentence boundary scanner, because maybe the limitation will be resolved by the new highlighter in the works.

As a workaround, I've tried raising the fragment size to make it more likely that the highlighter will find a sentence boundary as the fragment's start. However, I've come across content that uses line breaks as boundaries instead of punctuation (e.g. bullet points that have been simplified to plaintext). In my corpus, it seems like a good call to include line breaks as sentence boundaries. But the highlighter API doesn't give me a way to fine-tune the underlying BreakIterator.

@NoorKhan
Contributor

NoorKhan commented Jan 9, 2019

FVH works well for my use case, but I've noticed when you have a query string query with wildcards, it doesn't highlight as you would expect.

Example:
Create mapping & document:

PUT /test/_doc/_mappings
{
    "_doc": {
      "properties": {
        "message": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
}

POST /test/_doc/1
{
    "message": "What dreams may come, when we have shuffled off this mortal coil, must give us pause."
}

Run search:

GET /test/_search
{
  "query": {
    "query_string": {
      "query": "message: \"dreams may come\""
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh",
        "fragment_size": 40
      }
    }
  },
  "_source": false
}

Highlight result:

"highlight": {
          "message": [
            "What <em>dreams may come</em>, when we have shuffled"
          ]
}

As you can see, the entire phrase is wrapped in my pre and post tags.

But if my query string query includes a wildcard and double quotes to surround the phrase, I get no hits:

GET /test/_search
{
  "version": true,
  "query": {
    "query_string": {
      "query": "message: \"dre?ms may come\""
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh",
        "fragment_size": 40
      }
    }
  },
  "_source": false
}
↓
"hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }

If I wrap the phrase in single quotes, it doesn't highlight the term with a wildcard, and it doesn't even seem to use it to perform the search:

GET /test/_search
{
  "version": true,
  "query": {
    "query_string": {
      "query": "message: 'dre?ms may come'"
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh",
        "fragment_size": 40
      }
    }
  },
  "_source": false
}

or (notice the wildcard is misspelled)

GET /test/_search
{
  "version": true,
  "query": {
    "query_string": {
      "query": "message: 'dre?mms may come'"
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh",
        "fragment_size": 40
      }
    }
  },
  "_source": false
}
↓
"highlight": {
          "message": [
            "What dreams <em>may</em> <em>come</em>, when we have shuffled"
          ]
}

If there are no quotes around the phrase, the highlight works as expected with a wildcard:

GET /test/_search
{
  "query": {
    "query_string": {
      "query": "message: dre?ms may come"
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh",
        "fragment_size": 40
      }
    }
  },
  "_source": false
}
↓
"highlight": {
          "message": [
            "What <em>dreams</em> <em>may</em> <em>come</em>, when we have shuffled"
          ]
}

Edit: I guess a wildcard inside a phrase in a query string query doesn't work at all. According to this: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html, double quoting for a phrase is the right thing to do.

I don't know if this is the relevant place to comment, but I was having the same problem: the non-FVH highlighter was wrapping the terms of a phrase individually instead of wrapping the entire phrase.
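
One possible alternative for a phrase that contains a wildcard, which nobody in this thread has tried, is to build the query from span queries instead of query_string. The Python sketch below is an assumption-laden illustration: the helper name `span_phrase_with_wildcards` is hypothetical, the terms are lowercased on the assumption that the field uses a lowercasing analyzer (span queries do not analyze their input), and how each highlighter renders the result is untested here.

```python
import json


def span_phrase_with_wildcards(field, terms, slop=0):
    """Build a span_near query whose clauses may contain wildcards.

    Hypothetical helper. Span queries are term-level, so each term is
    lowercased here under the assumption of a lowercasing analyzer.
    """
    clauses = []
    for term in terms:
        term = term.lower()
        if "?" in term or "*" in term:
            # A wildcard query has to be wrapped in span_multi to act as a span.
            clauses.append(
                {"span_multi": {"match": {"wildcard": {field: {"value": term}}}}}
            )
        else:
            clauses.append({"span_term": {field: term}})
    # in_order plus slop=0 approximates an exact phrase.
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": True}}


if __name__ == "__main__":
    query = span_phrase_with_wildcards("message", ["dre?ms", "may", "come"])
    print(json.dumps({"query": query}, indent=2))
```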

@mr-mos

mr-mos commented Jan 29, 2019

+1
"We really see this as a bug rather than enhancement. Because when we search for a phrase and two highlights come back we would erroneously assume the phrase appears twice in the text."

@EugeneHerasymchuk

Is there any workaround for now?

@nmilford

nmilford commented Jun 14, 2019

Just chiming in to keep it lively. We too have an open initiative that is affected by this. It would be delightful to see it addressed.

@elodeans

Likewise, my client engagement is focusing on this bug, which makes highlights look as if individual words were matched even when a compound term is searched, i.e. at a quick glance the user sees that each highlight matches only part of the compound term, and frankly it makes it look like search itself is not honoring quotes around a compound term.

@elodeans

elodeans commented Sep 5, 2019

Any updates on this defect?

@Noctis17

any updates about this? "unified" type still chops matched keywords instead of highlighting the whole phrase

@jimczi
Contributor

jimczi commented Sep 12, 2019

any updates about this?

Sorry, no ETA on this at the moment.

@Noctis17

Hello and good day! any updates about this?

@Noctis17

Noctis17 commented Dec 19, 2019

Hello and good day! Any updates about this ES highlight bug? using ES 7.0, highlight still splits my keywords by words

For example:
I searched for "December 2019 Christmas"

The result shows:
<em>December</em> <em>2019</em> <em>Christmas</em>

Instead of:
<em>December 2019 Christmas</em>

@Noctis17

Noctis17 commented Jan 7, 2020

Any fixes for this? Is it resolved if we upgrade our ES version?

@sloev

sloev commented Jan 28, 2020

i've used this workaround for when using exact search mode in python3:

$ cat utils.py
import re

def squash_phrase_highlight(
    highlighted_text,
    query,
    search_mode="exact",
    tag="\u0007",
    new_start_tag="\u0007",
    new_end_tag="\u0007",
):
    """
    This function makes up for ElasticSearch's missing ability to highlight 
    phrases according to the ongoing issue described here:
    
    example of issue:
        I searched for "December 2019 Christmas"

        The result shows:
        <em>December</em> <em>2019</em> <em>Christmas</em>

        Instead of:
        <em>December 2019 Christmas</em>
    
    so to fix that we:
    * return the text wrapped in the new tags if it contains no TAG at all
    * otherwise strip every TAG from the highlighted text
    * wrap each occurrence of the input query in new_start_tag/new_end_tag

    We only do this when search_mode == "exact", where we expect
    phrase functionality
    """
    if search_mode != "exact":
        return highlighted_text

    # Treat the tag as a literal string, consistent with .replace(tag, "") below
    if tag not in highlighted_text:
        return new_start_tag + highlighted_text + new_end_tag

    return re.sub(
        re.escape(query),
        lambda m: new_start_tag + m.group(0) + new_end_tag,
        highlighted_text.replace(tag, ""),
    )

some tests:

import utils
import json

def test_two_phrases_in_one_text():
    output_highlighted_text = utils.squash_phrase_highlight(
        (
            "frank says \u0007you\u0007 \u0007are\u0007 \u0007what\u0007 \u0007you\u0007 \u0007is\u0007 and i belive him. "
            "because \u0007you\u0007 \u0007are\u0007 \u0007what\u0007 \u0007you\u0007 \u0007is\u0007 is the truth"
        ),
        "you are what you is",
    )
    output_highlighted_text_json = json.dumps(output_highlighted_text)
    print(output_highlighted_text_json)
    assert (
        output_highlighted_text
        == "frank says \u0007you are what you is\u0007 and i belive him. because \u0007you are what you is\u0007 is the truth"
    )

def test_finding_one_phrase():
    output_highlighted_text = utils.squash_phrase_highlight(
        (
            "frank says \u0007you\u0007 \u0007are\u0007 \u0007what\u0007 \u0007you\u0007 \u0007is\u0007 and i belive him."
        ),
        "you are what you is",
    )
    output_highlighted_text_json = json.dumps(output_highlighted_text)
    print(output_highlighted_text_json)
    assert (
        output_highlighted_text
        == "frank says \u0007you are what you is\u0007 and i belive him."
    )

def test_phrase_equals_whole_field():
    output_highlighted_text = utils.squash_phrase_highlight("wonderful", "wonderful",)
    output_highlighted_text_json = json.dumps(output_highlighted_text)
    print(output_highlighted_text_json)
    assert output_highlighted_text == "\u0007wonderful\u0007"

i know it's a brutal hack, and it uses the same TAG for both ends, but i am sharing it here anyway
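
For setups that use distinct opening and closing tags, a variation on this workaround is to merge highlighted terms that are separated only by whitespace. The Python sketch below is only an illustration: the helper name `merge_adjacent_highlights` is hypothetical, and the default `<em>`/`</em>` tags are assumed from the examples earlier in this thread.

```python
import re


def merge_adjacent_highlights(text, pre_tag="<em>", post_tag="</em>"):
    """Merge highlighted terms that are separated only by whitespace."""
    # Match a closing tag, the whitespace gap, and the next opening tag,
    # then keep only the whitespace.
    gap = re.compile(re.escape(post_tag) + r"(\s+)" + re.escape(pre_tag))
    return gap.sub(r"\1", text)


if __name__ == "__main__":
    print(merge_adjacent_highlights(
        "<em>December</em> <em>2019</em> <em>Christmas</em>"
    ))
    # prints "<em>December 2019 Christmas</em>"
```

Unlike the query-based approach, this does not need the original query, but it also merges neighbouring terms that matched independently, it leaves terms separated by punctuation untouched, and it cannot rejoin a phrase that was split across two fragments.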

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@kaiz123

kaiz123 commented Jul 15, 2020

Any updates on this issue ? Has this been fixed yet ?

@kevinblanca1

kevinblanca1 commented Oct 19, 2020

Any updates on this issue ? Has this been fixed yet ? Clarifications?

@sharprjodi

What is the status on this issue? Was this ever addressed?

@FeschenkoAlex

Is there any chance that this will be ever fixed? We are getting the same issue on ES 7.7.1

@jhackett1

👀

@pedromoraesh

Did someone find any workaround for this issue?

E.g: highlight phrases with one tag and single words with another? Separated queries (one for a single word another for phrases)?

@pappagallos

It doesn't work even at ES 7.16.3. Does it work at ES 8.x?

@mecklund

mecklund commented Apr 7, 2022

I have gotten this to work, but I guess that I am benefitting from using 7.x and recently 8.x.
My solution to this exact problem is to do a highlight_query

I will show two examples, one with a search boolean and one without. I will use the original phrase used in this thread. I will also add in other stuff to show how it can be used alongside other familiar features.

highlight_query with search boolean:

"highlight": {
    "fields": {
        "contentHtml": {
            "type": "fvh",
            "pre_tags": "<highlight>",
            "post_tags": "</highlight>",
            "number_of_fragments": 0,
            "highlight_query": {
                "bool": {
                    "should": [
                        {
                            "match_phrase": {
                                "contentHtml": {
                                    "query": "this mortal coil"
                                }
                            }
                        },
                        {
                            "match": {
                                "contentHtml": {
                                    "query": "this mortal coil",
                                    "fuzzy_transpositions": true
                                }
                            }
                        }
                    ]
                }
            }
        },
        "content": {
            "type": "fvh",
            "fragment_size": 150,
            "highlight_query": {
                "bool": {
                    "should": [
                        {
                            "match_phrase": {
                                "content": {
                                    "query": "this mortal coil"
                                }
                            }
                        },
                        {
                            "match": {
                                "content": {
                                    "query": "this mortal coil",
                                    "fuzzy_transpositions": true
                                }
                            }
                        }
                    ]
                }
            }
        }
    },
    "pre_tags": "<b>",
    "post_tags": "</b>",
    "fragment_size": 150,
    "boundary_chars": "",
    "number_of_fragments": 1
}

Because this uses a query structure, the highlighter tries to highlight the entire phrase first, but if it can't, it will go for the individual words. NOTE: I am currently working on how to make the entire-phrase highlight work if there are HTML tags between the words (i.e. this mortal coil). It already ignores the HTML on indexing, but the highlighter can't get around this (yet). I ask for the entire document back on the contentHtml field so that I can then calculate the "byte offsets" of each highlighted hit. That way I can also relay additional pertinent data to the client (my unique situation).

Here is how you can do it without a search boolean:

"highlight": {
  "fields": {
      "contentHtml": {
          "type": "fvh",
          "pre_tags": "<highlight>",
          "post_tags": "</highlight>",
          "number_of_fragments": 0,
          "highlight_query": {
              "match_phrase": {
                  "contentHtml": {
                      "query": "this mortal coil"
                  }
              }
          }
      },
      "content": {
          "type": "fvh",
          "fragment_size": 150,
          "highlight_query": {
              "match_phrase": {
                  "content": {
                      "query": "this mortal coil"
                  }
              }      
          }
      }
  },
  "pre_tags": "<b>",
  "post_tags": "</b>",
  "fragment_size": 150,
  "boundary_chars": "",
  "number_of_fragments": 1
}

These fields use the html_strip analyzer. Sometimes these fields/documents can be really, really big for my content. I try to keep them as "type": "text" and NOT use keyword (32k limit)

Now, I realize that this might be a little slower (I really have no idea), but it's still pretty quick in my experience. In my index of 100k-700k documents I still get 300-400ms response times.
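
For anyone building these requests in code, here is a minimal Python sketch of the highlight_query approach above. The helper name `build_phrase_highlight` and its parameters are hypothetical; only the request keys (type, fragment_size, highlight_query, match_phrase, match, bool/should) are taken from the examples above.

```python
import json


def build_phrase_highlight(field, phrase, fallback_to_terms=True, fragment_size=150):
    """Build a highlight section that highlights `phrase` as a whole with FVH.

    Hypothetical helper. With fallback_to_terms=True a plain match query is
    added as a second "should" clause, as in the first example above.
    """
    phrase_query = {"match_phrase": {field: {"query": phrase}}}
    if fallback_to_terms:
        term_query = {"match": {field: {"query": phrase}}}
        highlight_query = {"bool": {"should": [phrase_query, term_query]}}
    else:
        highlight_query = phrase_query
    return {
        "fields": {
            field: {
                "type": "fvh",
                "fragment_size": fragment_size,
                "highlight_query": highlight_query,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_phrase_highlight("content", "this mortal coil"), indent=2))
```

As in the mappings shown earlier in this thread, the FVH highlighter needs the field to be indexed with "term_vector": "with_positions_offsets".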

@Verhaeg

Verhaeg commented Aug 25, 2022

Sorry for the long comment:

Complementing the issue at hand, I found some discrepancies in behavior for both types:

ES: 7.14

Mapping

# analyzer
"lowercase_no_accents": {
            "tokenizer": "standard",
            "filter": [
              "asciifolding",
              "lowercase"
            ]
          }
...
"content" : {
    "type" : "text",
    "index_phrases" : true,
    "term_vector" : "with_positions_offsets",
    "index_options" : "offsets",
    "analyzer": "lowercase_no_accents"
},

Unified

When searching for quoted text that contains a repeated term, sometimes only one occurrence is highlighted and the other(s) are not. This does NOT always happen, and when it does it usually involves connecting words.

Example in PT-BR:

"content": "A cessão não desnatura o sujeito passivo da obrigação tributária, cabendo ao cedente abater do preço da cessão o seu valor correspondente, uma vez que o critério material da hipótese de incidência do imposto de renda, como visto, é a aquisição da disponibilidade econômica ou jurídica de renda, embora o critério temporal ocorra somente com o pagamento."

# Query
{
    "query_string": {
        "query": "\"de renda, como visto, é a aquisição da disponibilidade econômica ou jurídica de renda\"",
        "fields": [
            "content"
        ],
        "default_operator": "AND",
        "phrase_slop": 1
    }
}
...
"highlight": {
    "type": "unified",
    "fields": {
      "content": {
        "number_of_fragments": 0,
        "pre_tags": "<mark>",
        "post_tags": "</mark>",
        "require_field_match": false
      }
    }
}

Output: (probably due to phrase_slop=1)

A cessão não desnatura o sujeito passivo da obrigação tributária, cabendo ao cedente abater do preço da cessão o seu valor correspondente, uma vez que o critério material da hipótese de incidência do imposto <mark>de</mark> <mark>renda</mark>, <mark>como</mark> <mark>visto</mark>, <mark>é</mark> <mark>a</mark> <mark>aquisição</mark> <mark>da</mark> <mark>disponibilidade</mark> <mark>econômica</mark> <mark>ou</mark> <mark>jurídica</mark> <mark>de</mark> <mark>renda</mark>, embora o critério temporal ocorra somente com o pagamento

The issue appears when changing the query string to:

"query_string": "\"passivo da obrigação tributária, cabendo ao cedente abater do preço da cessão\""

output:

A cessão não desnatura o sujeito <mark>passivo</mark> <mark>da</mark> <mark>obrigação</mark> <mark>tributária</mark>, <mark>cabendo</mark> <mark>ao</mark> <mark>cedente</mark> <mark>abater</mark> <mark>do</mark> <mark>preço</mark> da <mark>cessão</mark> o seu valor correspondente, uma vez que o critério material da hipótese de incidência do imposto de renda, como visto, é a aquisição da disponibilidade econômica ou jurídica de renda, embora o critério temporal ocorra somente com o pagamento

It didn't highlight the last "da".

With FVH highlighter it works as expected, creating one big markup.

But with FVH, proximity searches stop highlighting as expected.

Example:

"query_string": "\"preço cabendo\"~10"

With fvh, nothing is highlighted in the output, whereas unified highlights as expected.

@moise-g

moise-g commented Jan 24, 2023

This is pretty painful. Do you have any update on this pr? @romseygeek
#85677

@bhaveshpatel640

👀

@ScottCov

ScottCov commented Mar 3, 2023

Bueller????

@jade-lucas

Hi romseygeek, mayya-sharipova. It seems you two are the most knowledgeable about this, based on your activity in #85677. It seems like you are very close and it just needs to pass some final checks/tests? Any chance that this can make 8.8? It would be very beneficial. Thank you for all the work you have done so far to address it.

@mbushpilot2b

We are experiencing this issue too! This is now 5 years old, and we need a solution. Our clients are complaining about the same thing: a phrase is returned as multiple hits because it is split somewhere in the middle into two or more highlights. Could you please update us on a solution for this bug? @jimczi

@mbushpilot2b

mbushpilot2b commented May 10, 2023

So after some research, we were able to just use "type": "plain" as seen here:
[Screenshot 2023-05-10 at 3 02 43 PM]

With NO other configuration it works as expected: it does NOT split up a phrase into multiple "hits", and it includes a nice bit of surrounding text as a proper snippet. Documentation found here, see "type".

@legistek

So after some research, we were able to just use "type": "plain" as seen here: Screenshot 2023-05-10 at 3 02 43 PM

With NO other configurations and it works as expected, does NOT split on up a phrase into multiple "hits" and it has a nice bit of characters around it as a proper text snippet. Documentation found here see "type"

Having just dealt with this issue, I found FVH highlights properly (without breaking up phrases) but the other types do not. Not sure if there's been a recent change.

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024