Highlighter breaks phrases #29561
Pinging @elastic/es-search-aggs |
The |
@jimczi I believe you've missed the nature of this report. It is not about the sentence nor its size. The complaint is pertinent to the query used: match_phrase. Once PHRASE search is performed customers reasonably expect the highlights not to break the phrase in the middle. In other words, there is only one such phrase in the text and it is not reasonable that two highlights would be returned. The fact is the FVH highlighter understands this and performs accordingly. |
Ok I understand better now, sorry @jacool I missed the |
We really see this as a bug rather than enhancement. Because when we search for a phrase and two highlights come back we would erroneously assume the phrase appears twice in the text. Moreover, our customers would be disappointed if clicking on the second highlight brought them to the same point in the text they've visited a second ago by clicking on the first highlight. |
I second that - this is a bug not an enhancement. |
Any progress on this bug by any chance? :) |
The progress is happening in Lucene at the moment. We're iterating on a new API that can retrieve the positions and offsets of a query's matches without introspection: |
Thanks for the update! @jimczi |
All of the Lucene issues mentioned above are resolved as of Lucene 7.5. Does this imply that issue will be resolved in e.g. Elastic 6.5 ? |
We're still discussing how we can introduce the new capabilities of the Lucene Matches API in Elasticsearch. The issues mentioned below are part of a bigger change that aims to make highlighters more accurate. I opened #33578 to discuss how we can introduce this new mode in the |
So for the time being FVH is the only option for phrase highlighting? Coincidentally I reproduced a bug with splitting phrase into words and highlighting all word occurrences even with FVH. This happens in case of "match_phrase_prefix" search with "max_expansions" set to something high enough. There is a support case in progress at the moment and that has more details for now... |
Well as you noticed already the FVH has other issues with positional query. The |
Hi, I'd like to chime in with another comment about the sentence boundary scanner, because maybe the limitation will be resolved by the new highlighter in the works. As a workaround, I've tried raising the fragment size to make it more likely that the highlighter will find a sentence boundary as the fragment's start. However, I've come across content that uses line breaks as boundaries instead of punctuation (ex. bullet points that have been simplified to plaintext). In my corpus, it seems like a good call to include line breaks as sentence boundaries. But the highlighter API doesn't give me a way to fine tune the underlying BreakIterator. |
FVH works well for my use case, but I've noticed when you have a query string query with wildcards, it doesn't highlight as you would expect. Example:
Run search:
Highlight result:
As you can see, the entire phrase is wrapped in my pre and post tags. But if my query string query includes a wildcard and double quotes to surround the phrase, I get no hits:
If I wrap the phrase in single quotes, it doesn't highlight the term with a wildcard, and it doesn't even seem to use it to perform the search:
or (notice the wildcard is misspelled)
If there was no quotes around the phrase, the highlight works as expected with a wildcard:
Edit: I guess a wildcard inside a phrase doesn't work at all in a query string query. According to this: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html, double quoting for a phrase is the right thing to do. I don't know if this is the relevant place to comment, but I was having the same problem: the non-FVH highlighter was wrapping the terms of a phrase individually instead of wrapping the entire phrase. |
+1 |
Is there any workaround for now? |
Just chiming in to keep it lively. We too have an open initiative that is affected by this. It would be delightful to see it addressed. |
Likewise, my client engagement is focusing on this bug, which makes highlights look as if individual words were matched even when a compound term is searched. At a quick glance, the user sees each highlight covering only part of the compound term, and frankly it makes it look like the search itself is not honoring the quotes around a compound term. |
Any updates on this defect? |
any updates about this? "unified" type still chops matched keywords instead of highlighting the whole phrase |
Sorry, no ETA on this at the moment. |
Hello and good day! any updates about this? |
Hello and good day! Any updates about this ES highlight bug? Using ES 7.0, the highlighter still splits my keywords into individual words. For example: The result shows: Instead of: |
Any fixes about this? Is it resolved if we will upgrade our ES version? |
i've used this workaround for when using exact search mode in python3:
some tests:
I know it's a brutal hack, and it uses the same TAG for both ends, but I am sharing it here anyway. |
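The code from the comment above did not survive in this copy of the thread, so the following is not that author's hack. It is a minimal sketch of the same general idea, assuming the highlighter's default `<em>`/`</em>` tags: post-process each returned fragment and fuse highlights that are separated only by whitespace. The function name `merge_adjacent_highlights` is made up for illustration.

```python
import re


def merge_adjacent_highlights(fragment: str,
                              pre_tag: str = "<em>",
                              post_tag: str = "</em>") -> str:
    """Fuse highlights that are separated only by whitespace.

    Hypothetical client-side helper, not part of Elasticsearch.
    """
    # A "seam" is a closing tag, optional whitespace, then an opening tag.
    # Replacing the seam with just the whitespace joins the two highlights.
    seam = re.compile(re.escape(post_tag) + r"(\s*)" + re.escape(pre_tag))
    return seam.sub(r"\1", fragment)


if __name__ == "__main__":
    broken = "shuffled off <em>this</em> <em>mortal</em> <em>coil</em>, must give us"
    print(merge_adjacent_highlights(broken))
```

Note the limitation: this also merges neighbouring terms that matched independently rather than as a phrase, so it probably only makes sense when the query is known to be a phrase query.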
Any updates on this issue ? Has this been fixed yet ? |
Any updates on this issue ? Has this been fixed yet ? Clarifications? |
What is the status on this issue? Was this ever addressed? |
Is there any chance that this will be ever fixed? We are getting the same issue on ES 7.7.1 |
👀 |
Did someone find any workaround for this issue? E.g: highlight phrases with one tag and single words with another? Separated queries (one for a single word another for phrases)? |
It doesn't work even at ES 7.16.3. Does it work at ES 8.x? |
I have gotten this to work, but I guess that I am benefitting from using 7.x and recently 8.x. I will show two examples, one with a search boolean and one without. I will use the original phrase used in this thread. I will also add in other stuff to show how it can be used alongside other familiar features. highlight_query with search boolean:
Because this uses a query structure, the highlighter tries to highlight the entire phrase first, but if it can't, it falls back to the individual words. NOTE: I am currently working on how to make whole-phrase highlighting work when there are HTML tags between the words (i.e. this mortal coil). The HTML is already ignored at index time, but the highlighter can't get around it (yet). I ask for the entire document back on the contentHtml field so that I can then calculate the "byte offsets" of each highlighted hit. That way I can also relay additional pertinent data to the client (my unique situation). Here is how you can do it without a search boolean:
These fields use the html_strip analyzer. Sometimes these fields/documents can be really, really big for my content. I try to keep them as "type": "text" and NOT use keyword (32k limit) Now, I realize that this might be a little slower (I really have no idea), but it's still pretty quick in my experience. In my index of 100k-700k documents I still get 300-400ms response times. |
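The request bodies from the comment above are missing from this copy of the thread, so this is only a rough sketch of what a `highlight_query` with a phrase clause plus a fallback term clause can look like, written as a Python dict. The helper name `build_phrase_first_highlight` and the boost value are assumptions. The `contentHtml` field and the example phrase come from the thread. Whether this actually avoids split highlights on a given version and highlighter type is not verified here.

```python
import json


def build_phrase_first_highlight(field: str, phrase: str) -> dict:
    """Build a search body whose highlight_query prefers the whole phrase.

    Hypothetical helper; the boost of 10.0 is an arbitrary illustration.
    """
    return {
        "query": {"match": {field: phrase}},
        "highlight": {
            "fields": {
                field: {
                    # highlight_query lets the highlighter use a different
                    # query than the one used for searching.
                    "highlight_query": {
                        "bool": {
                            "should": [
                                {"match_phrase": {field: {"query": phrase, "boost": 10.0}}},
                                {"match": {field: phrase}},
                            ]
                        }
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_phrase_first_highlight("contentHtml", "this mortal coil"), indent=2))
```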
Sorry for the long comment: Complementing the issue at hand, I found some discrepancies in behavior for both types: ES: 7.14 Mapping
Unified When searching for a quoted text, if I have a repeating term, in the highlights only one would be marked, the other(s) won't. But NOT always, and usually with connection words. Example in PT-BR:
Output: (probably due to phrase_slop=1)
With issue by changing query string to:
output:
It didn't highlight the last With FVH highlighter it works as expected, creating one big markup. But with FVH, proximity searches stop highlighting as expected. Example:
Output with |
This is pretty painful. Do you have any update on this pr? @romseygeek |
👀 |
Bueller???? |
Hi romseygeek, mayya-sharipova. It seems you two are the most knowledgeable about this, based on your activity in #85677. It seems like you are very close and it just needs to pass some final checks/tests? Any chance that this can make 8.8? It would be very beneficial. Thank you for all the work you have done so far to address it. |
We are experiencing this issue too! This is now 5 years old; we need a solution. Our clients are complaining about the same thing: a single phrase is returned as multiple hits because it's getting split somewhere in the middle into two or more highlights. Could you please update us on a solution for this bug? @jimczi |
So after some research, we were able to just use With NO other configurations and it works as expected: it does NOT split up a phrase into multiple "hits", and it has a nice bit of characters around it as a proper text snippet. Documentation found here, see "type" |
Having just dealt with this issue, I found FVH highlights properly (without breaking up phrases) but the other types do not. Not sure if there's been a recent change. |
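For readers who want to try the FVH route that several comments mention, here is a minimal sketch of the two pieces involved, as Python dicts. The FVH highlighter requires the field to be indexed with `term_vector` set to `with_positions_offsets`. The field name `text`, the helper names, and the typeless mapping layout (7.x and later; 6.x mappings are nested under a type name) are assumptions for illustration.

```python
import json


def fvh_mapping(field: str) -> dict:
    """Index mapping sketch: FVH needs term vectors with positions and offsets."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "term_vector": "with_positions_offsets"}
            }
        }
    }


def fvh_phrase_search(field: str, phrase: str) -> dict:
    """Search body sketch: phrase query highlighted with the FVH highlighter."""
    return {
        "query": {"match_phrase": {field: phrase}},
        "highlight": {"fields": {field: {"type": "fvh"}}},
    }


if __name__ == "__main__":
    print(json.dumps(fvh_mapping("text"), indent=2))
    print(json.dumps(fvh_phrase_search("text", "this mortal coil"), indent=2))
```

Changing `term_vector` on an existing field generally means reindexing, and as noted earlier in the thread, FVH has its own problems with some positional queries.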
Elasticsearch version (bin/elasticsearch --version): 6.2.3
Plugins installed: []
JVM version (java -version): openjdk version "1.8.0_161"
OS version (uname -a if on a Unix-like system): Linux 5137c3a21142 4.9.87-linuxkit-aufs
Description of the problem including expected versus actual behavior:
The highlighter breaks searched phrases into separate highlights, which makes the highlighter results quite annoying to a user. In the example below the expected highlight would look like this:
shuffled off <em>this mortal coil</em>, must give us
Note that while the unified highlighter has this issue, the FVH highlighter behaves as expected.
Steps to reproduce:
This results in the following highlighting, which is practically unusable: