Highlighters shouldn't error on big documents #52155
Pinging @elastic/es-search (:Search/Highlighting)
I am also wondering: what if the first N characters don't contain any query terms to highlight? Is it OK to present a user with empty highlights in this case?
For the Kibana use case, I imagine having the Discover page show some content (maybe with a "..." to indicate there's more) would be better than a blank space or an error.
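A first-N-characters fallback along these lines could be sketched as follows (a hypothetical helper for illustration only, not part of Elasticsearch or Kibana; the length 200 is an arbitrary choice):

```python
def fallback_snippet(text: str, n: int = 200) -> str:
    """Cheap fallback for unhighlightable fields: return the first n
    characters, appending an ellipsis when the text was truncated."""
    return text if len(text) <= n else text[:n] + "..."
```

A UI like Discover could render this snippet whenever highlighting is skipped, instead of showing a blank cell or surfacing the error.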
We discussed this today and had a proposal for how users can avoid this error without resorting to any of the painful workarounds listed above.
The size parameter specifies the maximum field size (where field size is the length in bytes of a value or, if an array, the sum of all element value lengths).
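That definition of field size could be sketched in a few lines (an illustrative helper, not Elasticsearch code; it assumes "length in bytes" means UTF-8 byte length):

```python
def field_size(value) -> int:
    """Field size per the definition above: the byte length of a value,
    or, for an array, the sum of all element value lengths."""
    if isinstance(value, list):
        return sum(field_size(v) for v in value)
    return len(str(value).encode("utf-8"))
```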
@markharwood: Apparently this error is also raised for fields that are not even indexed. Taking a look at this mapping:
It also generates errors as Kibana uses highlighting by default:
So, we have keywords with this problem. Just for your consideration for improving the highlighter: I don't really know whether the fix should be on the Kibana side (selecting better options when using the highlighter) or in the highlighter engine itself, to avoid failing search requests because of this.
@eedugon Thanks for letting us know; we will try to address your use case as well.
We discussed this today and came up with two additional proposals:
Add a query parameter `limit_to_max_analyzed_offset` to allow users to limit the highlighting of text fields to the value of `index.highlight.max_analyzed_offset`, thus preventing an exception from being thrown when the length of the text field exceeds the limit. The highlighting still takes place, but only up to the length set by the setting. Relates to: elastic#52155
Add a `max_analyzed_offset` query parameter to allow users to limit the highlighting of text fields to a value less than or equal to the `index.highlight.max_analyzed_offset`, thus avoiding an exception when the length of the text field exceeds the limit. The highlighting still takes place, but stops at the length defined by the new parameter. Closes: #52155
…69016) Add a `max_analyzed_offset` query parameter to allow users to limit the highlighting of text fields to a value less than or equal to the `index.highlight.max_analyzed_offset`, thus avoiding an exception when the length of the text field exceeds the limit. The highlighting still takes place, but stops at the length defined by the new parameter. Closes: #52155 (cherry picked from commit f9af60b)
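Under the parameter described in these commits, a search request could guard against oversized fields roughly like this (a sketch only; the field name `body`, the query text, and the offset value are illustrative, and the body is built but not sent to any cluster):

```python
# Hypothetical search body using the max_analyzed_offset highlight
# parameter from the commits above. Highlighting stops at this offset
# instead of failing when a field exceeds index.highlight.max_analyzed_offset.
search_body = {
    "query": {"match": {"body": "error"}},
    "highlight": {
        "fields": {"body": {}},
        "max_analyzed_offset": 999999,
    },
}
```

The value must be less than or equal to the index-level `index.highlight.max_analyzed_offset` setting for the guard to take effect.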
For the casual searcher, it is not particularly helpful to have Elasticsearch return an error if they are unlucky enough to match a big document.
The user gets a 400 error with this sort of message:
At this point the only workarounds the user has are:
a) User rewrites query with a NOT clause to exclude IDs of rogue docs (not ideal)
b) User reindexes content with offsets (a pain)
c) User reindexes content and truncates long strings e.g. with an ingest processor (not ideal)
d) User increases the max_analyzed_offset setting (not ideal)
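Workaround (d), for reference, amounts to a dynamic index settings update. A minimal sketch (the index name `my-index` and the value 2000000 are illustrative; this only builds the request payload and makes no cluster call):

```python
import json

# Raise the per-index analyzed-offset limit for highlighting.
settings_body = {"index": {"highlight": {"max_analyzed_offset": 2000000}}}
payload = json.dumps(settings_body)
# e.g. PUT /my-index/_settings with the payload above
```

Note this trades the error for more work per highlighted document, which is exactly why it is listed as "not ideal".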
None of these are great options, so the proposal is that highlighters should not throw an error and should instead use a cheaper "fallback" approach to highlighting, e.g. returning the first N characters of a large string field. The open questions are: