
Highlighters shouldn't error on big documents #52155

Closed
markharwood opened this issue Feb 10, 2020 · 7 comments · Fixed by #67325
Assignees
Labels: :Search Relevance/Highlighting (How a query matched a document), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@markharwood
Contributor

For the casual searcher it is not particularly helpful to have Elasticsearch return an error just because they are unlucky enough to match a big document.
The user gets a 400 error with this sort of message:

The length of [xxx] field ... has exceeded [1000000] - maximum allowed to be analyzed 
for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] 
index level setting. For large texts, indexing with offsets or term vectors is recommended!

At this point the only workarounds the user has are:

a) User rewrites query with a NOT clause to exclude IDs of rogue docs (not ideal)
b) User reindexes content with offsets (a pain)
c) User reindexes content and truncates long strings e.g. with an ingest processor (not ideal)
d) User increases the max_analyzed_offset setting (not ideal)
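For reference, workaround (d) is a dynamic index settings change (a sketch; the index name and limit value are illustrative):

```json
PUT /my-index/_settings
{
  "index.highlight.max_analyzed_offset": 2000000
}
```

This only raises the ceiling; a sufficiently large document will still trigger the error.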

None of these are great options, so the proposal is that highlighters could be prevented from throwing an error and instead use a cheaper "fallback" approach to highlighting, e.g. returning the first N characters of a large string field. The open questions are:

  1. Do we need additional properties in the highlight request to define the fallback approach?
  2. How do we warn the user that a fallback approach was applied for a particular result?
  3. Will some users want the old behaviour of errors rather than fallbacks?
@markharwood added the discuss and :Search Relevance/Highlighting (How a query matched a document) labels Feb 10, 2020
@markharwood markharwood self-assigned this Feb 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Highlighting)

@mayya-sharipova
Contributor

I am also wondering: what if the first N characters don't contain any query terms to highlight? Is it OK to present the user with empty highlights in this case?

@markharwood
Contributor Author

> Is it ok to present a user with empty highlights in this case?

For the Kibana use case I imagine having the Discover page showing some content (maybe with a ... to indicate there's more) would be better than a blank space or an error.

@markharwood
Contributor Author

We discussed this today and had a proposal for how users can avoid this error without resorting to any of the painful workarounds listed above.
As part of the highlight request a user could provide a form of query-time "ignore_above" statement:

"ignore_above": {
  "size": 10000,
  "replacement_value": "Too large ..."
}

The size parameter specifies the maximum field size (where field size is the length in bytes of a value or, if an array, the sum of all element value lengths).
The replacement_value is a bit like the null_value setting in index mappings - it's a user-provided string that they can recognise as indicating content that could not be highlighted.
The proposal was that this ignore_above option could be used for all highlighters and tweaked according to the user's taste. Obviously it wouldn't be useful to set size to a value greater than the index.highlight.max_analyzed_offset setting because this would potentially trigger an error rather than the desired ignore behaviour.
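As a sketch, the proposed option might sit inside the highlight request like this (hypothetical syntax; this proposal was not ultimately implemented in this form, and the index and field names are illustrative):

```json
GET /my-index/_search
{
  "query": { "match": { "body": "fox" } },
  "highlight": {
    "fields": {
      "body": {
        "ignore_above": {
          "size": 10000,
          "replacement_value": "Too large ..."
        }
      }
    }
  }
}
```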

@eedugon
Contributor

eedugon commented Mar 11, 2020

@markharwood : Apparently this error is also raised with fields that are not even indexed.

Taking a look at this mapping:

"base64Message": {
  "type": "text",
  "index": false
}

It also generates errors as Kibana uses highlighting by default:

"The length of [doc.base64Message] field of [AXBikvAZbAZqpEylLWf0] doc of [iib-busnl-prd-service-base64messages-2020.02] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!"

So we have keyword fields with the ignore_above setting, which is ignored by the highlighter (#43800) and where the offsets workaround is not possible, and we also have fields with the index: false option where the offsets workaround cannot be applied either.

Just for your consideration when improving the highlighter. I don't really know whether the fix should be more on the Kibana side (selecting better options when using the highlighter) or in the highlighter engine, to avoid failing search requests because of this.

@mayya-sharipova
Contributor

@eedugon Thanks for letting us know, we will try to address your use case as well

@markharwood
Contributor Author

We discussed this today and came up with two additional proposals:

  1. More pre-empting: as with a document whose value exceeds the ignore_above setting for a keyword field, we should avoid even trying to highlight fields that are known to be unindexed.
  2. Better fallbacks: rather than only flagging in the response that a field was too large, we should also consider providing a best-efforts option of highlighting only the first N characters of a document.

@rjernst added the Team:Search (Meta label for search team) label May 4, 2020
@matriv matriv self-assigned this Nov 30, 2020
matriv added a commit to matriv/elasticsearch that referenced this issue Jan 12, 2021
Add a query parameter `limit_to_max_analyzed_offset` to allow users
to limit the highlighting of text fields to the value of the
`index.highlight.max_analyzed_offset`, thus preventing an exception from being
thrown when the length of the text field exceeds the limit.
The highlighting still takes place but up to the length set by the
setting.

Relates to: elastic#52155
matriv added a commit that referenced this issue Feb 16, 2021
Add a `max_analyzed_offset` query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
`index.highlight.max_analyzed_offset`, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.

Closes: #52155
matriv added a commit that referenced this issue Feb 16, 2021 (#69016)

Add a `max_analyzed_offset` query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
`index.highlight.max_analyzed_offset`, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.

Closes: #52155
(cherry picked from commit f9af60b)
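The merged fix adds a `max_analyzed_offset` option to the highlight request itself. A sketch of how it can be used (index and field names are illustrative; the value must be less than or equal to the index.highlight.max_analyzed_offset setting):

```json
GET /my-index/_search
{
  "query": { "match": { "body": "fox" } },
  "highlight": {
    "max_analyzed_offset": 999999,
    "fields": { "body": {} }
  }
}
```

With this set, highlighting stops at the given offset instead of returning a 400 error when the field exceeds the index-level limit.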
@javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label and removed the Team:Search (Meta label for search team) label Jul 12, 2024
8 participants