Add query param to limit highlighting to specified length (#67325) #69016

Merged
merged 1 commit on Feb 16, 2021

Changes from all commits
1 change: 1 addition & 0 deletions docs/reference/index-modules.asciidoc
@@ -223,6 +223,7 @@ specific index module:
The maximum number of tokens that can be produced using the `_analyze` API.
Defaults to `10000`.

[[index-max-analyzed-offset]]
`index.highlight.max_analyzed_offset`::

The maximum number of characters that will be analyzed for a highlight request.
15 changes: 13 additions & 2 deletions docs/reference/search/search-your-data/highlighting.asciidoc
@@ -117,7 +117,7 @@ needs highlighting. The `plain` highlighter always uses plain highlighting.
Plain highlighting for large texts may require a substantial amount of time and memory.
To protect against this, the maximum number of text characters that will be analyzed is
limited to 1000000. This default limit can be changed
for a particular index with the index setting `index.highlight.max_analyzed_offset`.
for a particular index with the index setting <<index-max-analyzed-offset,`index.highlight.max_analyzed_offset`>>.
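
For illustration, a minimal sketch of changing that limit for a single index. This assumes a hypothetical index named `my-index`; `index.highlight.max_analyzed_offset` is a dynamic setting, so it can be updated on a live index:

[source,console]
----
PUT /my-index/_settings
{
  "index.highlight.max_analyzed_offset": 1000000
}
----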

[discrete]
[[highlighting-settings]]
@@ -242,6 +242,17 @@ require_field_match:: By default, only fields that contain a query match are
highlighted. Set `require_field_match` to `false` to highlight all fields.
Defaults to `true`.

[[max-analyzed-offset]]
max_analyzed_offset:: By default, the maximum number of characters
analyzed for a highlight request is bounded by the value defined in the
<<index-max-analyzed-offset, `index.highlight.max_analyzed_offset`>> setting,
and an error is returned when this limit is exceeded. If this query setting
is set to a non-negative value, highlighting stops at that offset and the
rest of the text is not processed (and therefore not highlighted), and no
error is returned. The <<max-analyzed-offset, `max_analyzed_offset`>> query
setting does *not* override <<index-max-analyzed-offset, `index.highlight.max_analyzed_offset`>>,
which prevails when it is set to a lower value than the query setting. A
request-level sketch is shown below.
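
For illustration, a hedged sketch of a search request that caps highlighting analysis at 20 characters per field instead of failing; the index name `my-index` and field `comment` are placeholders:

[source,console]
----
GET /my-index/_search
{
  "query": {
    "match": { "comment": "fox" }
  },
  "highlight": {
    "max_analyzed_offset": 20,
    "fields": {
      "comment": {}
    }
  }
}
----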

tags_schema:: Set to `styled` to use the built-in tag schema. The `styled`
schema defines the following `pre_tags` and defines `post_tags` as
`</em>`.
@@ -1121,4 +1132,4 @@ using the passages' `matchStarts` and `matchEnds` information:
I'll be the <em>only</em> <em>fox</em> in the world for you.

This kind of formatted string is the final result of the highlighter, returned
to the user.
@@ -50,8 +50,8 @@ protected List<Object> loadFieldValues(
}

@Override
protected Analyzer wrapAnalyzer(Analyzer analyzer) {
return new AnnotatedHighlighterAnalyzer(super.wrapAnalyzer(analyzer));
protected Analyzer wrapAnalyzer(Analyzer analyzer, Integer maxAnalyzedOffset) {
return new AnnotatedHighlighterAnalyzer(super.wrapAnalyzer(analyzer, maxAnalyzedOffset));
}

@Override
@@ -8,6 +8,14 @@

package org.elasticsearch.search.fetch.subphase.highlight;

import static org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter.MULTIVAL_SEP_CHAR;
import static org.hamcrest.CoreMatchers.equalTo;

import java.net.URLEncoder;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Locale;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
@@ -31,95 +39,96 @@
import org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter;
import org.apache.lucene.search.uhighlight.Snippet;
import org.apache.lucene.search.uhighlight.SplittingBreakIterator;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.Directory;
import org.elasticsearch.common.Strings;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotationAnalyzerWrapper;
import org.elasticsearch.test.ESTestCase;

public class AnnotatedTextHighlighterTests extends ESTestCase {

    private void assertHighlightOneDoc(String fieldName, String[] markedUpInputs,
            Query query, Locale locale, BreakIterator breakIterator,
            int noMatchSize, String[] expectedPassages) throws Exception {
        assertHighlightOneDoc(fieldName, markedUpInputs, query, locale, breakIterator, noMatchSize, expectedPassages,
            Integer.MAX_VALUE, null);
    }

    private void assertHighlightOneDoc(String fieldName, String[] markedUpInputs,
            Query query, Locale locale, BreakIterator breakIterator,
            int noMatchSize, String[] expectedPassages,
            int maxAnalyzedOffset, Integer queryMaxAnalyzedOffset) throws Exception {

        try (Directory dir = newDirectory()) {
            // Annotated fields wrap the usual analyzer with one that injects extra tokens
            Analyzer wrapperAnalyzer = new AnnotationAnalyzerWrapper(new StandardAnalyzer());
            IndexWriterConfig iwc = newIndexWriterConfig(wrapperAnalyzer);
            iwc.setMergePolicy(newTieredMergePolicy(random()));
            RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc);
            FieldType ft = new FieldType(TextField.TYPE_STORED);
            if (randomBoolean()) {
                ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            } else {
                ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
            }
            ft.freeze();
            Document doc = new Document();
            for (String input : markedUpInputs) {
                Field field = new Field(fieldName, "", ft);
                field.setStringValue(input);
                doc.add(field);
            }
            iw.addDocument(doc);
            try (DirectoryReader reader = iw.getReader()) {
                IndexSearcher searcher = newSearcher(reader);
                iw.close();

                AnnotatedText[] annotations = new AnnotatedText[markedUpInputs.length];
                for (int i = 0; i < markedUpInputs.length; i++) {
                    annotations[i] = AnnotatedText.parse(markedUpInputs[i]);
                }
                AnnotatedHighlighterAnalyzer hiliteAnalyzer = new AnnotatedHighlighterAnalyzer(wrapperAnalyzer);
                hiliteAnalyzer.setAnnotations(annotations);
                AnnotatedPassageFormatter passageFormatter = new AnnotatedPassageFormatter(new DefaultEncoder());
                passageFormatter.setAnnotations(annotations);

                ArrayList<Object> plainTextForHighlighter = new ArrayList<>(annotations.length);
                for (int i = 0; i < annotations.length; i++) {
                    plainTextForHighlighter.add(annotations[i].textMinusMarkup);
                }

                TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), 1, Sort.INDEXORDER);
                assertThat(topDocs.totalHits.value, equalTo(1L));
                String rawValue = Strings.collectionToDelimitedString(plainTextForHighlighter, String.valueOf(MULTIVAL_SEP_CHAR));
                CustomUnifiedHighlighter highlighter = new CustomUnifiedHighlighter(
                    searcher,
                    hiliteAnalyzer,
                    UnifiedHighlighter.OffsetSource.ANALYSIS,
                    passageFormatter,
                    locale,
                    breakIterator,
                    "index",
                    "text",
                    query,
                    noMatchSize,
                    expectedPassages.length,
                    name -> "text".equals(name),
                    maxAnalyzedOffset,
                    queryMaxAnalyzedOffset
                );
                highlighter.setFieldMatcher((name) -> "text".equals(name));
                final Snippet[] snippets = highlighter.highlightField(getOnlyLeafReader(reader), topDocs.scoreDocs[0].doc, () -> rawValue);
                assertEquals(expectedPassages.length, snippets.length);
                for (int i = 0; i < snippets.length; i++) {
                    assertEquals(expectedPassages[i], snippets[i].getText());
                }
            }
        }
    }


public void testAnnotatedTextStructuredMatch() throws Exception {
// Check that a structured token eg a URL can be highlighted in a query
// on marked-up
@@ -191,4 +200,65 @@ public void testBadAnnotation() throws Exception {
assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
}

    public void testExceedMaxAnalyzedOffset() throws Exception {
        TermQuery query = new TermQuery(new Term("text", "exceeds"));
        BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
        // Field length (10) is within the index limit (10); no error, and no passages since the query does not match.
        assertHighlightOneDoc("text", new String[] { "[Short Text](Short+Text)" }, query, Locale.ROOT, breakIterator, 0, new String[] {},
            10, null);

        // Exceeding the index limit without a query-level max_analyzed_offset raises an error.
        IllegalArgumentException e = expectThrows(
            IllegalArgumentException.class,
            () -> assertHighlightOneDoc(
                "text",
                new String[] { "[Long Text exceeds](Long+Text+exceeds) MAX analyzed offset)" },
                query,
                Locale.ROOT,
                breakIterator,
                0,
                new String[] {},
                20,
                null
            )
        );
        assertEquals(
            "The length [38] of field [text] in doc[0]/index[index] exceeds the [index.highlight.max_analyzed_offset] limit [20]. "
                + "To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [20] and this "
                + "will tolerate long field values by truncating them.",
            e.getMessage()
        );

        // A query-level value above the index setting does not lift the limit; the same error is raised.
        final Integer queryMaxOffset = randomIntBetween(21, 1000);
        e = expectThrows(
            IllegalArgumentException.class,
            () -> assertHighlightOneDoc(
                "text",
                new String[] { "[Long Text exceeds](Long+Text+exceeds) MAX analyzed offset)" },
                query,
                Locale.ROOT,
                breakIterator,
                0,
                new String[] {},
                20,
                queryMaxOffset
            )
        );
        assertEquals(
            "The length [38] of field [text] in doc[0]/index[index] exceeds the [index.highlight.max_analyzed_offset] limit [20]. "
                + "To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [20] and this "
                + "will tolerate long field values by truncating them.",
            e.getMessage()
        );

        // A query-level value below the index limit truncates analysis at that offset instead of failing.
        assertHighlightOneDoc(
            "text",
            new String[] { "[Long Text Exceeds](Long+Text+Exceeds) MAX analyzed offset [Long Text Exceeds](Long+Text+Exceeds)" },
            query,
            Locale.ROOT,
            breakIterator,
            0,
            new String[] { "Long Text [Exceeds](_hit_term=exceeds) MAX analyzed offset [Long Text Exceeds](Long+Text+Exceeds)" },
            20,
            15
        );
    }
}
@@ -6,7 +6,7 @@ setup:
body:
settings:
number_of_shards: 1
index.highlight.max_analyzed_offset: 10
index.highlight.max_analyzed_offset: 30
mappings:
properties:
field1:
@@ -39,6 +39,20 @@ setup:
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Unified highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset with max_analyzed_offset=20 should SUCCEED":

- skip:
version: " - 7.11.99"
reason: max_analyzed_offset query param added in 7.12.0

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}, "max_analyzed_offset": "20"}}
- match: {hits.hits.0.highlight.field1.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}


---
"Plain highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset should FAIL":
@@ -50,9 +64,23 @@ setup:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}}}
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field1" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Plain highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset with max_analyzed_offset=20 should SUCCEED":

- skip:
version: " - 7.11.99"
reason: max_analyzed_offset query param added in 7.12.0

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field1" : {}}, "max_analyzed_offset": 20}}
- match: {hits.hits.0.highlight.field1.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}


---
"Unified highlighter on a field WITH OFFSETS exceeding index.highlight.max_analyzed_offset should SUCCEED":
Expand All @@ -79,3 +107,35 @@ setup:
index: test1
body: {"query" : {"match" : {"field2" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field2" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Plain highlighter on a field WITH OFFSETS exceeding index.highlight.max_analyzed_offset with max_analyzed_offset=20 should SUCCEED":

- skip:
version: " - 7.11.99"
reason: max_analyzed_offset query param added in 7.12.0

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field2" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field2" : {}}, "max_analyzed_offset": 20}}
- match: {hits.hits.0.highlight.field2.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}

---
"Plain highlighter with max_analyzed_offset < 0 should FAIL":

- skip:
version: " - 7.11.99"
reason: max_analyzed_offset query param added in 7.12.0

- do:
catch: bad_request
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field2" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field2" : {}}, "max_analyzed_offset": -10}}
- match: { status: 400 }
- match: { error.root_cause.0.type: "x_content_parse_exception" }
- match: { error.caused_by.type: "illegal_argument_exception" }
- match: { error.caused_by.reason: "[max_analyzed_offset] must be a positive integer" }