Include all sentences smaller than fragment_size in the unified highlighter #28132

jimczi · 2018-01-08T15:38:12Z

Currently the unified highlighter selects a single sentence per fragment from the offset of the first highlighted term no matter if this sentence is smaller than the provided fragment_size.
This change modifies this selection and allows more than one sentence in a single fragment.
The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff the overall size of the fragment is smaller than the maximum length (fragment_size).
We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment.
If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in the core of the unified highlighter.

Closes #28089

…ighter The unified highlighter selects a single sentence per fragment from the offset of the first highlighted term. This change modifies this selection and allows more than one sentence in a single fragment. The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff the overall size of the fragment is smaller than the maximum length (fragment_size). We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment. If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in the core of the unified highlighter. Closes elastic#28089

romseygeek · 2018-01-10T17:06:31Z

LGTM

romseygeek

LGTM

…ighter (#28132) The unified highlighter selects a single sentence per fragment from the offset of the first highlighted term. This change modifies this selection and allows more than one sentence in a single fragment. The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff the overall size of the fragment is smaller than the maximum length (fragment_size). We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment. If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in the core of the unified highlighter. Closes #28089

* master: (43 commits) Rename core module to server (#28180) upgraded jna from 4.4.0-1 to 4.5.1 (#28183) [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads Primary send safe commit in file-based recovery (#28038) [Docs] Correct response json in rank-eval.asciidoc Add scroll parameter to _reindex API (#28041) Include all sentences smaller than fragment_size in the unified highlighter (#28132) Modifies the JavaAPI docs related to AggregationBuilder [Docs] Improvements in script-fields.asciidoc (#28174) [Docs] Remove Kerberos/SPNEGO Shield plugin (#28019) Ignore null value for range field (#27845) (#28116) Fix environment variable substitutions in list setting (#28106) docs: Replaces indexed script java api docs with stored script api docs test: ensure we endup with a single segment Make sure that we don't detect files as maven coordinate when installing a plugin (#28163) [Tests] temporary disable meta plugin rest tests #28163 meta-plugin should install bin and config at the top level (#28162) Painless: Add public member read/write access test. (#28156) Docs: Clarify password protection support with keystore (#28157) [Docs] fix plugin properties inclusion for plugins authors ...

* 6.x: (41 commits) Rename core module to server (#28190) [Test] Remove superfluous lines [Test] Add skip test for index range field with null values upgraded jna from 4.4.0-1 to 4.5.1 (#28183) [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads [Docs] Correct response json in rank-eval.asciidoc Fix environment variable substitutions in list setting (#28106) Add scroll parameter to _reindex API (#28041) Include all sentences smaller than fragment_size in the unified highlighter (#28132) [Docs] Improvements in script-fields.asciidoc (#28174) [Docs] Remove Kerberos/SPNEGO Shield plugin (#28019) Ignore null value for range field (#27845) (#28116) docs: Replaces indexed script java api docs with stored script api docs test: ensure we endup with a single segment Make sure that we don't detect files as maven coordinate when installing a plugin (#28163) [Tests] temporary disable meta plugin rest tests #28163 meta-plugin should install bin and config at the top level (#28162) Painless: Add public member read/write access test. (#28156) Docs: Clarify password protection support with keystore (#28157) [Docs] fix plugin properties inclusion for plugins authors ...

* master: (30 commits) Fix lock accounting in releasable lock Add ability to associate an ID with tasks (elastic#27764) [DOCS] Removed differencies between text and code (elastic#27993) text fixes (elastic#28136) Update getting-started.asciidoc (elastic#28145) [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187) Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186) Do not keep 5.x commits once having 6.x commits (elastic#28188) Rename core module to server (elastic#28180) upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183) [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads Primary send safe commit in file-based recovery (elastic#28038) [Docs] Correct response json in rank-eval.asciidoc Add scroll parameter to _reindex API (elastic#28041) Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132) Modifies the JavaAPI docs related to AggregationBuilder [Docs] Improvements in script-fields.asciidoc (elastic#28174) [Docs] Remove Kerberos/SPNEGO Shield plugin (elastic#28019) Ignore null value for range field (elastic#27845) (elastic#28116) Fix environment variable substitutions in list setting (elastic#28106) ...

* compile-with-jdk-9: (56 commits) TEST: init unassigned gcp in testAcquireIndexCommit Replica start peer recovery with safe commit (elastic#28181) Truncate tlog cli should assign global checkpoint (elastic#28192) Fix lock accounting in releasable lock Add ability to associate an ID with tasks (elastic#27764) [DOCS] Removed differencies between text and code (elastic#27993) text fixes (elastic#28136) Update getting-started.asciidoc (elastic#28145) [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187) Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186) Do not keep 5.x commits once having 6.x commits (elastic#28188) Rename core module to server (elastic#28180) upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183) [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads Primary send safe commit in file-based recovery (elastic#28038) [Docs] Correct response json in rank-eval.asciidoc Add scroll parameter to _reindex API (elastic#28041) Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132) Modifies the JavaAPI docs related to AggregationBuilder [Docs] Improvements in script-fields.asciidoc (elastic#28174) ...

jimczi added :Search Relevance/Highlighting How a query matched a document >enhancement v6.2.0 v7.0.0 labels Jan 8, 2018

remove unused import

087cb5c

jimczi added the review label Jan 9, 2018

romseygeek approved these changes Jan 11, 2018

View reviewed changes

jimczi merged commit 87c841d into elastic:master Jan 11, 2018

jimczi deleted the enhancements/unified_expand_fragment_length branch January 11, 2018 12:26

jimczi removed the review label Jan 11, 2018

jimczi mentioned this pull request Jan 31, 2018

Bug: Unified Highlighter does not care about fragment_size #28462

Closed

jimczi mentioned this pull request Mar 22, 2018

Unified highlighter: include additional context outside of highlighted sentence to reach target fragment_size #28089

Closed

jimczi mentioned this pull request Apr 18, 2018

Highlighter breaks phrases #29561

Closed

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include all sentences smaller than fragment_size in the unified highlighter #28132

Include all sentences smaller than fragment_size in the unified highlighter #28132

jimczi commented Jan 8, 2018

romseygeek commented Jan 10, 2018

romseygeek left a comment

Include all sentences smaller than fragment_size in the unified highlighter #28132

Include all sentences smaller than fragment_size in the unified highlighter #28132

Conversation

jimczi commented Jan 8, 2018

romseygeek commented Jan 10, 2018

romseygeek left a comment

Choose a reason for hiding this comment