Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include all sentences smaller than fragment_size in the unified highlighter #28132

Merged

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Jan 8, 2018

Currently the unified highlighter selects a single sentence per fragment from the offset of the first highlighted term no matter if this sentence is smaller than the provided fragment_size.
This change modifies this selection and allows more than one sentence in a single fragment.
The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff the overall size of the fragment is smaller than the maximum length (fragment_size).
We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment.
If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in the core of the unified highlighter.

Closes #28089

…ighter

The unified highlighter selects a single sentence per fragment from the offset of the first highlighted term.
This change modifies this selection and allows more than one sentence in a single fragment.
The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff
the overall size of the fragment is smaller than the maximum length (fragment_size).
We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the
unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment.
If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in
the core of the unified highlighter.

Closes elastic#28089
@jimczi jimczi added the review label Jan 9, 2018
@romseygeek
Copy link
Contributor

LGTM

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi jimczi merged commit 87c841d into elastic:master Jan 11, 2018
@jimczi jimczi deleted the enhancements/unified_expand_fragment_length branch January 11, 2018 12:26
@jimczi jimczi removed the review label Jan 11, 2018
jimczi added a commit that referenced this pull request Jan 11, 2018
…ighter (#28132)

The unified highlighter selects a single sentence per fragment from the offset of the first highlighted term.
This change modifies this selection and allows more than one sentence in a single fragment.
The expansion is done forward (on the right of the matching offset), sentences are added to the current fragment iff the overall size of the fragment is smaller than the maximum length (fragment_size).
We should also add a way to expand the left context with the surrounding sentences but this is currently avoided because the unified highlighter in Lucene uses only the first offset that matches the query to derive the start and end offset of the next fragment.
If we expand on the left we could split multiple terms that would be grouped otherwise. Breaking this limitation implies some changes in the core of the unified highlighter.

Closes #28089
jasontedor added a commit that referenced this pull request Jan 11, 2018
* master: (43 commits)
  Rename core module to server (#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (#28174)
  [Docs] Remove Kerberos/SPNEGO Shield plugin (#28019)
  Ignore null value for range field (#27845) (#28116)
  Fix environment variable substitutions in list setting (#28106)
  docs: Replaces indexed script java api docs with stored script api docs
  test: ensure we endup with a single segment
  Make sure that we don't detect files as maven coordinate when installing a plugin (#28163)
  [Tests] temporary disable meta plugin rest tests #28163
  meta-plugin should install bin and config at the top level (#28162)
  Painless: Add public member read/write access test. (#28156)
  Docs: Clarify password protection support with keystore (#28157)
  [Docs] fix plugin properties inclusion for plugins authors
  ...
jasontedor added a commit that referenced this pull request Jan 11, 2018
* 6.x: (41 commits)
  Rename core module to server (#28190)
  [Test] Remove superfluous lines
  [Test] Add skip test for index range field with null values
  upgraded jna from 4.4.0-1 to 4.5.1 (#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  [Docs] Correct response json in rank-eval.asciidoc
  Fix environment variable substitutions in list setting (#28106)
  Add scroll parameter to _reindex API (#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (#28132)
  [Docs] Improvements in script-fields.asciidoc (#28174)
  [Docs] Remove Kerberos/SPNEGO Shield plugin (#28019)
  Ignore null value for range field (#27845) (#28116)
  docs: Replaces indexed script java api docs with stored script api docs
  test: ensure we endup with a single segment
  Make sure that we don't detect files as maven coordinate when installing a plugin (#28163)
  [Tests] temporary disable meta plugin rest tests #28163
  meta-plugin should install bin and config at the top level (#28162)
  Painless: Add public member read/write access test. (#28156)
  Docs: Clarify password protection support with keystore (#28157)
  [Docs] fix plugin properties inclusion for plugins authors
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 13, 2018
* master: (30 commits)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks  (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (elastic#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (elastic#28174)
  [Docs] Remove Kerberos/SPNEGO Shield plugin (elastic#28019)
  Ignore null value for range field (elastic#27845) (elastic#28116)
  Fix environment variable substitutions in list setting (elastic#28106)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 16, 2018
* compile-with-jdk-9: (56 commits)
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks  (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (elastic#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (elastic#28174)
  ...
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants