Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the postings highlighter and make unified the default highlighter choice #25028

Merged
merged 3 commits into from
Jun 9, 2017

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Jun 2, 2017

This change removes the postings highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the unified highlighter when index_options is set to offsets:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the unified highlighter the default choice for highlighting a field (if type is not provided).
The strategy used internally by this highlighter remain the same as before, it checks term_vectors first, then postings and ultimately it re-analyzes the text.
This change also rewrites the docs so that the options that the unified highlighter cannot handle are clearly marked as such.
There are few features that the unified highlighter is not able to handle which is why the other highlighters (plain and fvh) are still available.
I'll open separate issues for these features and we'll deprecate the fvh and plain highlighters when full support for these features have been added to the unified.

@@ -98,3 +98,14 @@ but the only reason why it has not been deprecated too is because it is used
for the `random_score` function. If you really need access to the id of
documents for sorting, aggregations or search scripts, the recommandation is
to duplicate the id as a field in the document.

==== Highlighers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing t

Copy link
Contributor

@clintongormley clintongormley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few docs comments

==== Highlighers

The `unified` highlighter is the new default choice for highlighter.
The offset strategy for each field is picked internally by this highlighter depending on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by the offset strategy?

Allows to highlight search results on one or more fields. The
implementation uses either the lucene `plain` highlighter, the
fast vector highlighter (`fvh`) or `postings` highlighter.
Allows to highlight search results on one or more fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Highlighters allow you to produce highlighted snippets from one or more fields in your search results."


The default choice of highlighter is of type `plain` and uses the Lucene highlighter.
It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
The default choice of highlighter is of type `unified` and uses the Lucene Unified highlighter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unified highlighter (which is used by default if no highlighter type is specified) uses the Lucene Unified Highlighter.

It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting.

[float]
===== Offsets Strategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offsets Strategy requires some explanation, perhaps:

In order to create meaningful search snippets from the terms being queried, a highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:

  • The postings list (fields mapped as "index_options": "offsets").
  • Term vectors (fields mapped as "term_vectors": "with_positions_offsets").
  • The original field, by reanalysing the text on-the-fly.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minors but LGTM. I guess next is to deprecate the postings highlighter in 5.x, right?

.endObject().endObject().endObject()
.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title and don't use term vector, now lets see if it works...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep the indentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that :(

.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title, now lets see if it works...
.startObject("title")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

will use this information to highlight documents without re-analyzing the text.
It re-runs the original query directly on the postings and extracts the matching offsets
directly from the index limiting the collection to the highlighted documents.
This mode is faster since it doesn't require to reanalyze the text to be highlighted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it isn't actually faster for very short strings. At least, that was my experience with the experimental highlighter.

@@ -814,6 +833,8 @@ to

[[phrase-limit]]
==== Phrase Limit

WARNING:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing "this is only supported by the fast vector highlighter", I think.

@jimczi
Copy link
Contributor Author

jimczi commented Jun 2, 2017

Thanks @clintongormley and @nik9000
I pushed a commit to address your comments.

I guess next is to deprecate the postings highlighter in 5.x

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

@nik9000
Copy link
Member

nik9000 commented Jun 2, 2017

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

I'm afraid of something sneaky like that. I think we're better off deprecating in 5.x so users of the postings highlighter know that they should test with the unified highlighter. Even if it is the same code I'd prefer they know about the change rather than get some kind of surprise on upgrade.

@javanna
Copy link
Member

javanna commented Jun 2, 2017

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type), and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors).

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

@jimczi
Copy link
Contributor Author

jimczi commented Jun 6, 2017

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

Ok so I'll start with the deprecation in 5.x

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type)

I think it affects 5.x only where we have two options:

  • Deprecate postings but only when the type is forced and redirect automatically to unified (instead of postings) when the field options for offsets are set.
  • Deprecate postings but continue to redirect automatically to this highlighter when the field options are set.

I am leaning toward option 2 since the unified highlighter also handles phrase/span queries (and the postings just ignores the positions).

and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors)

If the desired behavior for 6.x is to use the unified highlighter by default I don't think we need to split this in two PRs. Doing so would require to change the tests and documentations twice, with a single PR the change is simpler, we don't need to have a replacement for the automatic postings mode.

@jimczi
Copy link
Contributor Author

jimczi commented Jun 6, 2017

I opened #25073 for the deprecation in 5.x

jimczi added 2 commits June 8, 2017 14:00
…ter choice

This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
@jimczi jimczi force-pushed the unified_highlighting_default branch from c1e357f to 4feaaca Compare June 8, 2017 12:00
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jun 9, 2017
This change adds a deprecation warning for removal in 6.0.

Relates elastic#25028
jimczi added a commit that referenced this pull request Jun 9, 2017
This change adds a deprecation warning for removal in 6.0.
Only one deprecation is logged per request
Relates #25028
@jimczi jimczi merged commit 8250aa4 into elastic:master Jun 9, 2017
@jimczi jimczi deleted the unified_highlighting_default branch June 9, 2017 12:09
@jimczi jimczi removed the review label Jun 9, 2017
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017
* master: (53 commits)
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  [DOCS] update maxRetryTimeout in java REST client usage page
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017
* master: (80 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017
* master: (1889 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...
@colings86 colings86 removed the v6.0.0 label Aug 4, 2017
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 11, 2017
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants