Remove the postings highlighter and make unified the default highlighter choice #25028

jimczi · 2017-06-02T14:52:53Z

This change removes the postings highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the unified highlighter when index_options is set to offsets:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the unified highlighter the default choice for highlighting a field (if type is not provided).
The strategy used internally by this highlighter remain the same as before, it checks term_vectors first, then postings and ultimately it re-analyzes the text.
This change also rewrites the docs so that the options that the unified highlighter cannot handle are clearly marked as such.
There are few features that the unified highlighter is not able to handle which is why the other highlighters (plain and fvh) are still available.
I'll open separate issues for these features and we'll deprecate the fvh and plain highlighters when full support for these features have been added to the unified.

clintongormley · 2017-06-02T15:11:42Z

docs/reference/migration/migrate_6_0/search.asciidoc

@@ -98,3 +98,14 @@ but the only reason why it has not been deprecated too is because it is used
 for the `random_score` function. If you really need access to the id of
 documents for sorting, aggregations or search scripts, the recommandation is
 to duplicate the id as a field in the document.
+
+==== Highlighers


clintongormley

Added a few docs comments

clintongormley · 2017-06-02T15:12:33Z

docs/reference/migration/migrate_6_0/search.asciidoc

+==== Highlighers
+
+The `unified` highlighter is the new default choice for highlighter.
+The offset strategy for each field is picked internally by this highlighter depending on the


What do you mean by the offset strategy?

clintongormley · 2017-06-02T15:15:31Z

docs/reference/search/request/highlighting.asciidoc

-Allows to highlight search results on one or more fields. The
-implementation uses either the lucene `plain` highlighter, the
-fast vector highlighter (`fvh`) or `postings` highlighter.
+Allows to highlight search results on one or more fields.


"Highlighters allow you to produce highlighted snippets from one or more fields in your search results."

clintongormley · 2017-06-02T15:17:45Z

docs/reference/search/request/highlighting.asciidoc


-The default choice of highlighter is of type `plain` and uses the Lucene highlighter.
-It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
+The default choice of highlighter is of type `unified` and uses the Lucene Unified highlighter.


The unified highlighter (which is used by default if no highlighter type is specified) uses the Lucene Unified Highlighter.

clintongormley · 2017-06-02T15:27:24Z

docs/reference/search/request/highlighting.asciidoc

+It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting.
+
+[float]
+===== Offsets Strategy


Offsets Strategy requires some explanation, perhaps:

In order to create meaningful search snippets from the terms being queried, a highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:

The postings list (fields mapped as "index_options": "offsets").

Term vectors (fields mapped as "term_vectors": "with_positions_offsets").

The original field, by reanalysing the text on-the-fly.

nik9000

I left a few minors but LGTM. I guess next is to deprecate the postings highlighter in 5.x, right?

nik9000 · 2017-06-02T15:26:25Z

core/src/test/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java

-                        .endObject().endObject().endObject()
-                        .endObject().endObject().endObject()));
+            .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
+                // we don't store title and don't use term vector, now lets see if it works...


Can you keep the indentation?

Sorry about that :(

nik9000 · 2017-06-02T15:26:44Z

core/src/test/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java

-                        .endObject().endObject().endObject()));
+            .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
+                // we don't store title, now lets see if it works...
+                .startObject("title")


nik9000 · 2017-06-02T15:30:15Z

docs/reference/search/request/highlighting.asciidoc

+will use this information to highlight documents without re-analyzing the text.
+It re-runs the original query directly on the postings and extracts the matching offsets
+directly from the index limiting the collection to the highlighted documents.
+This mode is faster since it doesn't require to reanalyze the text to be highlighted


I expect it isn't actually faster for very short strings. At least, that was my experience with the experimental highlighter.

nik9000 · 2017-06-02T15:32:25Z

docs/reference/search/request/highlighting.asciidoc

@@ -814,6 +833,8 @@ to

 [[phrase-limit]]
 ==== Phrase Limit
+
+WARNING:


Missing "this is only supported by the fast vector highlighter", I think.

jimczi · 2017-06-02T16:08:29Z

Thanks @clintongormley and @nik9000
I pushed a commit to address your comments.

I guess next is to deprecate the postings highlighter in 5.x

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

nik9000 · 2017-06-02T16:13:23Z

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

I'm afraid of something sneaky like that. I think we're better off deprecating in 5.x so users of the postings highlighter know that they should test with the unified highlighter. Even if it is the same code I'd prefer they know about the change rather than get some kind of surprise on upgrade.

javanna · 2017-06-02T22:35:37Z

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type), and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors).

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

jimczi · 2017-06-06T11:27:33Z

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

Ok so I'll start with the deprecation in 5.x

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type)

I think it affects 5.x only where we have two options:

Deprecate postings but only when the type is forced and redirect automatically to unified (instead of postings) when the field options for offsets are set.
Deprecate postings but continue to redirect automatically to this highlighter when the field options are set.

I am leaning toward option 2 since the unified highlighter also handles phrase/span queries (and the postings just ignores the positions).

and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors)

If the desired behavior for 6.x is to use the unified highlighter by default I don't think we need to split this in two PRs. Doing so would require to change the tests and documentations twice, with a single PR the change is simpler, we don't need to have a replacement for the automatic postings mode.

jimczi · 2017-06-06T13:10:42Z

I opened #25073 for the deprecation in 5.x

…ter choice This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`: https://issues.apache.org/jira/browse/LUCENE-7815 It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided). The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text. Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such. There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available. I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.

This change adds a deprecation warning for removal in 6.0. Relates elastic#25028

This change adds a deprecation warning for removal in 6.0. Only one deprecation is logged per request Relates #25028

* master: (53 commits) Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) [DOCS] update maxRetryTimeout in java REST client usage page ...

* master: (80 commits) Test: remove faling test that relies on merge order Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) ...

* master: (1889 commits) Test: remove faling test that relies on merge order Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) ...

jimczi added :Search Relevance/Highlighting How a query matched a document >breaking >feature review v6.0.0 labels Jun 2, 2017

nik9000 self-requested a review June 2, 2017 14:54

clintongormley reviewed Jun 2, 2017

View reviewed changes

nik9000 approved these changes Jun 2, 2017

View reviewed changes

jimczi mentioned this pull request Jun 6, 2017

Postings highlighter deprecation #25073

Merged

jpountz mentioned this pull request Jun 8, 2017

Query range fields by doc values when they are expected to be more efficient than points #24823

Merged

jimczi added 2 commits June 8, 2017 14:00

apply reviews

4feaaca

jimczi force-pushed the unified_highlighting_default branch from c1e357f to 4feaaca Compare June 8, 2017 12:00

remove docs that explain the postings highlighter behavior

5d0147d

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jun 9, 2017

Deprecate postings highlighter

195395b

This change adds a deprecation warning for removal in 6.0. Relates elastic#25028

jimczi added a commit that referenced this pull request Jun 9, 2017

Postings highlighter deprecation (#25073)

1ecb1c5

This change adds a deprecation warning for removal in 6.0. Only one deprecation is logged per request Relates #25028

jimczi merged commit 8250aa4 into elastic:master Jun 9, 2017

jimczi deleted the unified_highlighting_default branch June 9, 2017 12:09

jimczi removed the review label Jun 9, 2017

colings86 added the v6.0.0-beta1 label Aug 4, 2017

colings86 removed the v6.0.0 label Aug 4, 2017

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 11, 2017

remove postings highlighter as per elastic/elasticsearch#25028

311970e

Mpdreamz mentioned this pull request Oct 11, 2017

WIP Feature/elasticsearch 6 elastic/elasticsearch-net#2870

Merged

27 tasks

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017

remove postings highlighter as per elastic/elasticsearch#25028

04cce43

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017

remove postings highlighter as per elastic/elasticsearch#25028

00517b9

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017

remove postings highlighter as per elastic/elasticsearch#25028

6259df5

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017

remove postings highlighter as per elastic/elasticsearch#25028

40b1a8a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the postings highlighter and make unified the default highlighter choice #25028

Remove the postings highlighter and make unified the default highlighter choice #25028

jimczi commented Jun 2, 2017 •

edited

Loading

clintongormley Jun 2, 2017

clintongormley left a comment

clintongormley Jun 2, 2017

clintongormley Jun 2, 2017

clintongormley Jun 2, 2017

clintongormley Jun 2, 2017

nik9000 left a comment

nik9000 Jun 2, 2017

jimczi Jun 2, 2017

nik9000 Jun 2, 2017

nik9000 Jun 2, 2017

nik9000 Jun 2, 2017

jimczi commented Jun 2, 2017

nik9000 commented Jun 2, 2017

javanna commented Jun 2, 2017

jimczi commented Jun 6, 2017

jimczi commented Jun 6, 2017

Remove the postings highlighter and make unified the default highlighter choice #25028

Remove the postings highlighter and make unified the default highlighter choice #25028

Conversation

jimczi commented Jun 2, 2017 • edited Loading

Choose a reason for hiding this comment

clintongormley left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nik9000 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jimczi commented Jun 2, 2017

nik9000 commented Jun 2, 2017

javanna commented Jun 2, 2017

jimczi commented Jun 6, 2017

jimczi commented Jun 6, 2017

jimczi commented Jun 2, 2017 •

edited

Loading