Remove the postings highlighter and make unified the default highlighter choice #25028
@@ -98,3 +98,14 @@ but the only reason why it has not been deprecated too is because it is used
for the `random_score` function. If you really need access to the id of
documents for sorting, aggregations or search scripts, the recommandation is
to duplicate the id as a field in the document.

==== Highlighers
missing t
Added a few docs comments
==== Highlighers

The `unified` highlighter is the new default choice for highlighter.
The offset strategy for each field is picked internally by this highlighter depending on the
What do you mean by the offset strategy?
Allows to highlight search results on one or more fields. The
implementation uses either the lucene `plain` highlighter, the
fast vector highlighter (`fvh`) or `postings` highlighter.
Allows to highlight search results on one or more fields.
"Highlighters allow you to produce highlighted snippets from one or more fields in your search results."

The default choice of highlighter is of type `plain` and uses the Lucene highlighter.
It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
The default choice of highlighter is of type `unified` and uses the Lucene Unified highlighter.
The `unified` highlighter (which is used by default if no highlighter type is specified) uses the Lucene Unified Highlighter.
It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting.

[float]
===== Offsets Strategy
Offsets Strategy requires some explanation, perhaps:
In order to create meaningful search snippets from the terms being queried, a highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:
- The postings list (fields mapped as `"index_options": "offsets"`).
- Term vectors (fields mapped as `"term_vector": "with_positions_offsets"`).
- The original field, by reanalysing the text on-the-fly.
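To make the three sources in the list above concrete, here is a minimal sketch of a mapping with one field per source. The field names (`body_postings`, `body_vectors`, `body_plain`) are invented for illustration; `index_options` and `term_vector` are the mapping parameters the comment refers to (the parameter is spelled `term_vector` in Elasticsearch mappings).

```python
import json

# Hypothetical field names; only the mapping parameters matter here.
properties = {
    # Offsets are stored in the postings list.
    "body_postings": {"type": "text", "index_options": "offsets"},
    # Offsets are stored in term vectors.
    "body_vectors": {"type": "text", "term_vector": "with_positions_offsets"},
    # No stored offsets: the highlighter has to re-analyze the text.
    "body_plain": {"type": "text"},
}

print(json.dumps({"properties": properties}, indent=2))
```

The third field needs no extra index storage, at the cost of re-analysis at highlight time.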
I left a few minors but LGTM. I guess next is to deprecate the postings highlighter in 5.x, right?
.endObject().endObject().endObject()
.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title and don't use term vector, now lets see if it works...
Can you keep the indentation?
Sorry about that :(
.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title, now lets see if it works...
.startObject("title")
Here too.
will use this information to highlight documents without re-analyzing the text.
It re-runs the original query directly on the postings and extracts the matching offsets
directly from the index limiting the collection to the highlighted documents.
This mode is faster since it doesn't require to reanalyze the text to be highlighted
I expect it isn't actually faster for very short strings. At least, that was my experience with the experimental highlighter.
@@ -814,6 +833,8 @@ to

[[phrase-limit]]
==== Phrase Limit

WARNING:
Missing "this is only supported by the fast vector highlighter", I think.
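For context, a request that sets `phrase_limit` would have to select the fast vector highlighter explicitly once `unified` becomes the default. A minimal sketch of such a request body follows; the field name (`title`), the query text, and the limit value are placeholders, and the field is assumed to be mapped with `"term_vector": "with_positions_offsets"` so that `fvh` can be used.

```python
import json

# Hypothetical search body: field name, query text and limit are placeholders.
search_body = {
    "query": {"match_phrase": {"title": "quick brown fox"}},
    "highlight": {
        "fields": {
            "title": {
                # phrase_limit only applies to the fast vector highlighter,
                # so the type is set explicitly instead of relying on the default.
                "type": "fvh",
                "phrase_limit": 256,
            }
        }
    },
}

print(json.dumps(search_body, indent=2))
```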
Thanks @clintongormley and @nik9000
We can make this change transparent by allowing
I'm afraid of something sneaky like that. I think we're better off deprecating in 5.x so users of the postings highlighter know that they should test with the unified highlighter. Even if it is the same code I'd prefer they know about the change rather than get some kind of surprise on upgrade.
Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type), and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors). I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.
Ok so I'll start with the deprecation in 5.x
I think it affects 5.x only where we have two options:
I am leaning toward option 2 since the
If the desired behavior for 6.x is to use the
I opened #25073 for the deprecation in 5.x
…ter choice This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`: https://issues.apache.org/jira/browse/LUCENE-7815 It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided). The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text. Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such. There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available. I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
c1e357f to 4feaaca
This change adds a deprecation warning for removal in 6.0. Relates elastic#25028
This change adds a deprecation warning for removal in 6.0. Only one deprecation is logged per request Relates #25028
* master: (53 commits) Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) [DOCS] update maxRetryTimeout in java REST client usage page ...
* master: (80 commits) Test: remove faling test that relies on merge order Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) ...
* master: (1889 commits) Test: remove faling test that relies on merge order Log checkout so SHA is known Add link to community Rust Client (elastic#22897) "shard started" should show index and shard ID (elastic#25157) await fix testWithRandomException Change BWC versions on create index response Return the index name on a create index response Remove incorrect bwc branch logic from master Correctly format arrays in output [Test] Extending parsing checks for SearchResponse (elastic#25148) Scripting: Change keys for inline/stored scripts to source/id (elastic#25127) [Test] Add test for custom requests in High Level Rest Client (elastic#25106) nested: In case of a single type the _id field should be added to the nested document instead of _uid field. `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586) fix highlighting docs Fix NPE in token_count datatype with null value (elastic#25046) Remove the postings highlighter and make unified the default highlighter choice (elastic#25028) [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144) Fix typo in shards.asciidoc (elastic#25143) List Hibernate Search (elastic#25145) ...
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815
It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remains the same as before: it checks `term_vectors` first, then `postings`, and ultimately it re-analyzes the text.
This change also rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are a few features that the `unified` highlighter is not able to handle, which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features has been added to the `unified`.
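The order in which the description says offsets are looked up (term vectors first, then postings, then re-analysis) can be sketched as a small function. This is only an illustration of that stated order, not the Elasticsearch or Lucene implementation; the function name, the return labels, and the use of a plain dict for the field mapping are assumptions made for the example.

```python
def pick_offset_source(field_mapping):
    """Return a label for the offset source, following the order in the description."""
    # 1. Term vectors with positions and offsets take priority.
    if field_mapping.get("term_vector") == "with_positions_offsets":
        return "term_vectors"
    # 2. Otherwise use offsets stored in the postings list.
    if field_mapping.get("index_options") == "offsets":
        return "postings"
    # 3. Fall back to re-analyzing the original text.
    return "reanalyze"


print(pick_offset_source({"type": "text", "index_options": "offsets"}))
```

A field that has both term vectors and postings offsets would resolve to `term_vectors` under this ordering.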