Shortcut query phase using the results of other shards #51852

jimczi · 2020-02-04T10:07:33Z

This commit, built on top of #51708, allows shard search requests to be modified based on information collected on other shards. It is intended to speed up sorted queries on time-based indices.

For queries that are only interested in the top documents, this change rewrites the shard query to match none if the bottom sort value computed in prior shards is better than all values in the shard.
For queries that mix top documents and aggregations, this change resets the size of the top documents to 0 instead of rewriting to match none.
This means that we don't need to keep a search context open for this shard, since we know in advance that it doesn't contain any competitive hit.
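
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `ShardShortcutSketch`, the `Action` enum and the `decide` method are hypothetical names, sort values are simplified to `long`, and ties are assumed to remain competitive.

```java
// Hypothetical model of the decision described above; none of these names are Elasticsearch APIs.
final class ShardShortcutSketch {
    enum Action { RUN_AS_IS, RESET_SIZE_TO_ZERO, MATCH_NONE }

    /**
     * @param bottom   worst sort value still in the global top N, computed from prior shards
     * @param shardMin smallest sort value present in this shard
     * @param shardMax largest sort value present in this shard
     */
    static Action decide(long bottom, long shardMin, long shardMax,
                         boolean descending, boolean hasAggregations) {
        // The best value this shard could contribute depends on the sort order.
        long shardBest = descending ? shardMax : shardMin;
        // Assumption: only a strictly better bottom value rules the shard out; ties still run.
        boolean bottomIsBetter = descending ? bottom > shardBest : bottom < shardBest;
        if (!bottomIsBetter) {
            return Action.RUN_AS_IS;
        }
        // Aggregations still need to see every matching document, so only the hits are dropped.
        return hasAggregations ? Action.RESET_SIZE_TO_ZERO : Action.MATCH_NONE;
    }
}
```

For example, with a descending sort, a global bottom of 100 and a shard whose values span 10 to 50, the sketch returns `MATCH_NONE`, or `RESET_SIZE_TO_ZERO` when the request also carries aggregations.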

Closes #49601

This change ensures that the rewrite of the shard request is executed in the network thread or in the refresh listener when waiting for an active shard. This allows queries that rewrite to match_no_docs to bypass the search thread pool entirely even if the can_match phase was skipped (pre_filter_shard_size > number of shards). Coordinating nodes don't have the ability to create empty responses, so this change also ensures that at least one shard creates a full empty response while the others can return null ones. This is needed because creating true empty responses on shards requires creating concrete aggregators, which would be too costly to build on a network thread. We should move this functionality to aggregation builders in a follow-up, but that would be a much bigger change. This change is also important for elastic#49601 since we want to add the ability to use the results of other shards to rewrite the requests of subsequent ones. For instance, if the first M shards have their top N computed, the worst document in the global queue can be passed to subsequent shards, which can then rewrite to match_no_docs if they can guarantee that they don't have any document better than the provided one.
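
As a rough illustration of the "first M shards have their top N computed" example, the sketch below merges per-shard sort values into a bounded queue and exposes the worst value that is still competitive. The class `GlobalBottomSketch` and the method `bottomOfTopN` are hypothetical names, the sort is assumed to be descending on `long` values, and no bottom is reported until N values have been collected, since any document is still competitive before that point.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.PriorityQueue;

// Hypothetical helper, not an Elasticsearch class.
final class GlobalBottomSketch {
    /**
     * Returns the worst value in the merged top N (descending sort),
     * or empty if fewer than N values have been seen so far.
     */
    static OptionalLong bottomOfTopN(List<long[]> shardTopValues, int n) {
        if (n <= 0) {
            return OptionalLong.empty();
        }
        // Min-heap: the head is the worst value currently kept in the top N.
        PriorityQueue<Long> queue = new PriorityQueue<>();
        for (long[] values : shardTopValues) {
            for (long value : values) {
                if (queue.size() < n) {
                    queue.add(value);
                } else if (value > queue.peek()) {
                    queue.poll();
                    queue.add(value);
                }
            }
        }
        return queue.size() == n ? OptionalLong.of(queue.peek()) : OptionalLong.empty();
    }
}
```

A subsequent shard whose maximum value is lower than the returned bottom cannot contribute a competitive hit under these assumptions.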

elasticmachine · 2020-02-04T10:07:35Z

Pinging @elastic/es-distributed (:Distributed/Distributed)

elasticmachine · 2020-02-04T10:07:37Z

Pinging @elastic/es-search (:Search/Search)

mayya-sharipova

Overall, LGTM, I just left a couple of comments.

mayya-sharipova · 2020-02-10T15:38:16Z

server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java

@@ -72,4 +85,38 @@ protected void onShardGroupFailure(int shardIndex, Exception exc) {
    protected SearchPhase getNextPhase(final SearchPhaseResults<SearchPhaseResult> results, final SearchPhaseContext context) {
        return new FetchSearchPhase(results, searchPhaseController, context);
    }
+
+    ShardSearchRequest rewriteShardRequest(ShardSearchRequest request) {


Not relevant to this PR, but in the future, do we also want to rewrite requests without a sort (e.g. a keyword search) that can be shortcut?

I am not sure I follow. Are you talking about handling queries sorted by _score? We can probably propagate the global min competitive score up to the query collector, so that wouldn't require any rewrite.

server/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java

server/src/main/java/org/elasticsearch/search/internal/ShardSearchRequest.java

This change adapts the serialization checks to 7.7.0 in order to cope with #53659. Note that this commit also disables the bwc tests temporarily in order to be able to merge #53659 first. Relates #51852

This commit disables the sort optimization added in elastic#51852 for scroll requests. Scroll queries keep a state per shard, so we cannot modify the request on the first round (submit). This bug was introduced in unreleased versions, which is why this PR is marked as a non-issue.

Collapse search queries that sort by a field can throw an ArrayStoreException due to a bug in the [sort optimization](elastic#51852) introduced in 7.7.0. Collapsed searches were never supposed to be eligible for this sort optimization, so this change explicitly excludes them from it.
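
Taken together, the two follow-up fixes narrow the set of requests the optimization applies to. A minimal sketch of such a guard, assuming a hypothetical `eligible` method with boolean inputs (the real check may consider more conditions than the ones mentioned in these commit messages):

```java
// Hypothetical guard; the parameter names are illustrative, not Elasticsearch APIs.
final class SortOptimizationEligibilitySketch {
    static boolean eligible(boolean sortsByField, boolean isScroll, boolean hasCollapse) {
        // Scroll requests keep per-shard state, and collapsed searches hit the
        // ArrayStoreException described above, so both are excluded.
        return sortsByField && !isScroll && !hasCollapse;
    }
}
```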

Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves the min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are queried in that order. This allows incremental results to be exposed in that same order when using async search, and enables optimizations built on top of this behaviour (elastic#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which leverages info from the block tree that caches min and max. Otherwise the values are always retrieved from the index, which does I/O and slows the can_match phase down.
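
The shard ordering step can be pictured with the sketch below. It is an illustration only: `ShardOrderSketch` and `ShardRange` are hypothetical names, values are simplified to `long`, and the choice of comparing by min for ascending sorts and by max for descending sorts is an assumption about how the min and max are used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of ordering shards by their field range; not Elasticsearch code.
final class ShardOrderSketch {
    static final class ShardRange {
        final String shard;
        final long min;
        final long max;

        ShardRange(String shard, long min, long max) {
            this.shard = shard;
            this.min = min;
            this.max = max;
        }
    }

    /** Returns shard names in the order they would be queried. */
    static List<String> order(List<ShardRange> shards, boolean descending) {
        // Assumption: descending sorts visit the shard with the largest max first,
        // ascending sorts visit the shard with the smallest min first.
        Comparator<ShardRange> comparator = descending
            ? (a, b) -> Long.compare(b.max, a.max)
            : (a, b) -> Long.compare(a.min, b.min);
        List<ShardRange> sorted = new ArrayList<>(shards);
        sorted.sort(comparator);
        List<String> names = new ArrayList<>();
        for (ShardRange range : sorted) {
            names.add(range.shard);
        }
        return names;
    }
}
```

Querying the most promising shards first is what lets later shards be skipped: the global bottom value becomes tight early, so shards whose range falls entirely outside it can be rewritten to match none.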

… and geonames tracks

We recently found a regression that affected searches sorted by a keyword field (elastic/elasticsearch#92026). Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks. Geonames is a good base, but it's good to also make the new challenges part of the many-shards benchmarks, as the differences become visible when a large number of shards is involved in a query. This commit also adds specific challenges to verify the effect of elastic/elasticsearch#51852 when a search is sorted by a numeric or timestamp field.

… and geonames tracks (#357)

We recently found a regression that affected searches sorted by a keyword field (elastic/elasticsearch#92026). Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks. We will also want to add similar challenges to the many-shards benchmarks, as the regressions we found can be seen with more than a couple of shards. This commit also adds specific challenges to verify the effect of elastic/elasticsearch#51852 when a search is sorted by a numeric or timestamp field.

jimczi added 18 commits January 30, 2020 22:52

add serialization test

534b552

iter

c02f352

Merge branch 'master' into rewrite_shard_request_no_rejection

010ec08

fix bwc issue

f5684ec

address review

0acf244

adapt test

6016fa4

fix test

a058127

fix topNSize when size is reset to 0

8534ed2

add more comments

27cdf19

Merge branch 'master' into rewrite_shard_request_no_rejection

a313d1d

more review

662972c

initial commit

76e90a2

Merge branch 'master' into distributed_time_sort

7eb98fb

iter

ac0451c

iter

d0ae658

Merge branch 'rewrite_shard_request_no_rejection' into distributed_time_sort

c6747e6

iter2

30b3bcb

jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 4, 2020

jimczi mentioned this pull request Feb 4, 2020

Change the default batched_reduce_size of search requests #51857

Open

jimczi added 5 commits February 4, 2020 11:39

remove unrelated change

961c2cd

Merge branch 'master' into distributed_time_sort

d04a16d

fix last merge

af20421

fix rest test

291f742

another fix

71876e4

mayya-sharipova approved these changes Feb 10, 2020

jimczi mentioned this pull request Mar 18, 2020

Disable distributed sort optimization on scroll requests #53759

Merged

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

jimczi mentioned this pull request Aug 6, 2020

Disable sort optimization on search collapsing #60838

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

javanna mentioned this pull request Nov 30, 2022

Avoid doing I/O when fetching min and max for keyword fields #92026

Merged

javanna mentioned this pull request Dec 19, 2022

Add sort by field with and without can match challenge to http-logs and geonames tracks elastic/rally-tracks#357

Merged

javanna mentioned this pull request Jan 12, 2023

[7.17] Avoid doing I/O when fetching min and max for keyword fields (#92026) #92865

Merged
