Introduce point in time APIs in x-pack basic #61062

dnhatn · 2020-08-12T18:59:26Z

This commit introduces a new API that manages points in time in x-pack basic. An Elasticsearch point in time (PIT) is a lightweight view into the state of the data as it existed when the point in time was initiated. By default, a search request executes against the most recent point in time. In some cases, it is preferable to perform multiple search requests using the same point in time. For example, if refreshes happen between search_after requests, the results of those requests might not be consistent, because changes that happen between searches are only visible to the more recent point in time.

A point in time must be opened before being used in search requests. The keep_alive parameter tells Elasticsearch how long it should keep a point in time around.

POST /my_index/_pit?keep_alive=1m

The response from the above request includes an id, which should be passed as the id of the pit parameter of search requests.

POST /_search
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "pit": {
        "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
        "keep_alive": "1m"
    }
}

Point-in-times are automatically closed when the keep_alive has elapsed. However, keeping point-in-times open has a cost; hence, they should be closed as soon as they are no longer used in search requests.

DELETE /_pit
{
    "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}
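The three requests above (open, search, close) can be tied together in a small client-side helper. The following is only a sketch: the `send` callable and the `search_with_pit` function are hypothetical names introduced for illustration and stand in for whatever HTTP client is used; only the paths, the `keep_alive` parameter, and the `pit` body come from the description above.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: (method, path, query params, JSON body) -> parsed JSON response.
Transport = Callable[
    [str, str, Optional[Dict[str, str]], Optional[Dict[str, Any]]],
    Dict[str, Any],
]


def search_with_pit(send: Transport, index: str, query: Dict[str, Any],
                    keep_alive: str = "1m") -> Dict[str, Any]:
    """Open a point in time, run one search against it, and always close it."""
    # POST /<index>/_pit?keep_alive=<keep_alive> returns the point in time id.
    pit_id = send("POST", f"/{index}/_pit", {"keep_alive": keep_alive}, None)["id"]
    try:
        # The search goes to /_search without an index in the path;
        # the point in time id is passed in the "pit" section of the body.
        body = {"query": query, "pit": {"id": pit_id, "keep_alive": keep_alive}}
        return send("POST", "/_search", None, body)
    finally:
        # Close explicitly instead of waiting for keep_alive to elapse.
        send("DELETE", "/_pit", None, {"id": pit_id})
```

Wrapping the search in `try`/`finally` reflects the advice above: the point in time is released as soon as it is no longer needed, even if the search raises.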

Notable works in this change:

Move the search state to the coordinating node: Move states of search to coordinating node #52741
Allow searches with a specific reader context: Allow searches with specific reader contexts #53989
Add the ability to acquire readers in IndexShard: Adds the ability to acquire readers in IndexShard #54966

Relates #46523
Relates #26472

Co-authored-by: Jim Ferenczi <[email protected]>

elasticmachine · 2020-08-12T19:16:54Z

Pinging @elastic/es-search (:Search/Search)

dnhatn · 2020-08-20T21:51:15Z

@albertzaharovits @jimczi Thank you for the discussion today. I've pushed:

8a7eb6a to override the indices of a search request using the resolved indices from the point in time parameter.
fa4b4c6 to restore the current behavior of SecuritySearchOperationListener.

jimczi

LGTM, although we don't handle aliases that have different permissions than their backing indices. We discussed this with @albertzaharovits and @elastic/es-security and agreed that we'll throw an error if we detect an alias with different security during the creation of the PIT. That doesn't mean that we'll never fix this for PIT, but we prefer to treat it as a separate feature since the future of aliases in Security is not settled yet (we may remove this functionality in 8).

Today, some uncaught shard failures, such as RejectedExecutionException, skip the release of the shard context and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced, this behavior can lead to missing data, since scrolls always move forward. To avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure occurs locally. The shard search context will no longer exist in subsequent scroll requests, which will lead to consistent shard failures in the responses. This change also modifies the retry tests of the reindex feature. Reindex retries a scroll search request that contains a shard failure and moves on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data, as explained above. As a result, reindex will now report scroll failures when search rejections happen during the operation instead of silently skipping documents. Finally, this change removes an old TODO that was fulfilled with elastic#61062.

Rupesh282 · 2023-10-16T23:18:55Z

Hi @dnhatn, I want to do deep pagination using a query and a sort on a couple of indices in my ES cluster.
I am trying to see whether search_after has any advantage over scroll in this case. As I need stateful data, I am using PIT (point in time) with search_after.
Things I have found so far by reading the code:

- There is no major difference between the context stored for a PIT and the context stored for a scroll; the scroll context just holds some more information about the incoming request. Class references:
  - org.elasticsearch.search.internal.LegacyReaderContext
  - org.elasticsearch.search.internal.ReaderContext
- Both scroll and search_after use sorting to hold minimal data (O(size)) in each shard's priority queue.

I want to do deep pagination in which from+size can go beyond 100k (size will always be 5k; only from keeps increasing).

Your opinion on this would be appreciated. Thanks!
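For reference, the PIT plus search_after pattern asked about above can be sketched as a loop that replaces an increasing `from` with the sort values of the last hit of the previous page. This is a minimal sketch under stated assumptions, not code from this PR: `paginate_with_pit` and the `search` callable are hypothetical names, the sort is assumed to produce a unique total order (for example by including a tiebreaker field), and reading an updated `pit_id` from the response is an assumption about the response shape.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def paginate_with_pit(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                      pit_id: str,
                      query: Dict[str, Any],
                      sort: List[Any],
                      size: int = 5000,
                      keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit of a sorted query, one page of `size` hits at a time."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": size,
            "query": query,
            "sort": sort,
            # Every page runs against the same point in time.
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)  # hypothetical: sends the body to POST /_search
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Continue from the sort values of the last hit instead of using `from`.
        search_after = hits[-1]["sort"]
        # Assumption: the response may carry an updated id; fall back to the old one.
        pit_id = response.get("pit_id", pit_id)
```

Because each request only carries the sort values of the previous page, the request size stays constant no matter how deep the pagination goes.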

Introduce search context

314a596

Co-authored-by: Jim Ferenczi <[email protected]>

dnhatn added the :Search/Search label Aug 12, 2020

elasticmachine added the Team:Search label Aug 12, 2020

dnhatn added the >enhancement, v8.0.0, v7.10.0, and release highlight labels and removed the Team:Search label Aug 12, 2020

dnhatn requested a review from jimczi August 12, 2020 19:17

dnhatn added 5 commits August 12, 2020 15:36

changes

d92aa92

fix doc

b44dc0a

Merge branch 'master' into search-context

6ed217a

search context -> point in time

136bf34

fix rest test

43b9dad

dnhatn changed the title ~~Introduce search context~~ Introduce point in time APIs Aug 17, 2020

dnhatn changed the title ~~Introduce point in time APIs~~ Introduce point in time APIs in x-pack basic Aug 17, 2020

dnhatn added 9 commits August 19, 2020 17:16

validate only scroll and multi session point in time

dcbe3e3

handle point-in-time related request in RBACEngine

6453754

Merge branch 'master' into search-context

ff38a8a

combine scroll and point-in-time

af02abf

Revert "combine scroll and point-in-time"

d4d902b

pass NamedWriteableRegistry to NodeClient

2ba2113

override indices of search request if point-in-time specified

8a7eb6a

validate scroll request only

fa4b4c6

Merge branch 'master' into search-context

12429ac

dnhatn added 2 commits August 20, 2020 18:11

fix test

2897660

fix test

0a454de

jimczi approved these changes Aug 24, 2020

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020

Support point in time in async_search (elastic#61560)

61d66fe

This commit integrates point in time into async search and ensures that it works correctly with security enabled. Relates elastic#61062

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020

Support point in time cross cluster search (elastic#61827)

58b498e

This commit integrates point in time into cross cluster search. Relates elastic#61062 Closes elastic#61790

dnhatn added a commit that referenced this pull request Sep 10, 2020

Disable BWC to backport point in time to 7.10

eaf4ce2

Relates #61062 Relates #61872

dnhatn added a commit that referenced this pull request Sep 10, 2020

Support point in time in async_search (#61560)

035f063

This commit integrates point in time into async search and ensures that it works correctly with security enabled. Relates #61062

dnhatn added a commit that referenced this pull request Sep 10, 2020

Support point in time cross cluster search (#61827)

aafb2cb

This commit integrates point in time into cross cluster search. Relates #61062 Closes #61790

dnhatn removed the backport pending label Sep 10, 2020

dnhatn mentioned this pull request Sep 11, 2020

Adjust BWC after backporting point in time to 7.10 #62262

Merged

dnhatn added a commit that referenced this pull request Sep 11, 2020

Adjust BWC after backporting point in time to 7.10 (#62262)

7b1ab6f

Relates #61062

droberts195 mentioned this pull request Sep 30, 2020

[ML] Consider using search_after instead of scroll in datafeeds #29781

Open

weltenwort mentioned this pull request Oct 2, 2020

[Logs UI] Use point-in-time reader for consistent log entry fetching elastic/kibana#79262

Open

Ryan3435 mentioned this pull request Nov 13, 2020

Add Point in Time API functionality olivere/elastic#1433

Closed

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

jakelandis mentioned this pull request Dec 2, 2020

Very large scroll search (i.e. reindex) can gradually slow down #65780

Closed

dnhatn mentioned this pull request Jan 6, 2021

Retry point in time on other copy when possible #66713

Merged

dnhatn added a commit that referenced this pull request Jan 9, 2021

Retry point in time on other copy when possible (#66713)

59082c0

Relates #61062

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 10, 2021

Retry point in time on other copy when possible (elastic#66713)

f89b863

Relates elastic#61062

dnhatn mentioned this pull request Jan 10, 2021

Retry point in time on other copy when possible #67224

Merged

dnhatn added a commit that referenced this pull request Jan 11, 2021

Retry point in time on other copy when possible (#66713)

806b6eb

Relates #61062

dnhatn mentioned this pull request Apr 6, 2021

Elasticsearch queries with scroll stopped working #56202

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
