Introduce point in time APIs in x-pack basic #61062

Merged on Aug 25, 2020 (19 commits)

dnhatn (Member) commented Aug 12, 2020

This commit introduces a new API that manages point-in-times in x-pack basic. An Elasticsearch point in time (PIT) is a lightweight view into the state of the data as it existed when the PIT was opened. By default, a search request executes against the most recent point in time. In some cases, it is preferable to perform multiple search requests against the same point in time. For example, if refreshes happen between search_after requests, the results of those requests might not be consistent, because changes that happen between searches are only visible to the more recent point in time.

A point in time must be opened before being used in search requests. The keep_alive parameter tells Elasticsearch how long it should keep a point in time around.

POST /my_index/_pit?keep_alive=1m

The response from the above request includes an id, which should be passed as the id of the pit parameter in search requests.

POST /_search
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "pit": {
        "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
        "keep_alive": "1m"
    }
}
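
As a rough illustration of how a client might page through results with search_after while pinning every request to one point in time, here is a minimal Python sketch. It is not part of this change. The `search` callable is a hypothetical stand-in for whatever sends the body to `POST /_search`, and the `pit_id` field read from the response is an assumption about the response shape, not something this description states. The sort is supplied by the caller and is assumed to produce a unique value per document.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical transport: takes a search body, returns the parsed JSON response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_pit_search_body(
    pit_id: str,
    query: Dict[str, Any],
    sort: List[Dict[str, Any]],
    size: int,
    keep_alive: str = "1m",
    search_after: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a search body that targets a point in time instead of an index."""
    body: Dict[str, Any] = {
        "size": size,
        "query": query,
        "sort": sort,
        # Same shape as the "pit" parameter in the request above.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iterate_hits(
    search: SearchFn,
    pit_id: str,
    query: Dict[str, Any],
    sort: List[Dict[str, Any]],
    size: int = 100,
    keep_alive: str = "1m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, one page at a time, against the same point in time."""
    search_after: Optional[List[Any]] = None
    while True:
        response = search(
            build_pit_search_body(pit_id, query, sort, size, keep_alive, search_after)
        )
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit become the cursor for the next page.
        search_after = hits[-1]["sort"]
        # Assumption: if the response carries a refreshed id, prefer it.
        pit_id = response.get("pit_id", pit_id)
```

Because each page is read from the same point in time, refreshes that happen between two pages should not shift documents across page boundaries.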

Point-in-times are automatically closed when the keep_alive has elapsed. However, keeping point-in-times open has a cost; hence, they should be closed as soon as they are no longer used in search requests.

DELETE /_pit
{
    "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}
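
One way a client could make sure a point in time is always closed, even when a search fails, is to wrap the open and close calls in a context manager. The Python sketch below is only an illustration under stated assumptions: `send` is a hypothetical transport function taking a method, a path, and an optional body, and the only response field it relies on is the `id` returned when the point in time is opened.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical transport: (method, path, body) -> parsed JSON response.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


@contextmanager
def point_in_time(send: Transport, index: str, keep_alive: str = "1m") -> Iterator[str]:
    """Open a point in time on `index` and close it when the block exits."""
    opened = send("POST", f"/{index}/_pit?keep_alive={keep_alive}", None)
    pit_id = opened["id"]
    try:
        yield pit_id
    finally:
        # Close eagerly instead of waiting for keep_alive to elapse.
        send("DELETE", "/_pit", {"id": pit_id})
```

A caller would write `with point_in_time(send, "my_index") as pit_id:` and issue its searches inside the block; the `finally` clause issues the `DELETE /_pit` request whether the block exits normally or with an exception.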

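Notable works in this change:

  • Move the search state to the coordinating node: #52741
  • Allow searches with a specific reader context: #53989
  • Add the ability to acquire readers in IndexShard: #54966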

Relates #46523
Relates #26472

Co-authored-by: Jim Ferenczi <[email protected]>
@dnhatn dnhatn added the :Search/Search Search-related issues that do not fall into other categories label Aug 12, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Aug 12, 2020
@dnhatn dnhatn requested a review from jimczi August 12, 2020 19:17
@dnhatn dnhatn changed the title Introduce search context Introduce point in time APIs Aug 17, 2020
@dnhatn dnhatn changed the title Introduce point in time APIs Introduce point in time APIs in x-pack basic Aug 17, 2020
dnhatn (Member, Author) commented Aug 20, 2020

@albertzaharovits @jimczi Thank you for the discussion today. I've pushed:

  • 8a7eb6a to override the indices of a search request using the resolved indices from the point in time parameter.
  • fa4b4c6 to restore the current behavior of SecuritySearchOperationListener.

jimczi (Contributor) left a comment

LGTM, although we don't handle aliases that have different permissions than their backing indices. We discussed this with @albertzaharovits and the @elastic/es-security team and agreed that we'll throw an error if we detect an alias with different security settings during the creation of the PIT. That doesn't mean that we'll never fix this for PIT, but we prefer to consider it a feature since the future of aliases in Security is not settled yet (we may remove this functionality in 8).

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
This commit introduces a new API that manages point-in-times in x-pack
basic. An Elasticsearch point in time (PIT) is a lightweight view into
the state of the data as it existed when the PIT was opened. By default,
a search request executes against the most recent point in time. In some
cases, it is preferable to perform multiple search requests against the
same point in time. For example, if refreshes happen between
search_after requests, the results of those requests might not be
consistent, because changes that happen between searches are only
visible to the more recent point in time.

A point in time must be opened before being used in search requests. The
`keep_alive` parameter tells Elasticsearch how long it should keep a
point in time around.

```
POST /my_index/_pit?keep_alive=1m
```

The response from the above request includes an `id`, which should be
passed as the `id` of the `pit` parameter in search requests.

```
POST /_search
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "pit": {
        "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
        "keep_alive": "1m"
    }
}
```

Point-in-times are automatically closed when the `keep_alive` has
elapsed. However, keeping point-in-times open has a cost; hence, they
should be closed as soon as they are no longer used in search requests.

```
DELETE /_pit
{
    "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}
```

#### Notable works in this change:

- Move the search state to the coordinating node: elastic#52741
- Allow searches with a specific reader context: elastic#53989
- Add the ability to acquire readers in IndexShard: elastic#54966

Relates elastic#46523
Relates elastic#26472

Co-authored-by: Jim Ferenczi <[email protected]>
dnhatn pushed a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
Today, some uncaught shard failures, such as RejectedExecutionException, skip the release of the shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
To avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure
occurs locally. The shard search context will no longer exist in subsequent scroll requests, which will lead to consistent shard failures
in the responses.
This change also modifies the retry tests of the reindex feature. Reindex retries scroll search requests that contain a shard failure and
moves on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data, as explained above.
This means that reindex will now report scroll failures when search rejections happen during the operation instead of skipping documents
silently.
Finally, this change removes an old TODO that was fulfilled by elastic#61062.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
This commit integrates point in time into async search and
ensures that it works correctly with security enabled.

Relates elastic#61062
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
This commit integrates point in time into cross cluster search.

Relates elastic#61062
Closes elastic#61790
dnhatn added a commit that referenced this pull request Sep 10, 2020
dnhatn added a commit that referenced this pull request Sep 10, 2020
dnhatn pushed a commit that referenced this pull request Sep 10, 2020
dnhatn added a commit that referenced this pull request Sep 10, 2020
dnhatn added a commit that referenced this pull request Sep 10, 2020
dnhatn added a commit that referenced this pull request Sep 11, 2020
dnhatn added a commit that referenced this pull request Jan 9, 2021
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 10, 2021
dnhatn added a commit that referenced this pull request Jan 11, 2021
Rupesh282 commented Oct 16, 2023

Hi @dnhatn, I want to do deep pagination using a query and sort on a couple of indices in my Elasticsearch cluster.
I am trying to see whether search_after has any advantage over scroll in this case. Since I need stateful data, I am using PIT (point in time) with search_after.
Here is what I have found so far by reading the code:

  • There is no major difference between the context stored for a PIT and the context stored for a scroll; the scroll context just carries some more information about the incoming request.
    Class references:
    • org.elasticsearch.search.internal.LegacyReaderContext
    • org.elasticsearch.search.internal.ReaderContext
  • Both scroll and search_after use sorting so that each shard's priority queue holds minimal data (O(size)).

I want to do deep pagination in which from+size can go beyond 100k (size will always be 5k; only from increases).

Your opinion on this would be appreciated. Thanks!

Labels
>enhancement release highlight :Search/Search Search-related issues that do not fall into other categories v7.10.0 v8.0.0-alpha1