Add Functionality to Consistently Read RepositoryData For CS Updates #55773
Conversation
Using optimistic locking, add the ability to run a repository state update task with a consistent view of the current repository data. Allows for a follow-up to remove the snapshot init state. Closes elastic#55702
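For orientation, a minimal sketch of the caller-facing pattern this introduces, assuming only the executeConsistentStateUpdate signature shown in the diff below; the listener variable and the body of execute are illustrative:

```java
// Sketch only: the repository loads RepositoryData, builds the task via the
// supplied factory, and runs it against a view of the repository data that is
// consistent with the cluster state the task executes on.
repositoriesService.repository(repositoryName).executeConsistentStateUpdate(
    repositoryData -> new ClusterStateUpdateTask() {
        @Override
        public ClusterState execute(ClusterState currentState) {
            // build the new cluster state from a RepositoryData view that is
            // guaranteed not to be stale relative to this task's execution
            return currentState;
        }

        @Override
        public void onFailure(String source, Exception e) {
            listener.onFailure(e); // 'listener' is illustrative
        }
    },
    listener::onFailure);
```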
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
    if (snapshotIds.isEmpty()) {
        listener.onResponse(null);
        return;
    return new ClusterStateUpdateTask() {
This is a little less ergonomic than I would've liked, but I found it easiest to produce a CS update task in the Repository API compared to other mechanisms, because the existing code is so heavily built on using state in CS update tasks to do logic between execute and clusterStateProcessed. Doing it this way, and in the repository, allows the follow-ups to be relatively isolated changes.
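For context, a minimal sketch of the pattern the comment refers to; the field name and types are illustrative, not taken from the diff:

```java
// Existing style in the snapshot code: state computed in execute() is stashed
// in a field and consumed in clusterStateProcessed() once the update is applied.
new ClusterStateUpdateTask() {
    private SnapshotsInProgress.Entry newEntry; // illustrative field

    @Override
    public ClusterState execute(ClusterState currentState) {
        // derive newEntry from currentState and return the updated state
        return currentState;
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        // act on newEntry now that the new state has been published
    }

    @Override
    public void onFailure(String source, Exception e) {
        // surface the failure to the caller
    }
};
```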
    @Override
    public void executeConsistentStateUpdate(Function<RepositoryData, ClusterStateUpdateTask> createUpdateTask,
                                              Consumer<Exception> onFailure) {
        threadPool.generic().execute(new AbstractRunnable() {
This could technically be done quite a bit more efficiently by not forking off to the generic pool when we just get cached repository data, but I figured this is good enough. I can also back-port it to 7.x without trouble, because even in a mixed 7.x/6.x cluster a 7.x master node will always update the repo metadata when writing new repository data, so this simple solution avoids dealing with all the details of the org.elasticsearch.repositories.blobstore.BlobStoreRepository#bestEffortConsistency flag.
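A rough sketch of the deferred optimization, where cachedRepositoryDataIfFresh() and submitConsistentUpdateTask(...) are purely hypothetical helpers standing in for the real caching logic in BlobStoreRepository:

```java
// Hypothetical fast path: only fork to the generic pool when the repository
// data actually has to be loaded; otherwise reuse the cached copy directly.
final RepositoryData cached = cachedRepositoryDataIfFresh(); // hypothetical helper
if (cached != null) {
    submitConsistentUpdateTask(createUpdateTask, cached, onFailure); // hypothetical helper
} else {
    threadPool.generic().execute(new AbstractRunnable() {
        @Override
        protected void doRun() {
            getRepositoryData(ActionListener.wrap(
                repositoryData -> submitConsistentUpdateTask(createUpdateTask, repositoryData, onFailure),
                onFailure));
        }

        @Override
        public void onFailure(Exception e) {
            onFailure.accept(e);
        }
    });
}
```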
@@ -738,6 +739,63 @@ public void onFailure(Exception e) {
        assertEquals(0, snapshotInfo.failedShards());
    }

    public void testConcurrentDeletes() {
Added this test to cover the scenario of concurrent deletes causing the repository data that a delete is based on to be outdated, in addition to the coverage we already have for a concurrent snapshot + delete.
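Roughly the scenario being covered, sketched as an integration-style test; the helper names (createRepository, createFullSnapshot) are illustrative, and the committed test uses the deterministic snapshot resiliency test infrastructure rather than a live cluster:

```java
// Illustrative scenario only: two deletes race, so the second one's view of
// the RepositoryData (and its generation) would be stale without the
// consistent read added in this PR.
createRepository("test-repo", "fs");          // illustrative helper
createFullSnapshot("test-repo", "snap-1");    // illustrative helper
createFullSnapshot("test-repo", "snap-2");

final ActionFuture<AcknowledgedResponse> firstDelete =
    client().admin().cluster().prepareDeleteSnapshot("test-repo", "snap-1").execute();
final ActionFuture<AcknowledgedResponse> secondDelete =
    client().admin().cluster().prepareDeleteSnapshot("test-repo", "snap-2").execute();

// Both deletes must succeed even though they were submitted against
// overlapping repository generations.
assertAcked(firstDelete.actionGet());
assertAcked(secondDelete.actionGet());
```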
     * @param createUpdateTask function to supply cluster state update task
     * @param onFailure        error handler invoked on failure to get a consistent view of the current {@link RepositoryData}
     */
    void executeConsistentStateUpdate(Function<RepositoryData, ClusterStateUpdateTask> createUpdateTask, Consumer<Exception> onFailure);
With more and more cluster-state-related logic leaking into the repository code, I think we should eventually refactor things such that the repository only deals with reading and writing data in some form (obviously lots of tricky details here around source-only repos, S3 waiting and whatnot) and completely delegate the responsibility for this kind of magic to either the SnapshotsService or some other component. But until then, this is basically the method we need to make life a lot less complicated in the SnapshotsService.
    @Override
    public void executeConsistentStateUpdate(Function<RepositoryData, ClusterStateUpdateTask> createUpdateTask,
                                              Consumer<Exception> onFailure) {
    }
Throw UnsupportedOperationException here?
Sure :)
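i.e., roughly (a sketch of what the stub turns into after this exchange):

```java
@Override
public void executeConsistentStateUpdate(Function<RepositoryData, ClusterStateUpdateTask> createUpdateTask,
                                          Consumer<Exception> onFailure) {
    throw new UnsupportedOperationException();
}
```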
    repositoriesService.repository(repositoryName).getRepositoryData(ActionListener.wrap(repositoryData ->
        deleteCompletedSnapshots(matchingSnapshotIds(repositoryData, snapshotNames, repositoryName),
            repositoryName, repositoryData.getGenId(), Priority.NORMAL, l), l::onFailure))));
    repositoriesService.repository(repositoryName).executeConsistentStateUpdate(repositoryData ->
Could repositoriesService.repository(repositoryName) throw an exception here?
Close to impossible, but I suppose there is a slim chance of concurrently removing the repository here, so I added a catch for that :)
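A sketch of the guard discussed here; the factory and listener names are illustrative, and the exact exception handling in the committed change may differ:

```java
// Guard against the repository being removed between the request arriving and
// the consistent state update being submitted. repository(...) throws
// RepositoryMissingException if the repository no longer exists.
final Repository repository;
try {
    repository = repositoriesService.repository(repositoryName);
} catch (RepositoryMissingException e) {
    listener.onFailure(e);
    return;
}
repository.executeConsistentStateUpdate(createDeleteStateUpdateTask, listener::onFailure); // names illustrative
```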
@@ -330,6 +330,60 @@ protected void doClose() {
        }
    }

    @Override
    public void executeConsistentStateUpdate(Function<RepositoryData, ClusterStateUpdateTask> createUpdateTask,
This is using ClusterStateUpdateTask in a weird way, directly calling some methods on it but ignoring other things (e.g. priority / listeners?).
Yea, this is a little less ergonomic than I'd like it to be. Practically speaking, though, the priority is not and won't be used anywhere (as in the delete, which runs at NORMAL, and all upcoming use-cases of this), so I decided to ignore it.
The listeners I think I got right though? Since clusterStatePublished is final and empty in ClusterStateUpdateTask, and we don't yet use onNoLongerMaster in the tasks passed in, I figured: why add dead code?
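For readers following the thread, a sketch of the delegation being discussed, at the point where priority was still being ignored; method bodies beyond what the diff shows are assumptions:

```java
// The outer task forwards only the callbacks that are actually used by the
// delegate; priority is not forwarded at this point in the review, and
// clusterStatePublished / onNoLongerMaster are left at their defaults.
final ClusterStateUpdateTask updateTask = createUpdateTask.apply(repositoryData);
clusterService.submitStateUpdateTask("consistent state update", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) throws Exception {
        return updateTask.execute(currentState);
    }

    @Override
    public void onFailure(String source, Exception e) {
        updateTask.onFailure(source, e);
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        updateTask.clusterStateProcessed(source, oldState, newState);
    }
});
```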
    protected void doRun() {
        final RepositoryMetadata repositoryMetadataStart = metadata;
        getRepositoryData(ActionListener.wrap(repositoryData ->
            clusterService.submitStateUpdateTask("consistent state update", new ClusterStateUpdateTask() {
As we are just protecting against concurrent changes done on the same node, I wonder if we can avoid calling getRepositoryData here for every caller of executeConsistentStateUpdate, or do some sharing between multiple callers of this method. In particular, I'm worried about situations where it takes a long time for the normal-priority CS update task below to be executed, and where stuff might have been invalidated again and again (think e.g. about a misconfigured deletion policy). The task here also does not seem to support timeouts? (This makes me also notice that deleteCompletedSnapshots should probably use the timeout from the DeleteSnapshotRequest.)
"As we are just protecting against concurrent changes done on the same node"

Practically yes, but theoretically I think it could be across nodes. Technically we could have a delete along the lines of:
- Master reads repository data and then gets stuck for 30s or whatever
- New master takes over and completes a repository operation, then dies
- Master from step 1 is back ...

Not likely, but possible, and a scenario that SnapshotResiliencyTests could cover (they currently don't, but it might be worth adding).

"I'm worried about situations where it takes a long time for the normal-priority CS update task below to be executed, and where stuff might have been invalidated again and again (think e.g. about a misconfigured deletion policy)"

I'm not sure this is all that likely in practice, since we keep loading the RepositoryData from cache, and currently a delete does 4 state updates of which only two modify the repository metadata. So I would imagine that even if you somehow send a barrage of delete requests, those will mostly work out fine, just because it's so unlikely for these back-to-back repo metadata updates to go through that quickly (they will always require at least one additional update to the cluster state before). Not saying I'm against optimizing this (always looking for ways to not have to load too many RepositoryData instances on heap), but it seemed not strictly necessary for now (that's why I wrote https://github.com/elastic/elasticsearch/pull/55773/files#r415345654 ; we can probably already optimize this a lot by exploiting the way the cache works).

"The task here also does not seem to support timeouts? (This makes me also notice that deleteCompletedSnapshots should probably use the timeout from the DeleteSnapshotRequest)."

Right, we haven't ever used those timeouts in deletes I think ... maybe something to fix in a follow-up, since it'll require a few changes here and there because we currently don't pass the timeout from the request to the SnapshotsService?
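To make the optimistic-locking part of this thread concrete, here is a sketch of the check-and-retry idea; getRepoMetadata(...) is an assumed helper and the exact retry mechanics are not visible in the excerpt above:

```java
// Remember the repository metadata at read time; when the task executes,
// re-check that no other operation advanced the repository in the meantime.
// If it did, skip the delegate task and start over with fresh RepositoryData.
final RepositoryMetadata repositoryMetadataStart = metadata;
getRepositoryData(ActionListener.wrap(repositoryData -> {
    final ClusterStateUpdateTask updateTask = createUpdateTask.apply(repositoryData);
    clusterService.submitStateUpdateTask("consistent state update", new ClusterStateUpdateTask() {
        private boolean executedTask = false;

        @Override
        public ClusterState execute(ClusterState currentState) throws Exception {
            if (repositoryMetadataStart.equals(getRepoMetadata(currentState))) { // assumed helper
                executedTask = true;
                return updateTask.execute(currentState);
            }
            return currentState; // stale view: leave the state untouched and retry below
        }

        @Override
        public void onFailure(String source, Exception e) {
            if (executedTask) {
                updateTask.onFailure(source, e);
            } else {
                onFailure.accept(e);
            }
        }

        @Override
        public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
            if (executedTask) {
                updateTask.clusterStateProcessed(source, oldState, newState);
            } else {
                // repository advanced concurrently: retry the whole read + update
                executeConsistentStateUpdate(createUpdateTask, onFailure);
            }
        }
    });
}, onFailure));
```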
Opened #55798 for the timeout thing (was a small change after all ...)
All addressed I think. Let me know if you want to make it more efficient in this iteration already :)
Two more comments
    protected void doRun() {
        final RepositoryMetadata repositoryMetadataStart = metadata;
        getRepositoryData(ActionListener.wrap(repositoryData ->
            clusterService.submitStateUpdateTask("consistent state update", new ClusterStateUpdateTask() {
"consistent state update"
. Can you pass in a more descriptive state update message?
Also, can we add the timeout here now?
Sure, added the timeout and we're passing a source now; I also took the opportunity to pass the task priority along.
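Sketched, under the assumption that the source string becomes a parameter and that timeout() can be overridden on the wrapper task (the exact mechanics are not visible in this thread):

```java
// Forward source, priority and timeout from the delegate task instead of
// hard-coding "consistent state update" and the default priority.
final ClusterStateUpdateTask updateTask = createUpdateTask.apply(repositoryData);
clusterService.submitStateUpdateTask(source, new ClusterStateUpdateTask(updateTask.priority()) {
    @Override
    public TimeValue timeout() {
        return updateTask.timeout(); // assumed overridable here
    }

    @Override
    public ClusterState execute(ClusterState currentState) throws Exception {
        return updateTask.execute(currentState);
    }

    @Override
    public void onFailure(String source, Exception e) {
        updateTask.onFailure(source, e);
    }
});
```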
LGTM
Thanks Yannick!
Making use of elastic#55773 to simplify the snapshot state machine. 1. Deletes with no in-progress snapshot now add the delete entry to the cluster state right away instead of doing a second CS update after the first update was a NOOP. 2. If a bulk delete matches in-progress as well as completed snapshots, abort the in-progress snapshot and then move on to delete from the repository.
…lastic#55773) Using optimistic locking, add the ability to run a repository state update task with a consistent view of the current repository data. Allows for a follow-up to remove the snapshot INIT state.
With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine. This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
Using (sort of) optimistic locking, add the ability to run a repository state update task with a consistent view of the current repository data.
Fixes issues where deleting a snapshot is enqueued based on an outdated repository generation due to concurrent snapshot or delete operations.
Allows for some follow-ups:
- Removing the snapshot INIT state, because its only use was adding the snapshot to the cluster state so that no concurrent changes to the repository could occur before reading the repository data.
- There's a bunch of INIT state corner case handling in SnapshotsService that can all nicely go away in a follow-up using this change.

Closes #55702