Finalize all snapshots completed by shard snapshot updates #105245
Conversation
Today, when processing a batch of `ShardSnapshotUpdate` tasks, each update's listener considers whether the corresponding snapshot has completed and, if so, enqueues it for finalization. This is somewhat inefficient, since we may be processing many shard snapshot updates for the same few snapshots, but there's no need to check each snapshot for completion more than once. It's also insufficient, since the completion of a shard snapshot may cause the completion of subsequent snapshots too (e.g. they can go from state `QUEUED` straight to `MISSING`). This commit detaches the completion handling from the individual shard snapshot updates and instead makes sure that any snapshot that reaches a completed state is enqueued for finalization.

Closes elastic#104939
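A rough, hypothetical sketch of the idea (simplified stand-in types, not the real `SnapshotsService`/`SnapshotsInProgress` classes): instead of having every update's listener re-check its own snapshot, the batch executor can scan the resulting entries once and enqueue each snapshot that newly reached a completed state, however it got there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-in for a snapshot entry; illustration only.
record SnapshotEntry(String snapshot, boolean completed) {}

final class BatchCompletionSketch {
    /**
     * Scan the entries produced by one batch of shard snapshot updates and return every
     * snapshot that is completed now but was not completed before the batch, so it is
     * enqueued for finalization exactly once regardless of how many updates touched it
     * or whether it completed via a cascading state change.
     */
    static List<String> snapshotsToFinalize(List<SnapshotEntry> before, List<SnapshotEntry> after) {
        Set<String> alreadyCompleted = before.stream()
            .filter(SnapshotEntry::completed)
            .map(SnapshotEntry::snapshot)
            .collect(Collectors.toSet());
        List<String> toFinalize = new ArrayList<>();
        for (SnapshotEntry entry : after) {
            if (entry.completed() && alreadyCompleted.contains(entry.snapshot()) == false) {
                toFinalize.add(entry.snapshot());
            }
        }
        return toFinalize;
    }
}
```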
Hi @DaveCTurner, I've created a changelog YAML for you.

Pinging @elastic/es-distributed (Team:Distributed)
```
@@ -1230,6 +1242,142 @@ public void testRunConcurrentSnapshots() {
        }
    }

    public void testSnapshotCompletedByNodeLeft() {
```
Rather than adding a test that reliably reproduces the failure in #104939 I've elected to write this more general test with enough randomisation to cover a variety of similar situations too. It does reproduce the failure in #104939 much more frequently than `SnapshotStressTestsIT`, but even so it might take a few hundred iterations to hit the problem.
The changes look fine on their own, but I could use some help to understand how the original code led to the CI failure. Thanks!
```java
for (final var taskContext : batchExecutionContext.taskContexts()) {
    if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
        final var ref = onCompletionRefs.acquire();
        needsReroute = true;
```
Nit: I think `needsReroute` must be `true` now since this is inside the `updatesByRepo.isEmpty() == false` check. We may not even need it since `completionHandler` is only called when there is some update.
Ah nice, good point. I found that flag irksome indeed, dropped in a0db49e.
I wrote a long comment only to realize that I did not send it out when posting the overall comment ... I'll post it again ...
```java
newEntries.add(newEntry);
if (newEntry != entry && newEntry.state().completed()) {
    newlyCompletedEntries.add(newEntry);
```
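For context, a hedged reading of this hunk (my gloss, not the author's): assuming the entries are immutable and an update that changes nothing returns the same instance, the reference comparison `newEntry != entry` is a cheap "this batch actually changed the entry" check, and combining it with `completed()` picks up snapshots that completed as a side effect of the batch, including cascading completions. A tiny self-contained illustration of that pattern:

```java
import java.util.List;

// Hypothetical immutable entry: withShardDone() returns `this` when nothing changes,
// mirroring the copy-on-write style the hunk appears to rely on (an assumption here).
record Entry(String snapshot, int pendingShards) {
    Entry withShardDone() {
        return pendingShards == 0 ? this : new Entry(snapshot, pendingShards - 1);
    }

    boolean completed() {
        return pendingShards == 0;
    }
}

final class TransitionDetectionSketch {
    public static void main(String[] args) {
        for (Entry entry : List.of(new Entry("snap-a", 1), new Entry("snap-b", 0))) {
            Entry newEntry = entry.withShardDone();
            // Only snap-a is *newly* completed: snap-b was already complete, so the update
            // returned the same instance and the reference check filters it out.
            if (newEntry != entry && newEntry.completed()) {
                System.out.println(newEntry.snapshot() + " newly completed");
            }
        }
    }
}
```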
I could use help to better understand two pieces of information.

- How come a shard update for one snapshot ends up updating the state of another snapshot?
- How come a `ShardSnapshotUpdate` processed unassigned shards before `SnapshotsService#processExternalChanges`?

Using this particular failure as an example, I think the answer to the first question is the following sequence of calls when processing shard updates for `snapshot-clone-11` (all in `server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java` at commit d1ca37a):
- line 3236: `tryStartNextTaskAfterCloneUpdated(update.repoShardId, update.updatedState);`
- line 3345: `tryStartSnapshotAfterCloneFinish(repoShardId, updatedState.generation());`
- line 3380: `startShardSnapshot(repoShardId, generation);`
- line 3401: `final ShardSnapshotStatus shardSnapshotStatus = initShardSnapshotStatus(generation, shardRouting, nodeIdRemovalPredicate);`
In `initShardSnapshotStatus`, we find that the shard is unassigned and mark its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`. Is this correct?
Assuming the above is more or less correct, now for question 2: I don't understand how the unassigned shard was not processed by `processExternalChanges`, which is called inside `SnapshotsService#applyClusterState`. This is where my lack of knowledge of cluster coordination hurts ... I thought cluster state update and apply run strictly one after the other, i.e. after a cluster state update the apply runs before computing another update. In the regular case, node disconnection is handled by `processExternalChanges`, which ends snapshots based on the missing shards. How come `ShardSnapshotUpdate` sees the missing shard first in this case? You said

> while that update was in flight the node left the cluster so the other shard snapshots moved to state MISSING in the same update

This seems to say that while `ShardSnapshotUpdate` is running, the cluster state changed under it? Shard updates use a dedicated master task queue, which I assume means they cannot be processed together with a node-left event? Sorry, this all sounds pretty silly; I must have some serious/key misunderstanding somewhere. I'd appreciate it if you could help clarify the sequence/order of events here. Thanks!
> In `initShardSnapshotStatus`, we find that the shard is unassigned and mark its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`.

Correct.
> I don't understand how the unassigned shard was not processed by `processExternalChanges`

Because when `processExternalChanges` was called, `[index-0][1]` was still in state `INIT` in `snapshot-clone-11`, and clones don't run on the data nodes so they don't get cancelled on a node-left event, which means that the subsequent operations on `[index-0][1]` also remain unchanged (in state `QUEUED`). But then when that shard clone completes it moves to `SUCCESS`, and the later `QUEUED` states all move to `MISSING` via the process you describe.
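To make that sequence concrete, here is a deliberately oversimplified, hypothetical sketch (stand-in types only, not the real `SnapshotsInProgress` machinery): the node-left event leaves the clone's `INIT` shard and the later snapshot's `QUEUED` shard untouched; only when the clone completes does the queued operation try to start, find the shard unassigned, and jump straight to `MISSING`, which can complete the later snapshot in the same batch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in shard states, loosely modelled on the states discussed above.
enum ShardState { INIT, QUEUED, SUCCESS, MISSING }

final class CascadeSketch {
    public static void main(String[] args) {
        Map<String, ShardState> clone11 = new HashMap<>(Map.of("[index-0][1]", ShardState.INIT));
        Map<String, ShardState> partial12 = new HashMap<>(Map.of("[index-0][1]", ShardState.QUEUED));

        // Node-left event: clones don't run on data nodes, and the QUEUED entry carries no
        // node id, so neither snapshot is touched by the external-changes handling.

        // Later the clone's shard update arrives and the clone shard completes...
        clone11.put("[index-0][1]", ShardState.SUCCESS);

        // ...which lets the next queued operation on the same shard try to start; the shard
        // is now unassigned, so it goes straight to MISSING instead of INIT.
        boolean shardAssigned = false;
        partial12.put("[index-0][1]", shardAssigned ? ShardState.INIT : ShardState.MISSING);

        // With every shard in a terminal state, the later snapshot is now complete too.
        boolean partial12Completed = partial12.values().stream()
            .allMatch(s -> s == ShardState.SUCCESS || s == ShardState.MISSING);
        System.out.println("snapshot-partial-12 completed: " + partial12Completed);
    }
}
```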
We synced in a separate channel, and it helped me understand why `snapshot-partial-12` did not get finished during `processExternalChanges`: (1) the shard did not get marked as a `knownFailure` since it was part of a clone, and (2) `snapshot-partial-12` at that time still had the shard as `UNASSIGNED_QUEUED`, which does not have a nodeId and hence does not respond to the node leaving.
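A minimal sketch of point (2), under my simplifying assumption that the node-left handling only fails shard statuses that actually carry the departed node's id:

```java
// Hypothetical stand-in for a shard snapshot status; nodeId is null for a shard that is
// queued behind another operation and not yet assigned to any node (UNASSIGNED_QUEUED).
record ShardStatus(String nodeId, String state) {}

final class NodeLeftSketch {
    // Only statuses pinned to the departed node are affected; an UNASSIGNED_QUEUED status
    // has no nodeId to match, so it survives the node-left event untouched.
    static boolean failedByNodeLeft(ShardStatus status, String departedNodeId) {
        return departedNodeId.equals(status.nodeId());
    }

    public static void main(String[] args) {
        System.out.println(failedByNodeLeft(new ShardStatus("node-1", "INIT"), "node-1"));            // true
        System.out.println(failedByNodeLeft(new ShardStatus(null, "UNASSIGNED_QUEUED"), "node-1"));   // false
    }
}
```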
LGTM
💔 Backport failed
You can use sqren/backport to manually backport by running
Backported to 8.12 in a14ad5f.