
Finalize all snapshots completed by shard snapshot updates #105245

Conversation

DaveCTurner
Contributor

Today, when processing a batch of `ShardSnapshotUpdate` tasks, each
update's listener considers whether the corresponding snapshot has
completed and, if so, enqueues it for finalization. This is somewhat
inefficient, since we may be processing many shard snapshot updates for
the same few snapshots but there's no need to check each snapshot for
completion more than once. It's also insufficient, since the completion
of a shard snapshot may cause the completion of subsequent snapshots too
(e.g. they can go from state `QUEUED` straight to `MISSING`).

This commit detaches the completion handling from the individual shard
snapshot updates and instead makes sure that any snapshot that reaches a
completed state is enqueued for finalization.

Closes #104939
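
For illustration, here is a minimal, self-contained sketch of the idea using a deliberately simplified model of snapshot entries. It is not the actual SnapshotsService code; apart from the completed-state check (which mirrors the `newlyCompletedEntries` hunk discussed below), every name here is hypothetical.

// Simplified sketch: apply a whole batch of shard snapshot updates first, then
// scan the updated entries once and enqueue every newly completed snapshot for
// finalization, instead of checking completion once per shard update.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class SnapshotFinalizationSketch {
    enum State { INIT, QUEUED, MISSING, SUCCESS }

    record Entry(String snapshot, State state) {
        boolean completed() {
            return state == State.SUCCESS || state == State.MISSING;
        }
    }

    private final Queue<Entry> finalizationQueue = new ArrayDeque<>();

    // oldEntries.get(i) and newEntries.get(i) describe the same snapshot before
    // and after the batch of updates was applied.
    void afterBatchApplied(List<Entry> oldEntries, List<Entry> newEntries) {
        for (int i = 0; i < newEntries.size(); i++) {
            Entry newEntry = newEntries.get(i);
            // Enqueue each newly completed snapshot exactly once; this also
            // covers cascading completions such as QUEUED -> MISSING.
            if (newEntry != oldEntries.get(i) && newEntry.completed()) {
                finalizationQueue.add(newEntry);
            }
        }
    }
}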

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.13.0 v8.12.2 labels Feb 7, 2024
@DaveCTurner DaveCTurner requested a review from ywangd February 7, 2024 14:29
@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 7, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -1230,6 +1242,142 @@ public void testRunConcurrentSnapshots() {
}
}

public void testSnapshotCompletedByNodeLeft() {
Contributor Author

Rather than adding a test that reliably reproduces the failure in #104939, I've elected to write this more general test with enough randomisation to cover a variety of similar situations too. It does reproduce the failure in #104939 much more frequently than `SnapshotStressTestsIT`, but even so it might take a few hundred iterations to hit the problem.

Member

@ywangd ywangd left a comment


The changes look fine on their own. But I could use some help to understand how the original code led to the CI failure. Thanks!

for (final var taskContext : batchExecutionContext.taskContexts()) {
    if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
        final var ref = onCompletionRefs.acquire();
        needsReroute = true;
Member

Nit: I think `needsReroute` must be true now, since this is inside the `updatesByRepo.isEmpty() == false` check. We may not even need it, since `completionHandler` is only called when there is some update.

Contributor Author

Ah nice, good point. I found that flag irksome indeed, dropped in a0db49e.
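
For readers following along, a rough before/after sketch of the simplification; the shape is reconstructed from the hunk quoted above and only approximates the actual change in a0db49e.

// Before (approximate shape): inside the updatesByRepo.isEmpty() == false
// branch the flag can only ever be set to true, so it carries no information.
if (updatesByRepo.isEmpty() == false) {
    boolean needsReroute = false;
    for (final var taskContext : batchExecutionContext.taskContexts()) {
        if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
            final var ref = onCompletionRefs.acquire();
            needsReroute = true;
            // ... apply the shard snapshot update ...
        }
    }
}

// After (approximate shape): the flag is gone; reaching the completion handler
// already implies that at least one shard snapshot update was processed.
if (updatesByRepo.isEmpty() == false) {
    for (final var taskContext : batchExecutionContext.taskContexts()) {
        if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
            final var ref = onCompletionRefs.acquire();
            // ... apply the shard snapshot update ...
        }
    }
}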

@ywangd
Member

ywangd commented Feb 9, 2024

I wrote a long comment only to realize that I did not send it out when posting the overall comment ... I'll post it again ...

Comment on lines +3173 to +3175
newEntries.add(newEntry);
if (newEntry != entry && newEntry.state().completed()) {
    newlyCompletedEntries.add(newEntry);
Member

I could use help to better understand two pieces of information.

  1. How come a shard update for one snapshot ends up updating the state of another snapshot?
  2. How come a `ShardSnapshotUpdate` processed unassigned shards before `SnapshotsService#processExternalChanges` did?

Using this particular failure as an example, I think the answer to the first question is the following sequence of calls when processing shard updates for `snapshot-clone-11`:

tryStartNextTaskAfterCloneUpdated(update.repoShardId, update.updatedState);

tryStartSnapshotAfterCloneFinish(repoShardId, updatedState.generation());

startShardSnapshot(repoShardId, generation);

final ShardSnapshotStatus shardSnapshotStatus = initShardSnapshotStatus(generation, shardRouting, nodeIdRemovalPredicate);

In `initShardSnapshotStatus`, we found out that the shard is unassigned and marked its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`.

Is this correct?

Assuming the above is more or less correct, now for question 2: I don't understand how the unassigned shard was not processed by `processExternalChanges`, which is called inside `SnapshotsService#applyClusterState`. This is where my lack of knowledge of cluster coordination hurts ... I thought cluster state update and apply run strictly one after the other, i.e. after a cluster state update, the apply runs before computing another update. In the regular case, node disconnection is handled by `processExternalChanges`, which ends snapshots based on the missing shards. How come `ShardSnapshotUpdate` sees the missing shard first in this case? You said

while that update was in flight the node left the cluster so the other shard snapshots moved to state MISSING in the same update

This seems to say that while the `ShardSnapshotUpdate` is running, the cluster state changed under it? Shard updates use a dedicated master task queue, so I assume they cannot be processed together with a node-left event? Sorry, this all sounds pretty silly. I must have some serious/key misunderstanding somewhere. I'd appreciate it if you could help clarify the sequence/order of events here. Thanks!

Contributor Author

In `initShardSnapshotStatus`, we found out that the shard is unassigned and marked its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`.

Correct.

I don't understand how the unassigned shard was not processed by `processExternalChanges`

Because when `processExternalChanges` was called, `[index-0][1]` was still in state `INIT` in `snapshot-clone-11`, and clones don't run on the data nodes so they don't get cancelled on a node-left event, which means that the subsequent operations on `[index-0][1]` also remain unchanged (in state `QUEUED`). But then, when that shard clone completes, it moves to `SUCCESS` and the later `QUEUED` states all move to `MISSING` via the process you describe.
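
Spelled out as a timeline for `[index-0][1]` (reconstructed from this thread; the snapshot names follow the comments above):

  1. The node leaves and `processExternalChanges` runs, but `[index-0][1]` is still at `INIT` in `snapshot-clone-11`; clones don't run on data nodes, so nothing is cancelled, and the queued operation in `snapshot-partial-12` stays `UNASSIGNED_QUEUED`.
  2. The shard clone completes, so `snapshot-clone-11`'s entry for the shard moves to `SUCCESS`.
  3. While processing that `ShardSnapshotUpdate`, `tryStartNextTaskAfterCloneUpdated` tries to start the next queued operation, `initShardSnapshotStatus` finds the shard unassigned, and `snapshot-partial-12`'s entry goes straight from `QUEUED` to `MISSING`.
  4. `snapshot-partial-12` is now complete, but the per-update listener only checked the snapshot that the update belonged to (`snapshot-clone-11`), which is why the cascaded completion could be missed before this change.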

Member

We synced in a separate channel and it helped me understand why `snapshot-partial-12` did not get finished during `processExternalChanges`: (1) the shard did not get marked as a `knownFailure` since it was part of a clone, and (2) `snapshot-partial-12` at that time still had the shard as `UNASSIGNED_QUEUED`, which does not have a `nodeId` and hence does not respond to the node leaving.

@DaveCTurner DaveCTurner requested a review from ywangd February 9, 2024 09:32
Member

@ywangd ywangd left a comment

LGTM

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 9, 2024
@elasticsearchmachine elasticsearchmachine merged commit 61f2090 into elastic:main Feb 9, 2024
14 checks passed
@DaveCTurner DaveCTurner deleted the 2024/02/07/snapshot-shard-update-completion-handler branch February 9, 2024 12:24
@elasticsearchmachine
Collaborator

💔 Backport failed

Branch: 8.12
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 105245`.

DaveCTurner added a commit that referenced this pull request Feb 9, 2024
@DaveCTurner
Contributor Author

Backported to 8.12 in a14ad5f.

@DaveCTurner DaveCTurner restored the 2024/02/07/snapshot-shard-update-completion-handler branch June 17, 2024 06:17
Linked issue: [CI] SnapshotStressTestsIT testRandomActivities failing