
Finalize all snapshots completed by shard snapshot updates #105245

Conversation

DaveCTurner
Contributor

Today, when processing a batch of `ShardSnapshotUpdate` tasks, each
update's listener considers whether the corresponding snapshot has
completed and, if so, enqueues it for finalization. This is somewhat
inefficient, since we may be processing many shard snapshot updates for
the same few snapshots but there's no need to check each snapshot for
completion more than once. It's also insufficient, since the completion
of a shard snapshot may cause the completion of subsequent snapshots too
(e.g. they can go from state `QUEUED` straight to `MISSING`).

This commit detaches the completion handling from the individual shard
snapshot updates and instead makes sure that any snapshot that reaches a
completed state is enqueued for finalization.

Closes #104939
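
For illustration, here is a minimal, self-contained sketch of the idea using a deliberately simplified model of snapshot entries. It is not the actual SnapshotsService code; apart from the completed-state check (which mirrors the `newlyCompletedEntries` hunk discussed below), every name here is hypothetical.

// Simplified sketch: apply a whole batch of shard snapshot updates first, then
// scan the updated entries once and enqueue every newly completed snapshot for
// finalization, instead of checking completion once per shard update.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class SnapshotFinalizationSketch {
    enum State { INIT, QUEUED, MISSING, SUCCESS }

    record Entry(String snapshot, State state) {
        boolean completed() {
            return state == State.SUCCESS || state == State.MISSING;
        }
    }

    private final Queue<Entry> finalizationQueue = new ArrayDeque<>();

    // oldEntries.get(i) and newEntries.get(i) describe the same snapshot before
    // and after the batch of updates was applied.
    void afterBatchApplied(List<Entry> oldEntries, List<Entry> newEntries) {
        for (int i = 0; i < newEntries.size(); i++) {
            Entry newEntry = newEntries.get(i);
            // Enqueue each newly completed snapshot exactly once; this also
            // covers cascading completions such as QUEUED -> MISSING.
            if (newEntry != oldEntries.get(i) && newEntry.completed()) {
                finalizationQueue.add(newEntry);
            }
        }
    }
}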

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.13.0 v8.12.2 labels Feb 7, 2024
@DaveCTurner DaveCTurner requested a review from ywangd February 7, 2024 14:29
@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 7, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -1230,6 +1242,142 @@ public void testRunConcurrentSnapshots() {
}
}

public void testSnapshotCompletedByNodeLeft() {
Contributor Author

Rather than adding a test that reliably reproduces the failure in #104939, I've elected to write this more general test with enough randomisation to cover a variety of similar situations too. It does reproduce the failure in #104939 much more frequently than `SnapshotStressTestsIT`, but even so it might take a few hundred iterations to hit the problem.

Member

@ywangd ywangd left a comment


The changes look fine on their own. But I could use some help to understand how the original code led to the CI failure. Thanks!

for (final var taskContext : batchExecutionContext.taskContexts()) {
    if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
        final var ref = onCompletionRefs.acquire();
        needsReroute = true;
Member

Nit: I think `needsReroute` must be true now, since this is inside the `updatesByRepo.isEmpty() == false` check. We may not even need it, since `completionHandler` is only called when there is some update.

Contributor Author

Ah nice, good point. I found that flag irksome indeed, dropped in a0db49e.
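
For readers following along, a rough before/after sketch of the simplification; the shape is reconstructed from the hunk quoted above and only approximates the actual change in a0db49e.

// Before (approximate shape): inside the updatesByRepo.isEmpty() == false
// branch the flag can only ever be set to true, so it carries no information.
if (updatesByRepo.isEmpty() == false) {
    boolean needsReroute = false;
    for (final var taskContext : batchExecutionContext.taskContexts()) {
        if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
            final var ref = onCompletionRefs.acquire();
            needsReroute = true;
            // ... apply the shard snapshot update ...
        }
    }
}

// After (approximate shape): the flag is gone; reaching the completion handler
// already implies that at least one shard snapshot update was processed.
if (updatesByRepo.isEmpty() == false) {
    for (final var taskContext : batchExecutionContext.taskContexts()) {
        if (taskContext.getTask() instanceof ShardSnapshotUpdate task) {
            final var ref = onCompletionRefs.acquire();
            // ... apply the shard snapshot update ...
        }
    }
}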

@ywangd
Member

ywangd commented Feb 9, 2024

I wrote a long comment only to realize that I did not send it out when posting the overall comment ... I'll post it again ...

Comment on lines +3173 to +3175
newEntries.add(newEntry);
if (newEntry != entry && newEntry.state().completed()) {
    newlyCompletedEntries.add(newEntry);
Member

I could use help to better understand two pieces of information.

  1. How come a shard update for one snapshot ends up updating the state of another snapshot?
  2. How come a `ShardSnapshotUpdate` processed unassigned shards before `SnapshotsService#processExternalChanges` did?

Using this particular failure as an example, I think the answer to the first question is the following sequence of calls when processing shard updates for `snapshot-clone-11`:

tryStartNextTaskAfterCloneUpdated(update.repoShardId, update.updatedState);

tryStartSnapshotAfterCloneFinish(repoShardId, updatedState.generation());

startShardSnapshot(repoShardId, generation);

final ShardSnapshotStatus shardSnapshotStatus = initShardSnapshotStatus(generation, shardRouting, nodeIdRemovalPredicate);

In `initShardSnapshotStatus`, we found out that the shard is unassigned and marked its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`.

Is this correct?

Assuming the above is more or less correct, now for question 2: I don't understand how the unassigned shard was not processed by `processExternalChanges`, which is called inside `SnapshotsService#applyClusterState`. This is where my lack of knowledge of cluster coordination hurts ... I thought cluster state update and apply run strictly one after the other, i.e. after a cluster state update, the apply runs before computing another update. In the regular case, node disconnection is handled by `processExternalChanges`, which ends snapshots based on the missing shards. How come `ShardSnapshotUpdate` sees the missing shard first in this case? You said

while that update was in flight the node left the cluster so the other shard snapshots moved to state MISSING in the same update

This seems to say that while the `ShardSnapshotUpdate` is running, the cluster state changed under it? Shard updates use a dedicated master task queue, so I assume they cannot be processed together with a node-left event? Sorry, this all sounds pretty silly. I must have some serious/key misunderstanding somewhere. I'd appreciate it if you could help clarify the sequence/order of events here. Thanks!

Contributor Author

In `initShardSnapshotStatus`, we found out that the shard is unassigned and marked its state as `MISSING`, which in turn leads to `snapshot-partial-12` being marked as `SUCCESS`.

Correct.

I don't understand how the unassigned shard was not processed by `processExternalChanges`

Because when `processExternalChanges` was called, `[index-0][1]` was still in state `INIT` in `snapshot-clone-11`, and clones don't run on the data nodes so they don't get cancelled on a node-left event, which means that the subsequent operations on `[index-0][1]` also remain unchanged (in state `QUEUED`). But then, when that shard clone completes, it moves to `SUCCESS` and the later `QUEUED` states all move to `MISSING` via the process you describe.
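
Spelled out as a timeline for `[index-0][1]` (reconstructed from this thread; the snapshot names follow the comments above):

  1. The node leaves and `processExternalChanges` runs, but `[index-0][1]` is still at `INIT` in `snapshot-clone-11`; clones don't run on data nodes, so nothing is cancelled, and the queued operation in `snapshot-partial-12` stays `UNASSIGNED_QUEUED`.
  2. The shard clone completes, so `snapshot-clone-11`'s entry for the shard moves to `SUCCESS`.
  3. While processing that `ShardSnapshotUpdate`, `tryStartNextTaskAfterCloneUpdated` tries to start the next queued operation, `initShardSnapshotStatus` finds the shard unassigned, and `snapshot-partial-12`'s entry goes straight from `QUEUED` to `MISSING`.
  4. `snapshot-partial-12` is now complete, but the per-update listener only checked the snapshot that the update belonged to (`snapshot-clone-11`), which is why the cascaded completion could be missed before this change.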

Member

We synced in a separate channel and it helped me understand why `snapshot-partial-12` did not get finished during `processExternalChanges`: (1) the shard did not get marked as a `knownFailure` since it was part of a clone, and (2) `snapshot-partial-12` at that time still had the shard as `UNASSIGNED_QUEUED`, which does not have a `nodeId` and hence does not respond to the node leaving.

@DaveCTurner DaveCTurner requested a review from ywangd February 9, 2024 09:32
Member

@ywangd ywangd left a comment

LGTM

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 9, 2024
@elasticsearchmachine elasticsearchmachine merged commit 61f2090 into elastic:main Feb 9, 2024
14 checks passed
@DaveCTurner DaveCTurner deleted the 2024/02/07/snapshot-shard-update-completion-handler branch February 9, 2024 12:24
@elasticsearchmachine
Collaborator

💔 Backport failed

Branch: 8.12
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 105245`.

DaveCTurner added a commit that referenced this pull request Feb 9, 2024
@DaveCTurner
Contributor Author

Backported to 8.12 in a14ad5f.

@DaveCTurner DaveCTurner restored the 2024/02/07/snapshot-shard-update-completion-handler branch June 17, 2024 06:17
Linked issue: [CI] SnapshotStressTestsIT testRandomActivities failing