Fix recovery stage transition with sync_id #57754
Conversation
Pinging @elastic/es-distributed (:Distributed/Recovery)
Is this assertion valid? If so, can we add it?
diff --git a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
index 33139912920..8000bbabc53 100644
--- a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
+++ b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
@@ -542,6 +542,8 @@ public class RecoverySourceHandler {
phase1ExistingFileSizes, existingTotalSize, took));
}, listener::onFailure);
} else {
+ assert shard.indexSettings().getIndexVersionCreated().before(Version.V_7_2_0) ||
+ request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO;
logger.trace("skipping [phase1] since source and target have identical sync id [{}]", recoverySourceMetadata.getSyncId());
// but we must still create a retention lease
LGTM apart from that and one other request.
Hi @dnhatn, based on this comment:
Does that mean that when recovering a shard from a pre-7.2 version to a 7.2+ version, it can only fail to move to the TRANSLOG stage if the first recovery attempt fails and is retried? Is it possible that it cannot move to the TRANSLOG stage in a normal recovery without any initial failure? In our case (#57708: adding 7.5 nodes to a 6.8 cluster for recovery), we didn't see any failures except the wrong-stage exception. Also, why after performing
By the way, in #57708, both the replica and the primary are on 7.5.1 nodes. The flow is:
Thanks @howardhuanghua. I am trying to reproduce your situation.
@dnhatn Thanks. One more point: in our situation, we could see that the blocked index is empty; it contains no documents.
Hi @howardhuanghua, @hubbleview was kind enough to help me reproduce the scenario that you provided. It matches what we outline in this PR. When migrating to new nodes, we first synced-flush an index and then exclude the old nodes via the allocation filter. ES relocates both the primary and the replica to the new nodes at the same time. If phase2 of the replica's recovery starts after the relocation of the primary completes, it hits an IllegalIndexShardStateException. This exception is retryable, so we log it at the trace level and retry the recovery. At this point, the primary is on the new node, and the replica has an index commit with a sync_id, but that commit is not safe (because we do not have the global checkpoint from the clean_files step). The first retried recovery fails due to the improper stage transition, but the subsequent recovery succeeds because we should have the global checkpoint from the finalize step. Thank you for reporting the issue.
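For anyone trying to reproduce this, here is a minimal sketch of the migration steps described above, using plain HTTP calls against a local 7.x cluster. The index name, node-name pattern, and endpoint are placeholders, not values taken from this PR.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RelocationRepro {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Synced flush writes a sync_id into the index commits of the primary and the replica.
        //    (_flush/synced exists on 7.x; it was removed in 8.0.)
        HttpRequest syncedFlush = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/my-index/_flush/synced"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        System.out.println(client.send(syncedFlush, HttpResponse.BodyHandlers.ofString()).body());

        // 2. Exclude the old nodes via the allocation filter; ES then relocates the primary
        //    and the replica to the new nodes at the same time.
        String settings = "{\"index.routing.allocation.exclude._name\": \"old-node-*\"}";
        HttpRequest exclude = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/my-index/_settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settings))
            .build();
        System.out.println(client.send(exclude, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```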
@DaveCTurner Thanks for reviewing. The assertion is great, but we will trip it if we hit a simulated I/O exception while we are recovering locally.
Hi @dnhatn, thanks for the explanation; now I understand the issue. I just want to confirm the relocation scenarios:
Hi @howardhuanghua,
Yes, that's correct. A replica always recovers from its primary, and the recovery tries to reuse the existing data when possible.
Is that OK? If we couldn't recover the shard locally, does it make sense to proceed with the rest of the recovery like that?
When the local translog is corrupted, we won't be able to recover locally up to the global checkpoint. In this case, we will try to reuse the existing index commit, which can have a sync_id.
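To make the sync_id reuse concrete, here is a simplified sketch of the skip-phase1 decision, assuming the sync marker lives in the Lucene commit user data under the key `sync_id`. This is an illustration, not the actual RecoverySourceHandler code.

```java
import java.util.Map;

final class Phase1Decision {
    /**
     * Returns true when file copy (phase1) can be skipped: both the source and the
     * target commit carry the same sync_id and report the same number of documents.
     * Note that the reused target commit is not necessarily a safe commit.
     */
    static boolean canSkipPhase1(Map<String, String> sourceCommitUserData,
                                 Map<String, String> targetCommitUserData,
                                 long sourceNumDocs, long targetNumDocs) {
        String sourceSyncId = sourceCommitUserData.get("sync_id");
        String targetSyncId = targetCommitUserData.get("sync_id");
        return sourceSyncId != null
            && sourceSyncId.equals(targetSyncId)
            && sourceNumDocs == targetNumDocs;
    }
}
```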
OK, since the whole sync-marker thing is going away soon, we don't need to dwell on this further, so this LGTM.
Thanks David!
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1, because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and is retried, the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG.
Relates #57187
Closes #57708
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1, because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and is retried, the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG.
This issue was addressed in #57187, but the fix was not forward-ported to 8.0. I think we should do so, as this issue can occur in 8.0 (it requires a full cluster restart to 8.0 after a peer recovery on 7.1 fails after it has completed phase 1).
Closes #57708
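For readers unfamiliar with the stage check, here is a simplified sketch of the invariant that trips here; it is not the actual RecoveryState implementation. Each stage may only be entered from its expected predecessor, so when the sync_id path skips the clean_files step, the stage is left where a retried recovery does not expect it and the next transition is rejected, which matches the wrong-stage exception reported in #57708.

```java
// Simplified model of the recovery stages and the transition check.
enum Stage { INIT, INDEX, VERIFY_INDEX, TRANSLOG, FINALIZE, DONE }

final class StageTracker {
    private Stage stage = Stage.INIT;

    // Moves to `next` only if the current stage matches `expectedCurrent`;
    // otherwise the transition is rejected with an IllegalStateException.
    synchronized void moveTo(Stage next, Stage expectedCurrent) {
        if (stage != expectedCurrent) {
            throw new IllegalStateException("can't move recovery to stage [" + next
                + "]. current stage: [" + stage + "] (expected [" + expectedCurrent + "])");
        }
        stage = next;
    }
}
```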