
Fix recovery stage transition with sync_id #57754

Merged
merged 5 commits into elastic:master on Jun 15, 2020

Conversation

@dnhatn (Member) commented Jun 5, 2020

If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG.

This issue was addressed in #57187, but not forward-ported to 8.0. I think we should forward-port it, as the issue can still occur in 8.0 (it requires a full cluster restart to 8.0 after a peer recovery on 7.1 fails once it has completed phase 1).

Closes #57708
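
To illustrate the failure mode, here is a minimal, self-contained sketch of the stage machine. It is not the actual RecoveryState code (the class and validation are invented and simplified), but the stage names are meant to mirror RecoveryState.Stage, and it shows why a skipped clean_files step breaks the next transition on a retried recovery:

// Illustrative sketch only, not the Elasticsearch implementation.
enum Stage { INIT, INDEX, VERIFY_INDEX, TRANSLOG, FINALIZE, DONE }

final class RecoveryStageTracker {
    private Stage stage = Stage.INIT;

    // Simplified validation: a transition is only allowed from the expected previous stage.
    void moveToStage(Stage expectedCurrent, Stage next) {
        if (stage != expectedCurrent) {
            throw new IllegalStateException("can't move recovery to stage [" + next
                + "]. current stage: [" + stage + "] (expected [" + expectedCurrent + "])");
        }
        stage = next;
    }

    public static void main(String[] args) {
        RecoveryStageTracker recovery = new RecoveryStageTracker();
        recovery.moveToStage(Stage.INIT, Stage.INDEX);        // phase1 starts copying files
        // With a matching sync_id the clean_files step is skipped, so nothing advances the
        // stage here; when the retried recovery later tries to prepare the translog ...
        recovery.moveToStage(Stage.VERIFY_INDEX, Stage.TRANSLOG); // ... this throws IllegalStateException
    }
}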

@dnhatn dnhatn added the >bug, :Distributed Indexing/Recovery, v8.0.0, v7.8.1, and v7.9.0 labels Jun 5, 2020
@dnhatn dnhatn requested review from ywelsch and DaveCTurner June 5, 2020 17:35
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/Recovery)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Jun 5, 2020
@DaveCTurner (Contributor) left a comment

Is this assertion valid? If so, can we add it?

diff --git a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
index 33139912920..8000bbabc53 100644
--- a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
+++ b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
@@ -542,6 +542,8 @@ public class RecoverySourceHandler {
                         phase1ExistingFileSizes, existingTotalSize, took));
                 }, listener::onFailure);
             } else {
+                assert shard.indexSettings().getIndexVersionCreated().before(Version.V_7_2_0) ||
+                    request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO;
                 logger.trace("skipping [phase1] since source and target have identical sync id [{}]", recoverySourceMetadata.getSyncId());

                 // but we must still create a retention lease

LGTM apart from that and one other request.

@howardhuanghua (Contributor) commented:

Hi @dnhatn, based on this comment:

If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG.

Does that mean that, when recovering a shard from a pre-7.2 node to a 7.2+ node, it can only fail to move to the TRANSLOG stage if the recovery fails the first time and is retried? Is it possible for it to get stuck before the TRANSLOG stage in a normal recovery that never failed? In our case (adding 7.5 nodes to a 6.8 cluster, #57708) we didn't see any failures except the wrong-stage exception. Also, why does running _cluster/reroute?retry_failed successfully relocate the stuck shard? Thanks.

@howardhuanghua (Contributor) commented Jun 9, 2020

By the way, in #57708, both the replica and the primary are on 7.5.1 nodes. The flow is:

  1. Exclude the 6.8.2 nodes.
  2. The .kibana primary shard relocates from a 6.8.2 node to a 7.5.1 node.
  3. The .kibana replica shard starts peer recovery from the primary (both on 7.5.1); the recovery gets blocked.

@dnhatn (Member, Author) commented Jun 10, 2020

Thanks @howardhuanghua. I am trying to reproduce your situation.

@howardhuanghua (Contributor) commented:

@dnhatn Thanks. One more data point: in our situation, the blocked index was empty, with no docs in it.

@dnhatn (Member, Author) commented Jun 11, 2020

Hi @howardhuanghua,

@hubbleview was kind enough to help me reproduce the scenario that you provided. It matches what we outlined in the PR description. When migrating to new nodes, we first issue a synced flush on an index, then exclude the old nodes in the allocation filter. ES will relocate both the primary and the replica to the new nodes at the same time. If phase2 of the replica's recovery starts after the relocation of the primary completes, it will hit an IllegalIndexShardStateException. This exception is retry-able, hence we log it at the trace level and retry with another recovery. At this point, the primary is on the new node, and the replica has an index commit with a sync_id, but that commit is not safe (because we do not have the global checkpoint in the clean_files step). The first retried recovery will fail due to the improper stage transition, but the subsequent recovery will succeed because we will have the global checkpoint from the finalize step.
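
For anyone who wants to reproduce this, the migration flow above can be scripted roughly like this with the low-level Java REST client; the cluster address, index name, and node names are placeholders, and the _flush/synced API only exists on pre-8.0 clusters:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class MigrationRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at a mixed 6.8.x / 7.x cluster.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Synced flush writes a sync_id into the index commits (API removed in 8.0).
            client.performRequest(new Request("POST", "/.kibana/_flush/synced"));

            // 2. Exclude the old nodes so that both the primary and the replica start
            //    relocating to the new nodes at the same time.
            Request exclude = new Request("PUT", "/_cluster/settings");
            exclude.setJsonEntity(
                "{\"transient\":{\"cluster.routing.allocation.exclude._name\":\"old-node-1,old-node-2\"}}");
            client.performRequest(exclude);
        }
    }
}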

Thank you for reporting the issue.

@dnhatn (Member, Author) commented Jun 11, 2020

@DaveCTurner Thanks for reviewing. The assertion is great, but we will trip it if we hit a simulated I/O exception while we are recovering locally.

@dnhatn dnhatn requested a review from DaveCTurner June 11, 2020 17:44
@howardhuanghua (Contributor) commented Jun 12, 2020

Hi @dnhatn, thanks for the explanation; now I understand the issue. I just want to confirm the relocation scenarios:

ES will relocate both the primary and replica to the new nodes at the same time.

  1. With an allocation filter set, both the primary and the replica are relocated to the new nodes at the same time, as you described.
  2. When a shard move is used to move a replica to another node, is that replica peer recovered from its primary or from the original replica?
  3. If a replica shard fails and no local data can be used, does it always peer recover from its primary?

@dnhatn (Member, Author) commented Jun 12, 2020

Hi @howardhuanghua,

  1. With an allocation filter set, both the primary and the replica are relocated to the new nodes at the same time, as you described.

Yes, that's correct.

  2. When a shard move is used to move a replica to another node, is that replica peer recovered from its primary or from the original replica?

A replica always recovers from its primary.

  3. If a replica shard fails and no local data can be used, does it always peer recover from its primary?

A replica always recovers from its primary, and the recovery tries to reuse the existing data when possible.
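
To expand on "reuse the existing data": conceptually, phase1 compares the files the target already has against the primary's snapshot and only copies the ones that are missing or different. A rough sketch of that idea (illustrative only, not the actual file-comparison code in Store, and it ignores the sync_id shortcut):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal stand-in for per-file metadata (name, length, checksum).
record FileMeta(String name, long length, String checksum) {}

final class Phase1Plan {
    // Returns the source files that must be copied over the wire; identical files are reused.
    static List<FileMeta> filesToRecover(Map<String, FileMeta> sourceFiles,
                                         Map<String, FileMeta> targetFiles) {
        List<FileMeta> missingOrDifferent = new ArrayList<>();
        for (FileMeta source : sourceFiles.values()) {
            FileMeta local = targetFiles.get(source.name());
            boolean identical = local != null
                && local.length() == source.length()
                && local.checksum().equals(source.checksum());
            if (identical == false) {
                missingOrDifferent.add(source); // must be copied from the primary
            }
            // identical files are kept as-is on the target's disk
        }
        return missingOrDifferent;
    }
}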

@DaveCTurner (Contributor) commented:

The assertion is great, but we will trip it if we hit a simulated I/O exception while we are recovering locally.

Is that ok? If we couldn't recover the shard locally, does it make sense to proceed with the rest of the recovery like that?

@dnhatn (Member, Author) commented Jun 15, 2020

The assertion is great, but we will trip it if we hit a simulated I/O exception while we are recovering locally.

Is that ok? If we couldn't recover the shard locally, does it make sense to proceed with the rest of the recovery like that?

When the local translog is corrupted, we won't be able to recover locally up to the global checkpoint. In this case, we will try to reuse the existing index commit, which can have a sync_id.
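
In other words, the target first tries to recover locally up to the global checkpoint, and only if its translog turns out to be unusable does it fall back to the newest index commit it already has, which may still carry a sync_id. A hedged sketch of that decision, with invented names (this is not the actual recovery target API):

// Illustrative sketch only: the names below are made up for clarity.
final class LocalRecoveryPlanner {

    interface LocalTranslog {
        // Replays operations up to the given sequence number; throws if the translog is corrupted.
        void replayUpTo(long seqNo) throws java.io.IOException;
    }

    static final long UNASSIGNED_SEQ_NO = -2L;

    // Returns the sequence number from which peer recovery should start.
    // UNASSIGNED_SEQ_NO means "no usable local history": fall back to the existing
    // index commit (which may still contain a sync_id) and let phase1 handle the rest.
    static long recoverLocallyUpTo(long globalCheckpoint, LocalTranslog translog) {
        try {
            translog.replayUpTo(globalCheckpoint);
            return globalCheckpoint + 1; // local history is complete up to the global checkpoint
        } catch (java.io.IOException translogUnusable) {
            return UNASSIGNED_SEQ_NO;
        }
    }
}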

@DaveCTurner (Contributor) left a comment

Ok, since the whole sync marker thing is going away soon, we don't need to dwell on this further, so this LGTM.

@dnhatn (Member, Author) commented Jun 15, 2020

Thanks David!

@dnhatn dnhatn merged commit bf910e9 into elastic:master Jun 15, 2020
@dnhatn dnhatn deleted the fix-recovery-transition branch June 15, 2020 17:06
dnhatn added a commit that referenced this pull request Jul 7, 2020
If the recovery source is on an old node (before 7.2), then the recovery
target won't have the safe commit after phase1 because the recovery
source does not send the global checkpoint in the clean_files step. And
if the recovery fails and retries, then the recovery stage won't
transition properly. If a sync_id is used in peer recovery, then the
clean_files step won't be executed to move the stage to TRANSLOG.

Relates #57187
Closes #57708
dnhatn added a commit that referenced this pull request Jul 7, 2020