
Replica recovery on follower can fail after recovery from remote #39000

Closed
jasontedor opened this issue Feb 16, 2019 · 9 comments

@jasontedor (Member) commented Feb 16, 2019

Today during peer recovery, we replay operations from the translog of the primary to the replica. This builds a history of operations on the replica, in case it is later promoted to primary. Unfortunately, this is currently fatal for peer recovery of a follower shard after the primary shard has recovered remotely.

With recovery from remote, we copy over the index files and there is no translog replay phase. Assume the simple case of a newly created follower doing its initial recovery from the leader. If the leader does a flush before the follower initiates recovery from remote, then after recovery from remote the follower will be fully caught up and all primary shards of the follower will have empty translogs. When a replica shard of the follower attempts to recover from the primary shard of the follower, we want to replay the translog from the local checkpoint of the commit, to build a history of operations for it. Since the primary shards of the follower have empty translogs, this replay cannot happen and recovery will fail.
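To make the failure concrete, here is a minimal, self-contained sketch (class and method names are illustrative, not the actual Elasticsearch code): replica recovery wants every operation between the commit's local checkpoint and the max seq no, and an empty translog cannot serve that range.

```java
import java.util.List;

// Illustrative stand-in for a translog operation; not the real ES class.
record Op(long seqNo) {}

class ReplayFromTranslogSketch {
    // Replays all ops in [startingSeqNo, endingSeqNo] from a translog snapshot
    // and fails if the range is not fully covered, mirroring the completeness
    // check that peer recovery performs.
    static void replayRange(List<Op> translogOps, long startingSeqNo, long endingSeqNo) {
        long replayed = translogOps.stream()
                .filter(op -> op.seqNo() >= startingSeqNo && op.seqNo() <= endingSeqNo)
                .count();
        long required = endingSeqNo - startingSeqNo + 1;
        if (replayed < required) {
            // On a freshly remote-recovered follower the translog is empty,
            // so the required range cannot be served and recovery fails here.
            throw new IllegalStateException(
                    "translog holds " + replayed + " of " + required + " required ops");
        }
    }

    public static void main(String[] args) {
        // Follower primary after recovery from remote: empty translog, while the
        // copied commit has local checkpoint 42 and max seq no 57.
        replayRange(List.of(), 43, 57); // throws: replica recovery fails
    }
}
```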

Immediately I see two possibilities:

  • InternalEngine#readHistoryOperations could serve operations from the Lucene history instead of the translog; however, this interplays with the need for shard history retention leases to cooperate with recovery so that these operations are guaranteed to exist (a sketch of this option follows the list)
  • the primary shards of the follower could build their own translog from Lucene history after remotely recovering files
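As a rough illustration of the first option, a history reader could fall back from the translog to Lucene history (soft deletes). Everything here, including `HistorySource`, is hypothetical, and retention leases would still be needed so the Lucene ops are not merged away.

```java
import java.util.List;

// Hypothetical abstraction over a source of operation history; the real
// engine reads either the translog or Lucene soft deletes.
interface HistorySource {
    boolean covers(long fromSeqNo, long toSeqNo);
    List<Long> read(long fromSeqNo, long toSeqNo); // returns op seq nos
}

class HistoryReaderSketch {
    // Sketch of option 1: prefer the translog, fall back to Lucene history
    // when the translog cannot serve the full range (e.g. it is empty after
    // recovery from remote).
    static List<Long> readHistoryOperations(
            HistorySource translog, HistorySource lucene, long from, long to) {
        if (translog.covers(from, to)) {
            return translog.read(from, to);
        }
        // Retention leases must guarantee these ops still exist in Lucene.
        return lucene.read(from, to);
    }
}
```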

The purpose of this issue is to track this as a blocker for all versions of CCR that support recovery from remote, and to initiate discussion of these options (and others).

Relates #38633

@jasontedor jasontedor added the blocker, :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features), v6.7.0, v7.0.0, v7.2.0, and v8.0.0 labels Feb 16, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Feb 16, 2019
@dnhatn (Member) commented Feb 16, 2019

Relates #38949

@ywelsch (Contributor) commented Feb 17, 2019

Our replication model (and peer recovery) relies on a few concepts, among them:

  1. there is always a safe commit (i.e. global checkpoint >= max seq no of the Lucene commit; see the sketch after this list)
  2. local checkpoint equals max sequence number on the primary if all ongoing replication tasks in the replication group complete (typically bounded in time and under the control of the primary).
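A minimal sketch of invariant 1 (illustrative names; Elasticsearch records the max seq no in the commit's user data and retains a safe commit via its deletion policy): a commit is safe exactly when every operation it contains is covered by the global checkpoint.

```java
import java.util.List;
import java.util.Optional;

// Illustrative commit metadata; the real engine stores maxSeqNo in commit user data.
record CommitPoint(long maxSeqNo) {}

class SafeCommitSketch {
    // Invariant 1: a commit is "safe" iff globalCheckpoint >= its max seq no.
    static boolean isSafe(CommitPoint commit, long globalCheckpoint) {
        return commit.maxSeqNo() <= globalCheckpoint;
    }

    // Pick the newest safe commit (commits ordered oldest -> newest), which is
    // what peer recovery and replica rollbacks rely on existing.
    static Optional<CommitPoint> newestSafeCommit(List<CommitPoint> commits, long globalCheckpoint) {
        return commits.stream()
                .filter(c -> isSafe(c, globalCheckpoint))
                .reduce((first, second) -> second);
    }
}
```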

Follower indices currently violate these two assumptions:

  • recovery from remote does not create a safe commit (the global checkpoint in the follower index's replication group == local checkpoint of the index commit that it copied over, which can be < max seq no of the index commit that was copied over). This is a problem for peer recovery (and possibly replica rollbacks) which rely on the existence of a safe commit.
  • primaries of follower indices are not in control of bringing their local checkpoint up to the max sequence number, due to out-of-order replication from the leader index and the possibility of following being paused or the leader index being unavailable. This is again a problem for peer recovery, which assumes that the local checkpoint can catch up to the max sequence number just by waiting for ongoing ops to complete (this issue already existed before recovery from remote, AFAICS). So what is the reason for peer recovery to wait on this condition (local checkpoint = max seq no)? Peer recovery adds the new replica as tracking and waits for the local checkpoint of the primary to catch up to the max seq no (captured in endingSeqNo) in order to ensure that all ongoing operations up to max seq no have been replicated to the replica. This ensures that all future replication requests (> endingSeqNo) will go to the new replica (which is now tracked). Similarly, all past operations (<= endingSeqNo) are then replayed in phase 2.
  • Note that this is the same type of issue we had with the new close index API, which was built on the same assumption (max seq no = local checkpoint = global checkpoint once indexing activity in the replication group is stopped), an assumption that no longer holds true for follower indices.

I think that we should try to keep assumption 1, even for follower indices, but relax assumption 2 in our replication model (at least for now, as long as we have out-of-order replication):

  • We should ensure that follower indices have a safe commit: recovery from remote should at least replay operations from the local checkpoint of the "to be safe commit" up to the max seq no of the "to be safe commit" in order to make it a safe commit before starting the primary shard. Note that we can decide on streaming more history here (above the max seq no of the commit) if that helps with retention leases. The above is the minimum to adhere to the cluster-local replication group semantics, however.
  • adapt peer recovery so that it does not need the "local checkpoint = max seq no" condition and can therefore accommodate follower indices. This is a bit more tricky AFAICS. It requires having the notion of "all pending ops that have not taken the new replica into account have completed". One way of doing this is to introduce the notion of a replication group version (local to the primary) in order to track whether there are outstanding replication requests for a previous version (a rough sketch follows the steps below). Initiating tracking (ReplicationTracker.initiateTracking) would increment the version, and all replication operations would have to register with the replication tracker first (with a releasable) before writing to the primary, so that we can track outstanding replication requests for older replication group versions and wait for their completion. Peer recovery would then use the following steps after phase 1:
    • initiate tracking (returns the incremented replication group version)
    • wait for all outstanding requests associated with older replication group versions to complete. This means that all newly indexed operations from that point on will be sent to the replica. This replaces the older condition where we waited for the local checkpoint to move up to the previous max seq no.
    • initiate phase 2, and capture a snapshot that contains all ops (starting from >= startingSeqNo) that have so far been indexed into the primary, and replay those operations to the replica. This range of operations CAN contain gaps above the primary's local checkpoint. It is still important that we replay the ops above the primary's local checkpoint in order to fully align primary and replica.
    • mark the new shard copy as in-sync (just as before), which stops the global checkpoint from advancing until the local checkpoint of the new replica has caught up to the current global checkpoint. Even if max seq no != local checkpoint, this can be guaranteed to complete as long as the replica has the same writes as the primary (ensured by the previous steps).
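A rough, self-contained sketch of the versioned replication group idea (all names hypothetical): every replication operation registers with the tracker before writing to the primary, initiating tracking bumps the version, and recovery waits until no operation from an older version is still in flight.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a versioned replication group tracker.
class VersionedReplicationTrackerSketch {
    private final AtomicLong version = new AtomicLong();
    // In-flight operation count per replication group version.
    private final Map<Long, AtomicLong> inFlight = new ConcurrentHashMap<>();

    // Replication ops register (with a releasable) before writing to the primary.
    AutoCloseable register() {
        long v = version.get();
        AtomicLong counter = inFlight.computeIfAbsent(v, k -> new AtomicLong());
        counter.incrementAndGet();
        return counter::decrementAndGet;
    }

    // Initiating tracking of the new replica increments the version; every op
    // registered from now on is guaranteed to be replicated to the new replica.
    long initiateTracking() {
        return version.incrementAndGet();
    }

    // Peer recovery waits on this before phase 2: once true, all ops that did
    // not take the new replica into account have completed.
    boolean olderVersionsDrained(long currentVersion) {
        return inFlight.entrySet().stream()
                .filter(e -> e.getKey() < currentVersion)
                .allMatch(e -> e.getValue().get() == 0);
    }
}
```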

@DaveCTurner (Contributor) commented Feb 18, 2019

> One way of doing this is to introduce the notion of a replication group version (local to the primary) in order to track whether there are outstanding replication requests for a previous version.

Could we also do this by briefly obtaining an operation block between initiating tracking and capturing the phase 2 snapshot, if the primary is a CCR follower(*)? The versioned replication group proposal has the advantage of not blocking operations, but I'm not sure we need to avoid this.

(*) Edit: I think we could also wait for the checkpoints to align for a reasonable length of time first and only obtain this block on timeout.
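A self-contained sketch of this variant (a `Semaphore` stands in for the primary's operation permits; all names are illustrative): wait a bounded time for the checkpoints to align, and only block indexing if the wait times out.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Illustrative sketch: a Semaphore models the primary's operation permits.
class OperationBlockSketch {
    private static final int ALL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore operationPermits = new Semaphore(ALL_PERMITS);

    // Indexing would acquire a single permit; the block below acquires all of them.
    void alignOrBlock(LongSupplier localCheckpoint, long endingSeqNo,
                      long waitMillis, Runnable capturePhase2Snapshot) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
        // First, wait a reasonable length of time for the checkpoints to align.
        while (localCheckpoint.getAsLong() < endingSeqNo && System.nanoTime() < deadline) {
            TimeUnit.MILLISECONDS.sleep(10);
        }
        if (localCheckpoint.getAsLong() >= endingSeqNo) {
            capturePhase2Snapshot.run(); // aligned: no block needed
            return;
        }
        operationPermits.acquire(ALL_PERMITS); // timed out: briefly block all indexing
        try {
            capturePhase2Snapshot.run();
        } finally {
            operationPermits.release(ALL_PERMITS);
        }
    }
}
```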

@dnhatn (Member) commented Feb 19, 2019

> We should ensure that follower indices have a safe commit: recovery from remote should at least replay operations from the local checkpoint of the "to be safe commit" up to the max seq no of the "to be safe commit" in order to make it a safe commit before starting the primary shard.

I am working on this.

@ywelsch (Contributor) commented Feb 19, 2019

> Could we also do this by briefly obtaining an operation block between initiating tracking and capturing the phase 2 snapshot, if the primary is a CCR follower(*)? The versioned replication group proposal has the advantage of not blocking operations, but I'm not sure we need to avoid this.

Yes, that would also work. Blocking indexing should be avoided as much as possible, though, and I would also like to avoid having CCR-specific recovery code. I have given this some more thought and think that another, simpler solution is possible.

The replication group to be used is currently sampled after indexing into the primary (see the `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases:

  • there are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable).
  • there are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling happens after that. This means that by taking a snapshot of Lucene or the translog, we will get those ops as well (condensed in the sketch below). What we can no longer guarantee is that all ops up to `endingSeqNo` are available in the snapshot (see also the comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint.
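A condensed, self-contained sketch of that ordering (illustrative names, not the `ReplicationOperation` code): because the group is sampled only after the op is written to the primary, any op that misses the new replica is already in the primary's history and will therefore appear in the phase 2 snapshot.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of the ordering in the primary's write path.
class SamplingOrderSketch {
    // Stand-in for the primary's Lucene index / translog.
    final List<Long> primaryHistory = new CopyOnWriteArrayList<>();
    volatile List<String> replicationGroup = List.of("replica-1");

    void replicate(long seqNo) {
        primaryHistory.add(seqNo);              // 1. index into the primary first
        List<String> group = replicationGroup;  // 2. only then sample the group
        group.forEach(replica -> send(replica, seqNo));
    }

    void initiateTracking(String newReplica) {
        // Ops that sample the group after this point go to newReplica; ops that
        // sampled it before are already in primaryHistory, so a snapshot of the
        // history covers them during phase 2.
        replicationGroup = List.of("replica-1", newReplica);
    }

    void send(String replica, long seqNo) { /* network call elided */ }
}
```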

I think we can therefore just remove `cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo));` and the range completeness check (`if (requiredOpsTracker.getCheckpoint() < endingSeqNo) {`) in `RecoverySourceHandler`. These were added in #27580 as an extra safeguard for dealing with 5.x BWC challenges. We could in theory keep them around if the primary shard is fully in charge of the seq no generator (on follower shards, it's not), but I'm not sure it's worth the complexity. Our tests should already check that shard copies are fully aligned after recovery (this was not the case previously because of the missing primary-replica resync).

@DaveCTurner (Contributor)

That's a pretty subtle interplay of different components, but it does make sense.

@dnhatn (Member) commented Feb 19, 2019

> I think we can therefore just remove `cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo));` and the range completeness check (`if (requiredOpsTracker.getCheckpoint() < endingSeqNo) {`) in `RecoverySourceHandler`

@ywelsch This is brilliant 🎉

dnhatn added a commit that referenced this issue Feb 25, 2019
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:

- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).

- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica.  However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (i.e.  also see
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.

Relates #39000
Closes #38949

Co-authored-by: Yannick Welsch <[email protected]>
@dnhatn (Member) commented Feb 25, 2019

I am closing this issue as it was resolved by #39006. The safe-commit part is considered an enhancement and will be handled in #39153.
