
Fail in-sync replica if incoming global checkpoint is higher than local checkpoint #25485

Closed

Conversation

@ywelsch (Contributor) commented Jun 30, 2017

If an active replica detects that its local checkpoint is lower than the global checkpoint it receives from the primary, there should be a hard failure, as otherwise the replica's local checkpoint might get stuck and never advance. While we never expect this situation to happen (and if it does, it will probably be due to a bug in the GlobalCheckpointTracker not properly accounting for this situation), we should treat it as a hard failure.
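For illustration, a minimal sketch of the invariant described above; the class and method names are assumed for the example and are not the actual patch:

// Hypothetical sketch (not the actual change): an active, in-sync replica must never
// observe a global checkpoint from the primary that is ahead of its own local
// checkpoint; if it does, treat it as a hard failure instead of carrying on.
final class ReplicaCheckpointCheck {
    static void checkIncomingGlobalCheckpoint(long globalCheckpointFromPrimary, long localCheckpoint) {
        if (globalCheckpointFromPrimary > localCheckpoint) {
            throw new IllegalStateException("in-sync replica received global checkpoint ["
                + globalCheckpointFromPrimary + "] that is above its local checkpoint [" + localCheckpoint + "]");
        }
        // otherwise the replica can safely advance its knowledge of the global checkpoint
    }
}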

@jasontedor (Member) left a comment

I left a comment.

if (shardState == IndexShardState.POST_RECOVERY ||
    shardState == IndexShardState.STARTED ||
    shardState == IndexShardState.RELOCATED) {
    throw new AssertionError("supposedly in-sync shard copy received a global checkpoint [" + globalCheckpoint + "] " +

Member:

This is entirely too harsh; it will fail the node if we get this wrong. We should fail the shard for sure, though.

Contributor:

+1. Good catch. I missed it. It would still be good to kill the node when testing, so we should have some assertions here too.

Member:

I agree; an assert would be good so that we indeed fail hard during testing, rather than failing the shard, having it recover, and possibly not failing any tests.
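A rough sketch of the combined approach discussed in this thread (the guard class and the shard-failure hook are assumed names for the example, not the change that was eventually made):

// Sketch only (assumed names): fail the shard rather than the whole node in production,
// but also trip a Java assertion so that test runs with -ea enabled still fail hard.
final class InSyncCheckpointGuard {
    // Hypothetical hook standing in for the real shard-failure machinery.
    interface ShardFailureHook {
        void failShard(String reason, Exception cause);
    }

    static void onGlobalCheckpointAheadOfLocal(long globalCheckpoint, long localCheckpoint, ShardFailureHook hook) {
        final String message = "in-sync shard copy received a global checkpoint [" + globalCheckpoint
            + "] that is higher than its local checkpoint [" + localCheckpoint + "]";
        assert false : message;                                       // fails hard when assertions are enabled (testing)
        hook.failShard(message, new IllegalStateException(message));  // in production, only the shard is failed and recovered
    }
}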

@ywelsch (Contributor, Author) commented Jul 11, 2017

Unfortunately, the property asserted by this PR cannot hold as long as we use the cluster state as the basis for replicating changes. For example, an active shard that is being failed or closed by the master (e.g. a primary or replica relocation source after relocation completion) can receive a replication request with a global checkpoint that is higher than its local checkpoint: the primary might have removed that shard copy from the GlobalCheckpointTracker's in-sync set and advanced the global checkpoint, while the corresponding cluster state is not yet fully applied and exposed through ClusterService.state(). ReplicationOperation will therefore send a global checkpoint to a shard copy that had already been removed from the replication group when that global checkpoint was computed.
A proper solution would be to have GlobalCheckpointTracker fully manage the replication targets.
A quick hack might be to trim the replication targets (taken from the cluster state) with the allocation ids from the GlobalCheckpointTracker after sampling the global checkpoint.
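A rough sketch of that quick hack, with allocation ids modeled as plain strings; all names here are assumed, and the real ReplicationOperation and GlobalCheckpointTracker APIs differ:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only (assumed names): keep only those replication targets whose allocation id
// was still tracked as in-sync by the GlobalCheckpointTracker at the moment the global
// checkpoint was sampled, so a copy already removed from the in-sync set never receives
// a global checkpoint that may be ahead of its local checkpoint.
final class ReplicationTargetTrimmer {
    static List<String> trimByTrackedAllocationIds(List<String> targetAllocationIdsFromClusterState,
                                                   Set<String> inSyncAllocationIdsAtSampleTime) {
        final List<String> trimmed = new ArrayList<>();
        for (String allocationId : targetAllocationIdsFromClusterState) {
            if (inSyncAllocationIdsAtSampleTime.contains(allocationId)) {
                trimmed.add(allocationId);
            }
        }
        return trimmed;
    }
}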

@ywelsch closed this Jul 11, 2017
ywelsch added a commit that referenced this pull request Jul 14, 2017
Currently replication and recovery are both coordinated through the latest cluster state available on the ClusterService as well as through the GlobalCheckpointTracker (to have consistent local/global checkpoint information), making it difficult to understand the relation between recovery and replication, and requiring some tricky checks in the recovery code to coordinate between the two. This commit makes the primary the single owner of its replication group, which simplifies the replication model and allows us to clean up corner cases in our recovery code. It also reduces the dependencies in the code, so that neither RecoverySourceXXX nor ReplicationOperation needs access to the latest state on ClusterService anymore. Finally, it gives us the property that in-sync shard copies won't receive global checkpoint updates which are above their local checkpoint (relates #25485).
@clintongormley added the :Distributed Indexing/Engine label and removed the :Sequence IDs label on Feb 14, 2018
Labels
:Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.), >enhancement, resiliency
4 participants