Fail in-sync replica if incoming global checkpoint is higher than local checkpoint #25485
Conversation
This condition is so bad/unexpected that we always want a hard failure.
if (shardState == IndexShardState.POST_RECOVERY ||
    shardState == IndexShardState.STARTED ||
    shardState == IndexShardState.RELOCATED) {
    throw new AssertionError("supposedly in-sync shard copy received a global checkpoint [" + globalCheckpoint + "] " +
This is entirely too harsh; it will fail the node if we get this wrong. We should fail the shard for sure, though.
+1. Good catch. I missed it. It would still be good to kill the node when testing - so we should have some assertions here too.
I agree, an assert would be good so that we indeed fail hard during testing rather than failing the shard and having it recover and possibly not failing any tests.
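A minimal sketch of the pattern discussed here, assuming a hypothetical guard class and a failShard hook (neither is the actual Elasticsearch code): a Java assert makes tests fail hard when assertions are enabled, while production nodes only fail the shard copy rather than the whole node.

import java.util.function.BiConsumer;

// Illustrative sketch only; the class name and the failShard hook are assumptions.
class InSyncCopyGuard {
    private final BiConsumer<String, Exception> failShard; // fails just this shard copy, not the node

    InSyncCopyGuard(BiConsumer<String, Exception> failShard) {
        this.failShard = failShard;
    }

    void onGlobalCheckpointAheadOfLocalCheckpoint(long globalCheckpoint, long localCheckpoint) {
        final String message = "supposedly in-sync shard copy received a global checkpoint ["
            + globalCheckpoint + "] that is higher than its local checkpoint [" + localCheckpoint + "]";
        // Trips only when assertions are enabled (tests/CI run with -ea), killing the test JVM.
        assert false : message;
        // In production, fail the shard so it can be recovered elsewhere instead of failing the node.
        failShard.accept(message, new IllegalStateException(message));
    }
}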
Unfortunately, the property asserted by this PR cannot hold as long as we use the cluster state as a basis to replicate changes. For example, an active shard that is being failed or closed by the master (e.g. a primary or replica relocation source after relocation completion) can receive a replication request with a global checkpoint that is higher than its local checkpoint, because the primary might have removed that shard copy from the set of copies it tracks for the global checkpoint, allowing the global checkpoint to advance past that copy's local checkpoint.
Currently replication and recovery are both coordinated through the latest cluster state available on the ClusterService as well as through the GlobalCheckpointTracker (to have consistent local/global checkpoint information), making it difficult to understand the relation between recovery and replication, and requiring some tricky checks in the recovery code to coordinate between the two. This commit makes the primary the single owner of its replication group, which simplifies the replication model and allows us to clean up corner cases we have in our recovery code. It also reduces the dependencies in the code, so that neither RecoverySourceXXX nor ReplicationOperation needs access to the latest state on ClusterService anymore. Finally, it gives us the property that in-sync shard copies won't receive global checkpoint updates which are above their local checkpoint (relates #25485).
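To illustrate the "primary is the single owner of its replication group" idea from the commit message, a simplified sketch follows; the class, field, and method names are assumptions for illustration, not the actual code introduced by the commit.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified illustration of a primary-owned replication group.
final class ReplicationGroupSketch {
    private final Set<String> inSyncAllocationIds = new HashSet<>();   // copies whose local checkpoints bound the global checkpoint
    private final Set<String> trackedAllocationIds = new HashSet<>();  // in-sync copies plus initializing/recovering ones

    boolean isInSync(String allocationId) {
        return inSyncAllocationIds.contains(allocationId);
    }

    // The primary replicates every operation to all copies it tracks itself,
    // rather than to whatever the latest cluster state on ClusterService says;
    // per the commit message, this is what keeps in-sync copies from seeing a
    // global checkpoint above their own local checkpoint.
    Set<String> replicationTargets() {
        return Collections.unmodifiableSet(trackedAllocationIds);
    }
}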
In the case where an active replica detects that its local checkpoint is lower than the global checkpoint it receives from the primary, there should be a hard failure, as otherwise the replica might have its local checkpoint stuck, unable to advance. While we never expect this situation to happen (and if it does, it will probably be due to a bug in the GlobalCheckpointTracker not properly accounting for this situation), we should treat it as a hard failure.
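For intuition about why a higher incoming global checkpoint signals a bookkeeping bug, here is a tiny worked example; the min-of-local-checkpoints computation is a simplification used only for illustration.

import java.util.Map;

class GlobalCheckpointInvariantDemo {
    // Simplified: the global checkpoint is the minimum local checkpoint across the in-sync copies.
    static long computeGlobalCheckpoint(Map<String, Long> inSyncLocalCheckpoints) {
        return inSyncLocalCheckpoints.values().stream().mapToLong(Long::longValue).min().orElse(-1L);
    }

    public static void main(String[] args) {
        // primary: 10, replica A: 7, replica B: 5  ->  global checkpoint 5
        long gcp = computeGlobalCheckpoint(Map.of("primary", 10L, "replicaA", 7L, "replicaB", 5L));
        // Every in-sync copy therefore has localCheckpoint >= gcp; an in-sync copy receiving a
        // global checkpoint above its local checkpoint (e.g. replica B seeing 7) can only happen
        // if the tracker mis-accounted for that copy.
        System.out.println("global checkpoint = " + gcp); // prints 5
    }
}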