Fail in-sync replica if incoming global checkpoint is higher than local checkpoint #25485
Conversation
This condition is so bad/unexpected that we always want a hard failure.
if (shardState == IndexShardState.POST_RECOVERY ||
    shardState == IndexShardState.STARTED ||
    shardState == IndexShardState.RELOCATED) {
    throw new AssertionError("supposedly in-sync shard copy received a global checkpoint [" + globalCheckpoint + "] " +
This is entirely too harsh; it will fail the node if we get this wrong. We should fail the shard for sure, though.
+1. Good catch. I missed it. It would still be good to kill the node when testing - so we should have some assertions here too.
I agree, an assert would be good so that we indeed fail hard during testing rather than failing the shard and having it recover and possibly not failing any tests.
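A minimal sketch of the pattern discussed here, assuming a hypothetical guard class and a failShard hook (neither is the actual Elasticsearch code): a Java assert makes tests fail hard when assertions are enabled, while production nodes only fail the shard copy rather than the whole node.

import java.util.function.BiConsumer;

// Illustrative sketch only; the class name and the failShard hook are assumptions.
class InSyncCopyGuard {
    private final BiConsumer<String, Exception> failShard; // fails just this shard copy, not the node

    InSyncCopyGuard(BiConsumer<String, Exception> failShard) {
        this.failShard = failShard;
    }

    void onGlobalCheckpointAheadOfLocalCheckpoint(long globalCheckpoint, long localCheckpoint) {
        final String message = "supposedly in-sync shard copy received a global checkpoint ["
            + globalCheckpoint + "] that is higher than its local checkpoint [" + localCheckpoint + "]";
        // Trips only when assertions are enabled (tests/CI run with -ea), killing the test JVM.
        assert false : message;
        // In production, fail the shard so it can be recovered elsewhere instead of failing the node.
        failShard.accept(message, new IllegalStateException(message));
    }
}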
Unfortunately, the property asserted by this PR cannot hold as long as we use the cluster state as a basis to replicate changes. For example, an active shard that is being failed or closed by the master (e.g. a primary or replica relocation source after relocation completion) can receive a replication request with a global checkpoint that is higher than its local checkpoint, because the primary might have removed that shard copy from the set of copies it tracks for the global checkpoint, allowing the global checkpoint to advance past that copy's local checkpoint.
Currently replication and recovery are both coordinated through the latest cluster state available on the ClusterService as well as through the GlobalCheckpointTracker (to have consistent local/global checkpoint information), making it difficult to understand the relation between recovery and replication, and requiring some tricky checks in the recovery code to coordinate between the two. This commit makes the primary the single owner of its replication group, which simplifies the replication model and allows us to clean up corner cases we have in our recovery code. It also reduces the dependencies in the code, so that neither RecoverySourceXXX nor ReplicationOperation needs access to the latest state on ClusterService anymore. Finally, it gives us the property that in-sync shard copies won't receive global checkpoint updates which are above their local checkpoint (relates #25485).
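To illustrate the "primary is the single owner of its replication group" idea from the commit message, a simplified sketch follows; the class, field, and method names are assumptions for illustration, not the actual code introduced by the commit.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified illustration of a primary-owned replication group.
final class ReplicationGroupSketch {
    private final Set<String> inSyncAllocationIds = new HashSet<>();   // copies whose local checkpoints bound the global checkpoint
    private final Set<String> trackedAllocationIds = new HashSet<>();  // in-sync copies plus initializing/recovering ones

    boolean isInSync(String allocationId) {
        return inSyncAllocationIds.contains(allocationId);
    }

    // The primary replicates every operation to all copies it tracks itself,
    // rather than to whatever the latest cluster state on ClusterService says;
    // per the commit message, this is what keeps in-sync copies from seeing a
    // global checkpoint above their own local checkpoint.
    Set<String> replicationTargets() {
        return Collections.unmodifiableSet(trackedAllocationIds);
    }
}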
In the case where an active replica detects that its local checkpoint is lower than the global checkpoint it receives from the primary, there should be a hard failure, as otherwise the replica might have its local checkpoint stuck, unable to advance. While we never expect this situation to happen (and if it does, it will probably be due to a bug in the GlobalCheckpointTracker not properly accounting for this situation), we should treat it as a hard failure.
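For intuition about why a higher incoming global checkpoint signals a bookkeeping bug, here is a tiny worked example; the min-of-local-checkpoints computation is a simplification used only for illustration.

import java.util.Map;

class GlobalCheckpointInvariantDemo {
    // Simplified: the global checkpoint is the minimum local checkpoint across the in-sync copies.
    static long computeGlobalCheckpoint(Map<String, Long> inSyncLocalCheckpoints) {
        return inSyncLocalCheckpoints.values().stream().mapToLong(Long::longValue).min().orElse(-1L);
    }

    public static void main(String[] args) {
        // primary: 10, replica A: 7, replica B: 5  ->  global checkpoint 5
        long gcp = computeGlobalCheckpoint(Map.of("primary", 10L, "replicaA", 7L, "replicaB", 5L));
        // Every in-sync copy therefore has localCheckpoint >= gcp; an in-sync copy receiving a
        // global checkpoint above its local checkpoint (e.g. replica B seeing 7) can only happen
        // if the tracker mis-accounted for that copy.
        System.out.println("global checkpoint = " + gcp); // prints 5
    }
}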