
[BUG] [Segment Replication] Resync failures result in removal of in-sync allocation id #7163

Closed
Tracked by #6761
dreamer-89 opened this issue Apr 14, 2023 · 2 comments
Labels: bug (Something isn't working) · distributed framework · Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

Comments

@dreamer-89
Member

Describe the bug
Coming from the #6761 exercise, a few tests are flaky because the in-sync allocation id set ends up smaller than expected. In the failing test, a network disruption (partition) is created between the replica nodes, followed by a primary node stop, which results in the promotion of a replica from one side of the partition. The promoted replica then performs a resync against the existing replicas. Because of the partition, the resync fails on the replicas in the other partition, those replicas are removed from the in-sync allocation id set, and the test assertion trips.
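For concreteness, the sequence looks roughly like the sketch below. This is plain Java with hypothetical helpers (`Cluster`, `partition`, `stopNode`, `inSyncAllocationIds` are illustrative names only, not the actual OpenSearch test framework API):

```java
import java.util.Set;

// Hypothetical sketch of the failing scenario; all names are illustrative.
class ResyncFailureScenario {
    interface Cluster {
        void partition(Set<String> sideA, Set<String> sideB); // hypothetical helper
        void stopNode(String nodeName);                       // hypothetical helper
        Set<String> inSyncAllocationIds(String index);        // hypothetical helper
    }

    static void reproduce(Cluster cluster) {
        // 1. Create a network partition between the replica nodes.
        cluster.partition(Set.of("replica-1"), Set.of("replica-2"));
        // 2. Stop the primary; a replica from one side gets promoted.
        cluster.stopNode("primary");
        // 3. The promoted primary's resync fails on the replicas across the
        //    partition, so their allocation ids drop out of the in-sync set
        //    and the test's size assertion trips.
        assert cluster.inSyncAllocationIds("test-index").size() < 3;
    }
}
```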

Background
The cluster manager node assigns a unique id, called an allocation id, when a shard is allocated to a node. The cluster manager keeps the list of all active allocation ids (also called in-sync allocation ids) belonging to a replication group in the cluster state, persisted on disk. An inactive replica is one that is not able to keep up with the primary and thus shouldn't be used during failover. During failover, when the primary dies, the cluster manager pings all nodes containing shard data, filters for the copies whose allocation ids are in the in-sync set, and selects one node as the new primary.
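As a rough mental model (a minimal sketch, not OpenSearch's actual classes), the in-sync set and the failover filtering behave like this:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of in-sync allocation id tracking; not the real implementation.
class ReplicationGroupModel {
    private final Set<String> inSyncAllocationIds = new HashSet<>();

    void markInSync(String allocationId) {
        inSyncAllocationIds.add(allocationId);
    }

    // A copy that falls behind (or fails a resync) is dropped from the set.
    void markStale(String allocationId) {
        inSyncAllocationIds.remove(allocationId);
    }

    // On failover, only copies whose allocation id is still in-sync are
    // candidates for promotion to primary.
    String selectNewPrimary(List<String> copiesWithShardData) {
        return copiesWithShardData.stream()
                .filter(inSyncAllocationIds::contains)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no in-sync copy available"));
    }
}
```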

Impact
Low. This failure only happens when there are other node-to-node communication problems, so it is not a likely case.

To Reproduce
PrimaryAllocationIT.testPrimaryReplicaResyncFailed fails reliably

Expected behavior
A resync failure between the primary and a replica should not result in the removal of the replica's allocation id from the in-sync set.

@dreamer-89
Member Author

The issue happens due to RetentionLeaseSyncAction failures on the replica. Because this action extends TransportWriteAction, its failure results in the shard being marked out of sync, hence the test failure. This issue needs a deeper dive; prioritizing other 2.10.0 issues over this one as we are approaching the code freeze date.
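The failure path can be pictured with the sketch below (hypothetical names throughout; the real logic lives in TransportWriteAction's replica failure handling):

```java
// Conceptual sketch only; ClusterManagerClient and its method are hypothetical.
class WriteActionFailureSketch {
    interface ClusterManagerClient {
        void failShardAndMarkStale(String allocationId, Exception cause);
    }

    private final ClusterManagerClient clusterManager;

    WriteActionFailureSketch(ClusterManagerClient clusterManager) {
        this.clusterManager = clusterManager;
    }

    // Write-type replication actions treat a per-replica failure as fatal for
    // that copy: the shard is failed and its allocation id is marked stale.
    // RetentionLeaseSyncAction inherits this behavior, so a resync failure on
    // a partitioned replica removes that replica from the in-sync set.
    void onReplicaFailure(String allocationId, Exception cause) {
        clusterManager.failShardAndMarkStale(allocationId, cause);
    }
}
```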

CC @mch2 @anasalkouz

@dreamer-89
Member Author

This test fails because the replica shard copy is removed from the in-sync set. This happens because, during peer recovery, the replica does not replay (index) the local translog operations, as that is not supported with NRTReplicationEngine (there is no IndexWriter). The primary shard therefore resorts to file-based recovery, where it first removes the existing retention leases and then syncs them to the replica shard copies. RetentionLeaseSyncAction extends TransportWriteAction, which treats shard operation failures as problematic and marks the shard copy as stale while failing the shard; that is the cause of the test failure here. Since sequence-number-based recoveries are not allowed on replica shard copies with segment replication, this test needs to be muted/blocked with the segment replication feature. Opened #10003 separately to discuss any implications of ignoring sequence-number-based recoveries.
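The recovery-mode decision that triggers this chain can be summarized as follows (a simplified sketch under the assumptions above, not the actual OpenSearch recovery planner):

```java
// Simplified, hypothetical recovery-mode decision for illustration only.
enum RecoveryMode { SEQUENCE_NUMBER_BASED, FILE_BASED }

class RecoveryPlannerSketch {
    static RecoveryMode plan(boolean segmentReplicationEnabled, boolean primaryRetainsRequiredOps) {
        if (segmentReplicationEnabled) {
            // NRTReplicationEngine has no IndexWriter, so the replica cannot
            // replay (index) translog operations locally; file-based recovery
            // is the only option, which in turn removes and re-syncs retention
            // leases via RetentionLeaseSyncAction.
            return RecoveryMode.FILE_BASED;
        }
        // With document replication, replay from the translog when the primary
        // still retains the needed operations; otherwise copy segment files.
        return primaryRetainsRequiredOps ? RecoveryMode.SEQUENCE_NUMBER_BASED : RecoveryMode.FILE_BASED;
    }
}
```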
