[BUG] [Segment Replication] Resync failures results in removal of in-sync allocation id #7163
Labels
bug, distributed framework, Indexing:Replication
Describe the bug
Coming from the #6761 exercise, a few tests are flaky because the set of in-sync allocation IDs ends up smaller than expected. In the failing test, a network disruption (partition) is created between the replica nodes, followed by a stop of the primary node, which results in promotion of one of the replicas from one side of the partition. The promoted replica performs resync operations on the existing replicas. Because of the partition, the resync fails for the replicas on the other side, and those replicas are then removed from the in-sync allocation ID set, which trips the test assertion.
Background
The cluster-manager node assigns a unique ID to each shard copy when it is allocated to a node, called the allocation ID. The cluster-manager keeps the list of all active allocation IDs (also called in-sync allocation IDs) belonging to a replication group in the cluster state, persisted on disk. An inactive replica is one that cannot keep up with the primary and therefore should not be used during failover. During failover, when the primary dies, the cluster-manager pings all nodes containing data for the shard, filters the copies whose allocation IDs are in the in-sync set, and selects one of those nodes for the new primary.
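For reference, the in-sync set can be read from the cluster state (it also shows up in the _cluster/state response under in_sync_allocations). A minimal sketch, assuming it lives inside an integration test that extends org.opensearch.test.OpenSearchIntegTestCase; the index name "test" is hypothetical:

```java
// Required imports: java.util.Set, org.opensearch.cluster.ClusterState.

// Returns the in-sync allocation IDs the cluster-manager tracks for one shard
// of the given index, as recorded in IndexMetadata inside the cluster state.
private Set<String> inSyncAllocationIds(String index, int shardId) {
    ClusterState state = client().admin().cluster().prepareState().get().getState();
    return state.metadata().index(index).inSyncAllocationIds(shardId);
}
```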
Impact
Low. This failure happens only when there are other problems with inter-node communication (such as a network partition), so it is not a likely case.
To Reproduce
PrimaryAllocationIT.testPrimaryReplicaResyncFailed fails reliably
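For orientation, below is a condensed, hypothetical sketch of the scenario, not the actual PrimaryAllocationIT test body: the class name, the index name "test", the node/document counts, and the stabilization steps are all made up for illustration, and segment replication is assumed to be enabled via the index.replication.type setting. The final assertion expresses the invariant the flaky test depends on and that currently trips.

```java
import java.util.Set;
import java.util.stream.Collectors;

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.metadata.IndexMetadata;
import org.opensearch.common.settings.Settings;
import org.opensearch.test.InternalTestCluster;
import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.disruption.NetworkDisruption;

public class ResyncInSyncIdsSketchIT extends OpenSearchIntegTestCase {

    public void testResyncFailureKeepsReplicasInSync() throws Exception {
        internalCluster().startClusterManagerOnlyNode();
        // Start the primary's node first so the single primary shard lands on it.
        final String primaryNode = internalCluster().startDataOnlyNode();
        createIndex("test", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 2)
            .put("index.replication.type", "SEGMENT") // use segment replication
            .build());
        final String replicaNode1 = internalCluster().startDataOnlyNode();
        final String replicaNode2 = internalCluster().startDataOnlyNode();
        ensureGreen("test");
        for (int i = 0; i < 10; i++) {
            client().prepareIndex("test").setId(Integer.toString(i)).setSource("f", i).get();
        }

        // Remember the allocation IDs of both replicas while the cluster is healthy.
        ClusterState before = client().admin().cluster().prepareState().get().getState();
        Set<String> replicaAllocationIds = before.routingTable().index("test").shard(0)
            .replicaShards().stream()
            .map(shard -> shard.allocationId().getId())
            .collect(Collectors.toSet());

        // Partition the two replica nodes from each other, then stop the primary's
        // node so a replica from one side of the partition is promoted.
        NetworkDisruption partition = new NetworkDisruption(
            new NetworkDisruption.TwoPartitions(replicaNode1, replicaNode2),
            NetworkDisruption.DISCONNECT
        );
        internalCluster().setDisruptionScheme(partition);
        partition.startDisrupting();
        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));

        // The promoted replica resyncs to the replica it cannot reach and fails.
        partition.stopDisrupting();
        ensureStableCluster(3); // cluster-manager node plus the two remaining data nodes

        // Expected behavior: the failed resync alone should not shrink the in-sync
        // set, so both replicas' allocation IDs should still be present.
        Set<String> inSyncIds = client().admin().cluster().prepareState().get()
            .getState().metadata().index("test").inSyncAllocationIds(0);
        assertTrue(inSyncIds.containsAll(replicaAllocationIds));
    }
}
```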
Expected behavior
A resync failure between the primary and a replica should not result in removal of the replica's allocation ID from the in-sync set.