
[BUG] [Segment Replication] Resync failures result in removal of in-sync allocation id #7163

Closed
Tracked by #6761
dreamer-89 opened this issue Apr 14, 2023 · 2 comments
Labels: bug (Something isn't working) · distributed framework · Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

Comments

@dreamer-89
Member

Describe the bug
Coming from the #6761 exercise, a few tests are flaky because the in-sync allocation id set ends up smaller than expected. In the failing test, a network disruption (partition) is created between the replica nodes, followed by a primary node stop, which results in the promotion of a replica from one side of the partition. The promoted replica then performs a resync against the existing replicas. Because of the partition, the resync fails on the replicas in the other partition, those replicas are removed from the in-sync allocation id set, and the test assertion trips.
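For concreteness, the sequence looks roughly like the sketch below. This is plain Java with hypothetical helpers (`Cluster`, `partition`, `stopNode`, `inSyncAllocationIds` are illustrative names only, not the actual OpenSearch test framework API):

```java
import java.util.Set;

// Hypothetical sketch of the failing scenario; all names are illustrative.
class ResyncFailureScenario {
    interface Cluster {
        void partition(Set<String> sideA, Set<String> sideB); // hypothetical helper
        void stopNode(String nodeName);                       // hypothetical helper
        Set<String> inSyncAllocationIds(String index);        // hypothetical helper
    }

    static void reproduce(Cluster cluster) {
        // 1. Create a network partition between the replica nodes.
        cluster.partition(Set.of("replica-1"), Set.of("replica-2"));
        // 2. Stop the primary; a replica from one side gets promoted.
        cluster.stopNode("primary");
        // 3. The promoted primary's resync fails on the replicas across the
        //    partition, so their allocation ids drop out of the in-sync set
        //    and the test's size assertion trips.
        assert cluster.inSyncAllocationIds("test-index").size() < 3;
    }
}
```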

Background
The cluster manager node assigns a unique id, called an allocation id, when a shard is allocated to a node. The cluster manager keeps the list of all active allocation ids (also called in-sync allocation ids) belonging to a replication group in the cluster state, persisted on disk. An inactive replica is one that is not able to keep up with the primary and thus shouldn't be used during failover. During failover, when the primary dies, the cluster manager pings all nodes containing shard data, filters for the copies whose allocation ids are in the in-sync set, and selects one node as the new primary.
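As a rough mental model (a minimal sketch, not OpenSearch's actual classes), the in-sync set and the failover filtering behave like this:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of in-sync allocation id tracking; not the real implementation.
class ReplicationGroupModel {
    private final Set<String> inSyncAllocationIds = new HashSet<>();

    void markInSync(String allocationId) {
        inSyncAllocationIds.add(allocationId);
    }

    // A copy that falls behind (or fails a resync) is dropped from the set.
    void markStale(String allocationId) {
        inSyncAllocationIds.remove(allocationId);
    }

    // On failover, only copies whose allocation id is still in-sync are
    // candidates for promotion to primary.
    String selectNewPrimary(List<String> copiesWithShardData) {
        return copiesWithShardData.stream()
                .filter(inSyncAllocationIds::contains)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no in-sync copy available"));
    }
}
```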

Impact
Low. This failure only happens when there are other node-to-node communication problems, so it is not a likely case.

To Reproduce
PrimaryAllocationIT.testPrimaryReplicaResyncFailed fails reliably

Expected behavior
A resync failure between the primary and a replica should not result in the removal of the replica's allocation id from the in-sync set.

@dreamer-89
Member Author

The issue happens due to RetentionLeaseSyncAction failures on the replica. Because this action extends TransportWriteAction, its failure results in the shard being marked out of sync, hence the test failure. This issue needs a deeper dive; prioritizing other 2.10.0 issues over this one as we are approaching the code freeze date.
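The failure path can be pictured with the sketch below (hypothetical names throughout; the real logic lives in TransportWriteAction's replica failure handling):

```java
// Conceptual sketch only; ClusterManagerClient and its method are hypothetical.
class WriteActionFailureSketch {
    interface ClusterManagerClient {
        void failShardAndMarkStale(String allocationId, Exception cause);
    }

    private final ClusterManagerClient clusterManager;

    WriteActionFailureSketch(ClusterManagerClient clusterManager) {
        this.clusterManager = clusterManager;
    }

    // Write-type replication actions treat a per-replica failure as fatal for
    // that copy: the shard is failed and its allocation id is marked stale.
    // RetentionLeaseSyncAction inherits this behavior, so a resync failure on
    // a partitioned replica removes that replica from the in-sync set.
    void onReplicaFailure(String allocationId, Exception cause) {
        clusterManager.failShardAndMarkStale(allocationId, cause);
    }
}
```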

CC @mch2 @anasalkouz

@dreamer-89
Member Author

This test fails because the replica shard copy is removed from the in-sync set. This happens because, during peer recovery, the replica does not replay (index) the local translog operations, as that is not supported with NRTReplicationEngine (there is no IndexWriter). The primary shard therefore resorts to file-based recovery, where it first removes the existing retention leases and then syncs them to the replica shard copies. RetentionLeaseSyncAction extends TransportWriteAction, which treats shard operation failures as problematic and marks the shard copy as stale while failing the shard; that is the cause of the test failure here. Since sequence-number-based recoveries are not allowed on replica shard copies with segment replication, this test needs to be muted/blocked with the segment replication feature. Opened #10003 separately to discuss any implications of ignoring sequence-number-based recoveries.
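The recovery-mode decision that triggers this chain can be summarized as follows (a simplified sketch under the assumptions above, not the actual OpenSearch recovery planner):

```java
// Simplified, hypothetical recovery-mode decision for illustration only.
enum RecoveryMode { SEQUENCE_NUMBER_BASED, FILE_BASED }

class RecoveryPlannerSketch {
    static RecoveryMode plan(boolean segmentReplicationEnabled, boolean primaryRetainsRequiredOps) {
        if (segmentReplicationEnabled) {
            // NRTReplicationEngine has no IndexWriter, so the replica cannot
            // replay (index) translog operations locally; file-based recovery
            // is the only option, which in turn removes and re-syncs retention
            // leases via RetentionLeaseSyncAction.
            return RecoveryMode.FILE_BASED;
        }
        // With document replication, replay from the translog when the primary
        // still retains the needed operations; otherwise copy segment files.
        return primaryRetainsRequiredOps ? RecoveryMode.SEQUENCE_NUMBER_BASED : RecoveryMode.FILE_BASED;
    }
}
```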
