[BUG] Parallel segment replication rounds for same index on replica #5706

Closed
dreamer-89 opened this issue Jan 4, 2023 · 3 comments · Fixed by #5831
Labels: bug (Something isn't working)

Comments

@dreamer-89 (Member)

Describe the bug
Coming from #5344 (comment): with a primary allocation that does not bump the primary term (e.g. primary relocation), it is possible for a replica to be performing segment replication with both the older and the new primary.

To Reproduce
The integration test below was used to simulate the issue:

    /**
     * This test tries to mimic a state where segment replication from the older primary (after the primary has been
     * relocated) is still happening on the target/replica node and is not caught by the existing guards
     * (state/index/shard listeners). The test simulates this by blocking segment replication from the older primary
     * to the replica node and then triggering a primary relocation to the target node. After the primary changes,
     * the older primary is still performing segrep with the replica node.
     */
    public void testPrimaryRelocationWithDup() throws Exception {
        final String old_primary = internalCluster().startNode();
        createIndex(INDEX_NAME);
        final String replica = internalCluster().startNode();
        ensureGreen(INDEX_NAME);

        CountDownLatch latch = new CountDownLatch(1);
        // Mock the transport service on the old primary to block segment replication (file chunk sends)
        // to the replica until the latch is released.
        MockTransportService mockTransportService = ((MockTransportService) internalCluster().getInstance(
            TransportService.class,
            old_primary
        ));
        mockTransportService.addSendBehavior(
            internalCluster().getInstance(TransportService.class, replica),
            (connection, requestId, action, request, options) -> {
                if (action.equals(SegmentReplicationTargetService.Actions.FILE_CHUNK)) {
                    try {
                        logger.info("--> blocking old primary");
                        latch.await();
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
                connection.sendRequest(requestId, action, request, options);
            }
        );

        final int initialDocCount = scaledRandomIntBetween(0, 200);
        for (int i = 0; i < initialDocCount; i++) {
            client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet();
        }
        refresh(INDEX_NAME); // this blocks the segrep on old primary -> replica

        logger.info("--> start target node");
        final String new_primary = internalCluster().startNode();
        ClusterHealthResponse clusterHealthResponse = client().admin()
            .cluster()
            .prepareHealth()
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes("3")
            .execute()
            .actionGet();
        assertThat(clusterHealthResponse.isTimedOut(), equalTo(false));

        logger.info("--> relocate the shard");
        client().admin()
            .cluster()
            .prepareReroute()
            .add(new MoveAllocationCommand(INDEX_NAME, 0, old_primary, new_primary))
            .execute()
            .actionGet();
        clusterHealthResponse = client().admin()
            .cluster()
            .prepareHealth()
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNoRelocatingShards(true)
            .setTimeout(ACCEPTABLE_RELOCATION_TIME)
            .execute()
            .actionGet();
        assertThat(clusterHealthResponse.isTimedOut(), equalTo(false));

        logger.info("--> get the state, verify shard 1 primary moved from node1 to node2");
        ClusterState state = client().admin().cluster().prepareState().execute().actionGet().getState();

        logger.info("--> state {}", state);

        assertThat(
            state.getRoutingNodes().node(state.nodes().resolveNode(new_primary).getId()).iterator().next().state(),
            equalTo(ShardRoutingState.STARTED)
        );

        final int finalDocCount = initialDocCount;
        for (int i = initialDocCount; i < 2 * initialDocCount; i++) {
            client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet();
        }
        refresh(INDEX_NAME);

        final IndexShard indexShard = getIndexShard(new_primary);

        ReplicationCollection<SegmentReplicationTarget> replications = internalCluster().getInstance(
            SegmentReplicationTargetService.class,
            replica
        ).getOnGoingReplications();
        PrimaryShardReplicationSource source = (PrimaryShardReplicationSource) replications
            .getOngoingReplicationTarget(indexShard.shardId())
            .getSource();

        assertNotEquals(source.getSourceNode().getName(), old_primary);
        logger.info("Source node {} {}", source.getSourceNode().getName(), old_primary);

        logger.info("--> verifying count again {}", initialDocCount + finalDocCount);
        client().admin().indices().prepareRefresh().execute().actionGet();
        assertHitCount(
            client(new_primary).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(),
            initialDocCount + finalDocCount
        );
        assertHitCount(
            client(replica).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(),
            initialDocCount + finalDocCount
        );
        latch.countDown();
    }

Expected behavior
There should not be two parallel rounds of segment replication for the same shard. This can have unintended consequences on the replica state.

@Poojita-Raj (Contributor)

Looking into it.

@mch2 (Member) commented Jan 9, 2023

I think the assertion being made in this test is invalid. We don't care if a replica is syncing from the old primary; once the old primary shuts down, the replica will start syncing from the new primary.

I think we do actually have a race condition where a replica could receive a checkpoint when it hits POST_RECOVERY and processes it. It will then force a round of segrep which does not block for any ongoing replications. The validation that prevents duplicate rounds of replication needs to move outside of onNewCheckpoint and apply whenever segrep is started.
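
For illustration, a rough sketch of what such a shared guard could look like. The method shape, its signature, and the onGoingReplications field are assumptions (modeled on the ReplicationCollection used in the test above), not the actual SegmentReplicationTargetService code:

    // Hypothetical single entry point on the replica that starts a round of segrep. Because the
    // duplicate-round check lives here rather than only in onNewCheckpoint, it applies to both
    // checkpoint-triggered and forced rounds.
    public void startReplication(final IndexShard indexShard, final SegmentReplicationListener listener) {
        // onGoingReplications mirrors the ReplicationCollection<SegmentReplicationTarget> used in the test above.
        if (onGoingReplications.getOngoingReplicationTarget(indexShard.shardId()) != null) {
            // A round is already in flight for this shard; skip instead of starting a parallel one.
            logger.trace("Ignoring replication request for {}, a round is already in progress", indexShard.shardId());
            return;
        }
        // ... build the SegmentReplicationTarget and start the round as before ...
    }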

@Poojita-Raj (Contributor)

I think we do actually have a race condition where a replica could receive a checkpoint when it hits POST_RECOVERY and processes it.

This race condition could result in parallel replication events, which we want to avoid. The most straightforward way to prevent it is to skip any replication event triggered by a newly received checkpoint when the index shard state is not STARTED.
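
A minimal sketch of that check, assuming the checkpoint handler has access to the local replica shard (parameter names are illustrative, not the exact onNewCheckpoint signature):

    // Hypothetical guard at the top of the checkpoint handler: a shard that is still in POST_RECOVERY
    // (or any other non-STARTED state) ignores incoming checkpoints, so it cannot kick off a
    // replication round in parallel with one started during recovery.
    public void onNewCheckpoint(final ReplicationCheckpoint receivedCheckpoint, final IndexShard replicaShard) {
        if (replicaShard.state() != IndexShardState.STARTED) {
            logger.trace("Ignoring checkpoint {}, shard state is {}", receivedCheckpoint, replicaShard.state());
            return;
        }
        // ... existing checkpoint processing / startReplication logic ...
    }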
