
[BUG] Cluster fails to rebalance itself when skew_factor in NodeLoadAwareAllocationDecider is set to 0 #3497

Closed
imRishN opened this issue Jun 3, 2022 · 3 comments
Labels
bug (Something isn't working), pending backport (Identifies an issue or PR that still needs to be backported)

Comments


imRishN commented Jun 3, 2022

Describe the bug
After a partial outage in a cluster with a 3-availability-zone setup and skew_factor set to 0, we expect all shards to be assigned again once the cluster recovers, assuming the cluster has indices with n primaries and 1 replica. However, we transiently see that some shards remain unassigned even after the cluster has recovered and all nodes are back up.
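
For reference, the decider is driven by the settings below; a minimal sketch of applying them dynamically through the cluster settings API, in the same client style as the test further down (the values mirror this report, and the call itself is only illustrative, not taken from the fix):

        // Sketch only: apply the zone-awareness and load-awareness settings dynamically.
        // Values mirror this report: 3 zones, 5 nodes per zone, skew_factor = 0.
        client().admin().cluster().prepareUpdateSettings()
            .setPersistentSettings(Settings.builder()
                .put("cluster.routing.allocation.awareness.attributes", "zone")
                .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
                .put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
                .put("cluster.routing.allocation.load_awareness.provisioned_capacity", "15"))
            .get();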

To Reproduce
Steps to reproduce the behavior:

  1. Create a 3-availability-zone domain with 5 nodes per zone
  2. Set cluster.routing.allocation.awareness.force.zone.values to the 3 zone values
  3. Set cluster.routing.allocation.load_awareness.skew_factor to 0.0
  4. Create an index test-index-1 with 30 primaries and 1 replica
  5. Now stop 3 nodes in a particular zone
  6. Now create another index test-index-2 with 30 primaries and 1 replica
  7. Now start all the stopped nodes again
  8. We expect all shards from both indices to be assigned again with no unassigned shards, but we transiently see that 1 or 2 shards stay unassigned
  9. Wrote an integration test to reproduce this bug:
public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws Exception {
        int nodeCountPerAZ = 5;
        int numOfShards = 30;
        int numOfReplica = 1;
        Settings commonSettings = Settings.builder()
            .put("cluster.routing.allocation.awareness.attributes", "zone")
            .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
            .put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
            .put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ*3))
            .build();

        logger.info("--> starting 15 nodes on zones 'a' & 'b' & 'c'");
        List<String> nodes_in_zone_a = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "a").build()
        );
        List<String> nodes_in_zone_b = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "b").build()
        );
        List<String> nodes_in_zone_c = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "c").build()
        );

        // Creating index with 30 primary and 1 replica
        createIndex("test-1", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, numOfShards)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, numOfReplica)
            .build());

        ClusterHealthResponse health = client().admin().cluster().prepareHealth()
            .setIndices("test-1")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForGreenStatus()
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        ClusterState clusterState = client().admin().cluster().prepareState().execute().actionGet().getState();
        ObjectIntHashMap<String> counts = new ObjectIntHashMap<>();

        for (IndexRoutingTable indexRoutingTable : clusterState.routingTable()) {
            for (IndexShardRoutingTable indexShardRoutingTable : indexRoutingTable) {
                for (ShardRouting shardRouting : indexShardRoutingTable) {
                    counts.addTo(clusterState.nodes().get(shardRouting.currentNodeId()).getName(), 1);
                }
            }
        }

        assertThat(counts.size(), equalTo(nodeCountPerAZ*3));
        // All shards should be started
        assertThat(clusterState.getRoutingNodes().shardsWithState(STARTED).size(), equalTo(numOfShards*(numOfReplica+1)));

        // stopping half nodes in zone a
        int nodesToStop = nodeCountPerAZ/2;
        List<Settings> nodeDataPathSettings = new ArrayList<>();
        for (int i = 0; i < nodesToStop; i++) {
            nodeDataPathSettings.add(internalCluster().dataPathSettings(nodes_in_zone_a.get(i)));
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodes_in_zone_a.get(i)));
        }

        client().admin().cluster().prepareReroute().setRetryFailed(true).get();
        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3 - nodesToStop))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        // Creating another index with 30 primary and 1 replica
        createIndex("test-2", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, numOfShards)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, numOfReplica)
            .build());

        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3 - nodesToStop))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        // Restarting the nodes back
        for (int i = 0; i < nodesToStop; i++) {
            internalCluster().startNode(
                Settings.builder()
                    .put("node.name", nodes_in_zone_a.get(i))
                    .put(nodeDataPathSettings.get(i))
                    .put(commonSettings)
                    .put("node.attr.zone", "a")
                    .build()
            );
        }
        client().admin().cluster().prepareReroute().setRetryFailed(true).get();

        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3))
            .setWaitForGreenStatus()
            .setWaitForActiveShards(2 * numOfShards * (numOfReplica + 1))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        clusterState = client().admin().cluster().prepareState().execute().actionGet().getState();

        // All shards should be started now and cluster health should be green
        assertThat(clusterState.getRoutingNodes().shardsWithState(STARTED).size(), equalTo(2 * numOfShards * (numOfReplica + 1)));
        assertThat(health.isTimedOut(), equalTo(false));
    }
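
As a quick check while the bug reproduces, the unassigned-shard count can also be read directly off the health response. A minimal sketch in the same test style (the extra call and assertion are illustrative and not part of the original test):

        // Sketch only: log and assert the unassigned-shard count after the stopped nodes rejoin.
        ClusterHealthResponse recovered = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .execute().actionGet();
        logger.info("--> unassigned shards after recovery: {}", recovered.getUnassignedShards());
        assertThat(recovered.getUnassignedShards(), equalTo(0)); // transiently fails with skew_factor = 0.0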

Expected behavior
After the partial zonal failure recovers, i.e. all the configured nodes are up and running again, all shards should be assigned and the cluster should be green.


Host/Environment:

  • macOS Monterey
  • Version - 12.3.1


Bukhtawar (Collaborator) commented:

Resolved by #3563

Bukhtawar added the pending backport label Jun 14, 2022
Bukhtawar reopened this Jun 14, 2022
Bukhtawar (Collaborator) commented:

Re-opening to ensure we close on backport


imRishN commented Nov 4, 2022

Closing as this is backported to 1.x and 2.x.

imRishN closed this as completed Nov 4, 2022