
[BUG] Cluster fails to rebalance itself when skew_factor in NodeLoadAwareAllocationDecider is set to 0 #3497

Closed
imRishN opened this issue Jun 3, 2022 · 3 comments
Labels
bug (Something isn't working), pending backport (Identifies an issue or PR that still needs to be backported)

Comments


imRishN commented Jun 3, 2022

Describe the bug
After a partial outage in a cluster with a 3-availability-zone setup and skew_factor set to 0, we expect all shards to be assigned again once the cluster recovers, assuming the cluster has indices with n primaries and 1 replica. However, we transiently see that some shards remain unassigned even after the cluster has recovered and all nodes are back up.
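
For reference, the decider is driven by the settings below; a minimal sketch of applying them dynamically through the cluster settings API, in the same client style as the test further down (the values mirror this report, and the call itself is only illustrative, not taken from the fix):

        // Sketch only: apply the zone-awareness and load-awareness settings dynamically.
        // Values mirror this report: 3 zones, 5 nodes per zone, skew_factor = 0.
        client().admin().cluster().prepareUpdateSettings()
            .setPersistentSettings(Settings.builder()
                .put("cluster.routing.allocation.awareness.attributes", "zone")
                .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
                .put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
                .put("cluster.routing.allocation.load_awareness.provisioned_capacity", "15"))
            .get();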

To Reproduce
Steps to reproduce the behavior:

  1. Create a 3-availability-zone domain with 5 nodes per zone
  2. Set cluster.routing.allocation.awareness.force.zone.values to the 3 zone values
  3. Set cluster.routing.allocation.load_awareness.skew_factor to 0.0
  4. Create an index test-index-1 with 30 primaries and 1 replica
  5. Now stop 3 nodes in a particular zone
  6. Now create another index test-index-2 with 30 primaries and 1 replica
  7. Now start all the stopped nodes again
  8. We expect all shards from both indices to be assigned again with no unassigned shards, but we transiently see that 1 or 2 shards stay unassigned
  9. Wrote an integration test to reproduce this bug:
public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws Exception {
        int nodeCountPerAZ = 5;
        int numOfShards = 30;
        int numOfReplica = 1;
        Settings commonSettings = Settings.builder()
            .put("cluster.routing.allocation.awareness.attributes", "zone")
            .put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
            .put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
            .put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ*3))
            .build();

        logger.info("--> starting 15 nodes on zones 'a' & 'b' & 'c'");
        List<String> nodes_in_zone_a = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "a").build()
        );
        List<String> nodes_in_zone_b = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "b").build()
        );
        List<String> nodes_in_zone_c = internalCluster().startNodes(
            nodeCountPerAZ,
            Settings.builder().put(commonSettings).put("node.attr.zone", "c").build()
        );

        // Creating index with 30 primary and 1 replica
        createIndex("test-1", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, numOfShards)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, numOfReplica)
            .build());

        ClusterHealthResponse health = client().admin().cluster().prepareHealth()
            .setIndices("test-1")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForGreenStatus()
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        ClusterState clusterState = client().admin().cluster().prepareState().execute().actionGet().getState();
        ObjectIntHashMap<String> counts = new ObjectIntHashMap<>();

        for (IndexRoutingTable indexRoutingTable : clusterState.routingTable()) {
            for (IndexShardRoutingTable indexShardRoutingTable : indexRoutingTable) {
                for (ShardRouting shardRouting : indexShardRoutingTable) {
                    counts.addTo(clusterState.nodes().get(shardRouting.currentNodeId()).getName(), 1);
                }
            }
        }

        assertThat(counts.size(), equalTo(nodeCountPerAZ*3));
        // All shards should be started
        assertThat(clusterState.getRoutingNodes().shardsWithState(STARTED).size(), equalTo(numOfShards*(numOfReplica+1)));

        // stopping half nodes in zone a
        int nodesToStop = nodeCountPerAZ/2;
        List<Settings> nodeDataPathSettings = new ArrayList<>();
        for (int i = 0; i < nodesToStop; i++) {
            nodeDataPathSettings.add(internalCluster().dataPathSettings(nodes_in_zone_a.get(i)));
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodes_in_zone_a.get(i)));
        }

        client().admin().cluster().prepareReroute().setRetryFailed(true).get();
        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3 - nodesToStop))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        // Creating another index with 30 primary and 1 replica
        createIndex("test-2", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, numOfShards)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, numOfReplica)
            .build());

        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3 - nodesToStop))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        assertFalse(health.isTimedOut());

        // Restarting the nodes back
        for (int i = 0; i < nodesToStop; i++) {
            internalCluster().startNode(
                Settings.builder()
                    .put("node.name", nodes_in_zone_a.get(i))
                    .put(nodeDataPathSettings.get(i))
                    .put(commonSettings)
                    .put("node.attr.zone", "a")
                    .build()
            );
        }
        client().admin().cluster().prepareReroute().setRetryFailed(true).get();

        health = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNodes(Integer.toString(nodeCountPerAZ*3))
            .setWaitForGreenStatus()
            .setWaitForActiveShards(2 * numOfShards * (numOfReplica + 1))
            .setWaitForNoRelocatingShards(true)
            .setWaitForNoInitializingShards(true)
            .execute().actionGet();
        clusterState = client().admin().cluster().prepareState().execute().actionGet().getState();

        // All shards should be started now and cluster health should be green
        assertThat(clusterState.getRoutingNodes().shardsWithState(STARTED).size(), equalTo(2 * numOfShards * (numOfReplica + 1)));
        assertThat(health.isTimedOut(), equalTo(false));
    }
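
As a quick check while the bug reproduces, the unassigned-shard count can also be read directly off the health response. A minimal sketch in the same test style (the extra call and assertion are illustrative and not part of the original test):

        // Sketch only: log and assert the unassigned-shard count after the stopped nodes rejoin.
        ClusterHealthResponse recovered = client().admin().cluster().prepareHealth()
            .setIndices("test-1", "test-2")
            .setWaitForEvents(Priority.LANGUID)
            .execute().actionGet();
        logger.info("--> unassigned shards after recovery: {}", recovered.getUnassignedShards());
        assertThat(recovered.getUnassignedShards(), equalTo(0)); // transiently fails with skew_factor = 0.0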

Expected behavior
After the partial zonal failure recovers, i.e. all the configured nodes are up and running again, all shards should be assigned and the cluster should be green.


Host/Environment:

  • macOS Monterey
  • Version - 12.3.1


Bukhtawar (Collaborator) commented:

Resolved by #3563

Bukhtawar added the pending backport label Jun 14, 2022
Bukhtawar reopened this Jun 14, 2022
Bukhtawar (Collaborator) commented:

Re-opening to ensure we close on backport


imRishN commented Nov 4, 2022

Closing as this is backported to 1.x and 2.x.

imRishN closed this as completed Nov 4, 2022