Fix CheckTargetShardsCountStep (#78460) (#89176)
**The issue**

The flaky test simulates the following:
- Create a shrink policy with an invalid target shard count.
- Then change the policy to have a valid target shard count.
- Expectation: the `check-target-shards-count` step will return true and the shrink operation will be successful.

What was happening in some cases in the background (a minimal illustration follows this list):
- Create the shrink policy with an invalid target shard count.
- The `check-target-shards-count` step gets created and queued to be executed with the invalid target shard count.
- The task doesn't get enough priority to be executed.
- We change the policy to have a valid target shard count.
- We execute the queued task, which still has the outdated target shard count.
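To make the race concrete, here is a minimal, hypothetical Java sketch (not the actual ILM code): the queued task captures the target shard count when it is created, so a policy update that lands before execution is never seen.

```
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only, not Elasticsearch code.
public class StaleCaptureDemo {
    // Stands in for the target shard count stored in the policy / cluster state.
    private static final AtomicInteger targetShards = new AtomicInteger(5); // invalid for a 4-shard index

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queuedSteps = new ArrayBlockingQueue<>(1);

        // The step is queued while the policy still holds the invalid value: 5 is captured here.
        int capturedTargetShards = targetShards.get();
        queuedSteps.put(() -> System.out.println("checking target shards = " + capturedTargetShards));

        // The policy is fixed before the queued step ever runs...
        targetShards.set(2);

        // ...but the step still evaluates the stale value it captured: prints 5, not 2.
        queuedSteps.take().run();
    }
}
```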

**Proof**

We enriched the code with some extra logging to verify that the above scenario actually occurs:

```
## Adding the check-target-shards-count task to the executing tasks

[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] #78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before the condition has even been evaluated.

[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated

[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] #78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]
```

**Impact**

We do not think the impact is significant for production clusters, because many more cluster state updates happen there. However, it might inconvenience users who fix a policy and do not see the effect as soon as they could have.

**The fix**

Our proposed fix is to not provide the target shard count upon task creation, but to retrieve it from the cluster state when the step is executed. This way we ensure it always uses the newest value.
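As a rough illustration of the direction of the fix (a hedged sketch, not the actual `CheckTargetShardsCountStep` code), the step resolves the target shard count from the current cluster state when the condition is evaluated instead of keeping a value fixed at creation time:

```
// Hedged sketch of the idea; the names and the ClusterStateView interface are made up for illustration.
interface ClusterStateView {
    // Assumed lookup of the target shard count from the latest policy in the cluster state.
    int currentTargetShardCount(String index);
}

class CheckTargetShardsCountSketch {
    // Before the fix: the target shard count was captured when the step was created and could go stale.
    // After the fix: it is resolved from the cluster state at evaluation time.
    boolean isConditionMet(String index, ClusterStateView clusterState, int sourceShardCount) {
        int targetShards = clusterState.currentTargetShardCount(index);
        // Shrinking requires the target shard count to be a factor of the source index's shard count.
        return sourceShardCount % targetShards == 0;
    }
}
```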

**Future work**

Currently, on every cluster state update we go through all the indices and check whether any step needs to be executed. This doesn't scale well. We would like to switch to a more efficient model, potentially based on a cluster state observer (see the hedged sketch below). An issue will be created soon.
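For context, one possible shape of such a model, sketched here under the assumption that `ClusterStateObserver` with a state predicate would be the building block (this is not part of this change):

```
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.core.TimeValue;

// Hedged sketch only: react to the next relevant cluster state instead of
// re-scanning every index on every cluster state update.
class ObserverBasedStepTrigger {
    void waitForRelevantChange(ClusterService clusterService, Logger logger, ThreadContext threadContext) {
        ClusterStateObserver observer = new ClusterStateObserver(
            clusterService,
            TimeValue.timeValueMinutes(5),
            logger,
            threadContext
        );
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                // Re-evaluate the queued ILM step against the fresh state here.
            }

            @Override
            public void onClusterServiceClose() {}

            @Override
            public void onTimeout(TimeValue timeout) {
                // Give up or reschedule.
            }
        }, state -> isRelevantForPendingSteps(state)); // hypothetical predicate
    }

    private boolean isRelevantForPendingSteps(ClusterState state) {
        return true; // placeholder; a real implementation would inspect only the relevant metadata
    }
}
```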

Resolves #78460
gmarouli authored Aug 25, 2022
1 parent 16e4cb1 commit 862c885
Showing 2 changed files with 36 additions and 3 deletions.
@@ -16,6 +16,7 @@
import org.elasticsearch.client.RestClient;
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.metadata.Template;
import org.elasticsearch.cluster.routing.allocation.DataTier;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentHelper;
@@ -47,6 +48,7 @@
import java.util.concurrent.TimeUnit;

import static java.util.Collections.singletonMap;
import static org.elasticsearch.cluster.metadata.IndexMetadata.SETTING_NUMBER_OF_SHARDS;
import static org.elasticsearch.test.ESTestCase.randomAlphaOfLengthBetween;
import static org.elasticsearch.test.ESTestCase.randomBoolean;
import static org.elasticsearch.test.ESTestCase.waitUntil;
@@ -396,6 +398,25 @@ public static String getSnapshotState(RestClient client, String snapshot) throws
return (String) snapResponse.get("state");
}

/**
* This method waits to get the shrunk index name and, if that fails, it triggers one extra cluster state update and tries again.
* The motivation behind this method is that in ShrinkAction there are cluster state dependent steps, for example
* {@link org.elasticsearch.xpack.core.ilm.CheckTargetShardsCountStep}, that might miss the latest policy update if they are
* already queued, see {@link org.elasticsearch.xpack.ilm.IndexLifecycleRunner#submitUnlessAlreadyQueued}. In the real world
* there is usually another cluster state update coming, but in a test there is not; that is what this method simulates.
*/
@Nullable
public static String waitAndGetShrinkIndexNameWithExtraClusterStateChange(RestClient client, String originalIndex)
throws InterruptedException, IOException {
String shrunkenIndexName = waitAndGetShrinkIndexName(client, originalIndex);
if (shrunkenIndexName == null) {
logger.info("Executing dummy cluster update to re-trigger a cluster state dependent step.");
executeDummyClusterStateUpdate(client);
shrunkenIndexName = waitAndGetShrinkIndexName(client, originalIndex);
}
return shrunkenIndexName;
}

@SuppressWarnings("unchecked")
@Nullable
public static String waitAndGetShrinkIndexName(RestClient client, String originalIndex) throws InterruptedException {
@@ -443,6 +464,17 @@ public static String waitAndGetShrinkIndexName(RestClient client, String origina
return shrunkenIndexName[0];
}

private static void executeDummyClusterStateUpdate(RestClient client) throws IOException {
createIndexWithSettings(
client,
"dummy-index",
Settings.builder()
.put(SETTING_NUMBER_OF_SHARDS, 1)
.put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
.putNull(DataTier.TIER_PREFERENCE)
);
}

public static Template getTemplate(String policyName) {
return new Template(getLifecycleSettings(policyName), null, null);
}
@@ -9,7 +9,6 @@

import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.lucene.tests.util.LuceneTestCase.AwaitsFix;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.cluster.metadata.IndexMetadata;
@@ -53,7 +52,9 @@
import static org.elasticsearch.xpack.TimeSeriesRestDriver.indexDocument;
import static org.elasticsearch.xpack.TimeSeriesRestDriver.updatePolicy;
import static org.elasticsearch.xpack.TimeSeriesRestDriver.waitAndGetShrinkIndexName;
import static org.elasticsearch.xpack.TimeSeriesRestDriver.waitAndGetShrinkIndexNameWithExtraClusterStateChange;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.notNullValue;
import static org.hamcrest.Matchers.nullValue;

public class ShrinkActionIT extends ESRestTestCase {
@@ -321,7 +322,6 @@ public void testSetSingleNodeAllocationRetriesUntilItSucceeds() throws Exception
assertBusy(() -> assertThat(getStepKeyForIndex(client(), shrunkenIndex), equalTo(PhaseCompleteStep.finalStep("warm").getKey())));
}

@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/78460")
public void testAutomaticRetryFailedShrinkAction() throws Exception {
int numShards = 4;
int divisor = randomFrom(2, 4);
@@ -348,7 +348,8 @@ public void testAutomaticRetryFailedShrinkAction() throws Exception {
updatePolicy(client(), index, policy);

// assert corrected policy is picked up and index is shrunken
String shrunkenIndex = waitAndGetShrinkIndexName(client(), index);
String shrunkenIndex = waitAndGetShrinkIndexNameWithExtraClusterStateChange(client(), index);
assertThat(shrunkenIndex, notNullValue());
assertBusy(() -> assertTrue(indexExists(shrunkenIndex)), 30, TimeUnit.SECONDS);
assertBusy(() -> assertTrue(aliasExists(shrunkenIndex, index)));
assertBusy(() -> assertThat(getStepKeyForIndex(client(), shrunkenIndex), equalTo(PhaseCompleteStep.finalStep("warm").getKey())));
