[ML] Speed up persistent task rechecks in ML failover tests (#43291)

The ML failover tests sometimes need to wait for jobs to be assigned to new nodes following a node failure. They wait 10 seconds for this to happen. However, if the node that failed was the master node and a new master was elected, 10 seconds might not be long enough, because a refresh of the memory stats delays job assignment. Once the memory refresh completes, the persistent task is assigned either when the next cluster state update occurs or when the periodic recheck next runs. The recheck interval defaults to 30 seconds.

Rather than increase the wait for assignment to 31 seconds, this change decreases the periodic recheck interval to 1 second.

Fixes #43289
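The timing argument above can be sketched as a small standalone calculation. This is only an illustration and uses no Elasticsearch code: the class `RecheckTiming`, its methods, and the 2-second memory refresh duration are hypothetical. It assumes the worst case, in which no cluster state update arrives after the refresh, so assignment has to wait one full recheck interval.

```java
import java.time.Duration;

public class RecheckTiming {

    // Worst case assumed here: the memory refresh finishes just after a periodic
    // recheck has run and no cluster state update follows, so the task waits a
    // full recheck interval on top of the refresh time.
    static Duration worstCaseAssignmentDelay(Duration memoryRefresh, Duration recheckInterval) {
        return memoryRefresh.plus(recheckInterval);
    }

    // True if the worst-case delay still fits inside the test's wait.
    static boolean fitsInWait(Duration wait, Duration memoryRefresh, Duration recheckInterval) {
        return worstCaseAssignmentDelay(memoryRefresh, recheckInterval).compareTo(wait) <= 0;
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofSeconds(10);    // wait used by the failover tests
        Duration refresh = Duration.ofSeconds(2);  // hypothetical refresh duration

        // Default 30 second recheck interval: 2s + 30s exceeds the 10s wait.
        System.out.println(fitsInWait(wait, refresh, Duration.ofSeconds(30))); // false

        // 1 second recheck interval from this change: 2s + 1s fits in the 10s wait.
        System.out.println(fitsInWait(wait, refresh, Duration.ofSeconds(1)));  // true
    }
}
```

Under these assumptions, shortening the interval keeps the existing 10 second wait sufficient, whereas keeping the default would require lengthening the wait beyond the interval.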
elastic · Jun 18, 2019 · 1fad4e1
1 parent 5d3cae4
commit 1fad4e1
Showing 1 changed file with 12 additions and 0 deletions.
diff --git a/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/support/BaseMlIntegTestCase.java b/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/support/BaseMlIntegTestCase.java
@@ -23,6 +23,7 @@
 import org.elasticsearch.index.reindex.ReindexPlugin;
 import org.elasticsearch.indices.recovery.RecoveryState;
 import org.elasticsearch.license.LicenseService;
+import org.elasticsearch.persistent.PersistentTasksClusterService;
 import org.elasticsearch.plugins.Plugin;
 import org.elasticsearch.test.ESIntegTestCase;
 import org.elasticsearch.test.discovery.TestZenDiscovery;
@@ -351,6 +352,17 @@ public static void deleteAllJobs(Logger logger, Client client) throws Exception
     }
 
     protected String awaitJobOpenedAndAssigned(String jobId, String queryNode) throws Exception {
+
+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow.
+        // The check would work eventually without doing this, but the assertBusy() below
+        // would need to wait 30 seconds, which would make the test run very slowly.
+        // The 1 second refresh puts a greater burden on the master node to recheck
+        // persistent tasks, but it will cope in these tests as it's not doing much
+        // else.
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueSeconds(1));
+
         AtomicReference<String> jobNode = new AtomicReference<>();
         assertBusy(() -> {
             GetJobsStatsAction.Response statsResponse =