[ML] Speed up persistent task rechecks in ML failover tests (#43291)
The ML failover tests sometimes need to wait for jobs to be
assigned to new nodes following a node failure.  They wait
10 seconds for this to happen.  However, if the node that
failed was the master node and a new master was elected, then
these 10 seconds might not be long enough, because a refresh of
the memory stats will delay job assignment.  Once the memory
refresh completes, the persistent task will be assigned when
the next cluster state update occurs or after the periodic
recheck interval, which defaults to 30 seconds.  Rather than
increase the length of the wait for assignment to 31 seconds,
this change decreases the periodic recheck interval to 1
second.

Fixes #43289
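
The timing argument in the message above can be checked with a small back-of-the-envelope sketch. It is a simplified model, not code from this commit: it assumes no cluster state update arrives after the memory refresh, so the worst-case delay is the refresh duration plus one full recheck interval. The class name `RecheckTiming`, its methods, and the 2 second refresh duration are hypothetical; only the 10 second wait and the 30 second and 1 second intervals come from the commit message.

```java
import java.time.Duration;

final class RecheckTiming {

    // Worst-case delay before assignment under the simplified model:
    // the memory refresh must finish, then one full recheck interval may elapse.
    static Duration worstCaseAssignmentDelay(Duration memoryRefresh, Duration recheckInterval) {
        return memoryRefresh.plus(recheckInterval);
    }

    // True if the worst-case delay still fits inside the test's wait.
    static boolean fitsInWait(Duration memoryRefresh, Duration recheckInterval, Duration wait) {
        return worstCaseAssignmentDelay(memoryRefresh, recheckInterval).compareTo(wait) <= 0;
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofSeconds(10);   // wait used by the failover tests
        Duration refresh = Duration.ofSeconds(2); // hypothetical refresh duration

        // Default 30 second recheck interval: 2 + 30 = 32 seconds, longer than the wait.
        System.out.println(fitsInWait(refresh, Duration.ofSeconds(30), wait));
        // Recheck interval lowered to 1 second: 2 + 1 = 3 seconds, within the wait.
        System.out.println(fitsInWait(refresh, Duration.ofSeconds(1), wait));
    }
}
```

Under these assumptions, lowering the interval keeps the existing 10 second wait sufficient for any refresh of up to 9 seconds, whereas lengthening the wait would slow every test that hits this path.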
droberts195 committed Jun 18, 2019
1 parent 5d3cae4 commit 1fad4e1
Showing 1 changed file with 12 additions and 0 deletions.
@@ -23,6 +23,7 @@
import org.elasticsearch.index.reindex.ReindexPlugin;
import org.elasticsearch.indices.recovery.RecoveryState;
import org.elasticsearch.license.LicenseService;
import org.elasticsearch.persistent.PersistentTasksClusterService;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.discovery.TestZenDiscovery;
@@ -351,6 +352,17 @@ public static void deleteAllJobs(Logger logger, Client client) throws Exception
    }

    protected String awaitJobOpenedAndAssigned(String jobId, String queryNode) throws Exception {

        PersistentTasksClusterService persistentTasksClusterService =
            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
        // Speed up rechecks to a rate that is quicker than what settings would allow.
        // The check would work eventually without doing this, but the assertBusy() below
        // would need to wait 30 seconds, which would make the test run very slowly.
        // The 1 second refresh puts a greater burden on the master node to recheck
        // persistent tasks, but it will cope in these tests as it's not doing much
        // else.
        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueSeconds(1));

        AtomicReference<String> jobNode = new AtomicReference<>();
        assertBusy(() -> {
            GetJobsStatsAction.Response statsResponse =