
[CI] Failure of {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} #71646

Closed
droberts195 opened this issue Apr 13, 2021 · 5 comments · Fixed by #71656
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/7fst6frfszn2q

Repro line:

./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=F8B7E3049DC90A4A -Dtests.security.manager=true -Dtests.locale=ar-IQ -Dtests.timezone=Africa/Accra -Druntime.java=11 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

Reproduces locally?:

No

Applicable branches:

master

Failure history:

Just once for this particular failure - it's a rare side effect of #71552 which I merged today.

Failure excerpt:

  1> [2021-04-13T15:28:53,679][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] before test
  1> [2021-04-13T15:29:55,671][WARN ][o.e.c.RestClient         ] [test] request [DELETE http://[::1]:40919/*,-.ds-ilm-history-*?expand_wildcards=open%2Cclosed%2Chidden] returned 1 warnings: [299 Elasticsearch-8.0.0-SNAPSHOT-46efa6ad04e2ab98044f9d99cca39f7f8ba36ffe "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"]
  1> [2021-04-13T15:29:55,867][INFO ][o.e.x.t.r.XPackRestIT    ] [test] There are still tasks running after this test that might break subsequent tests [cluster:admin/features/reset, cluster:admin/xpack/ml/job/close, xpack/ml/job[c]].
  1> [2021-04-13T15:29:55,867][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] after test
  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=F8B7E3049DC90A4A -Dtests.security.manager=true -Dtests.locale=ar-IQ -Dtests.timezone=Africa/Accra -Druntime.java=11 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*
  2> java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([F8B7E3049DC90A4A:70E3DCDE333567B2]:0)
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:869)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:283)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
        at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.clearMlMetadata(MlRestTestStateCleaner.java:36)
        at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.clearMlState(AbstractXPackRestTest.java:222)
        at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.cleanup(AbstractXPackRestTest.java:206)
        at jdk.internal.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
        at java.base/java.lang.Thread.run(Thread.java:834)

        Caused by:
        java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]
            at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            ... 1 more

Also, from the server-side logs:

[2021-04-13T15:28:54,816][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60062] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:54,857][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:28:55,028][INFO ][o.e.x.m.d.DatafeedJob    ] [yamlRestTest-0] [set-upgrade-mode-job] Datafeed started (from: 1970-01-01T00:00:00.000Z to: real-time) with frequency [600000ms]
[2021-04-13T15:28:55,041][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [true] from [false]
[2021-04-13T15:28:55,041][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-04-13T15:28:55,059][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Flushing job set-upgrade-mode-job
[2021-04-13T15:28:55,078][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-04-13T15:28:55,078][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Enabling upgrade mode, must isolate datafeeds
[2021-04-13T15:28:55,078][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolating datafeeds: [set-upgrade-mode-job-datafeed]
[2021-04-13T15:28:55,092][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolated the datafeeds
[2021-04-13T15:28:55,092][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Un-assigning persistent tasks : [ datafeed-set-upgrade-mode-job-datafeed, job-set-upgrade-mode-job ]
[2021-04-13T15:28:55,117][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task cannot be assigned while upgrade mode is enabled.]
[2021-04-13T15:28:55,123][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] [33]
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] stopping datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job], acquired [true]...
[2021-04-13T15:28:55,127][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job] has been stopped
[2021-04-13T15:28:55,177][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [false]; reason = [task has been removed, cancelling locally]
[2021-04-13T15:28:55,177][INFO ][o.e.x.m.j.p.a.ProcessContext] [yamlRestTest-0] Killing job [set-upgrade-mode-job], because [task has been removed, cancelling locally]
[2021-04-13T15:28:55,179][INFO ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] 0 buckets parsed from autodetect output
[2021-04-13T15:28:55,191][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [controller/21903] [CDetachedProcessSpawner.cc@176] Child process with PID 60062 was terminated by signal 15
[2021-04-13T15:28:55,281][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Waited for tasks to be unassigned
[2021-04-13T15:28:55,281][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,337][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true]
[2021-04-13T15:28:55,338][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-04-13T15:28:55,364][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-04-13T15:28:55,364][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-04-13T15:28:55,385][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,395][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; reverting to current snapshot
[2021-04-13T15:28:55,402][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] Reverting to snapshot 'empty'
[2021-04-13T15:28:55,443][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Upgrade mode noop
[2021-04-13T15:28:55,468][INFO ][o.e.c.m.MetadataMappingService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] update_mapping [_doc]
[2021-04-13T15:28:55,504][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
...
[2021-04-13T15:28:55,643][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening annotations after reverting model: deleting annotations after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,646][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-04-13T15:28:55,660][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-04-13T15:28:55,661][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-04-13T15:28:55,694][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-04-13T15:28:55,695][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-04-13T15:28:55,748][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-04-13T15:28:55,749][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing
[2021-04-13T15:28:56,357][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60227] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/qLJ6HGeBTNitPnpB6IBhzw] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/FYdpPMdyQwCZi8gdTF1pcQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000001/5SIcjCfjSJOJUjQI9lXOqQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/M-RwVbuvQH6CblvPnSlU1A] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/rCAN-r-ZQkukuOJuRPatUA] deleting index

When we close immediately after killing the process, it seems that we might report the close as successful while the persistent task still exists.

@droberts195 droberts195 added >test-failure Triaged test failures from CI :ml Machine learning labels Apr 13, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member

I see the race condition 🤦

take a look:

[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,395][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; reverting to current snapshot
[2021-04-13T15:28:55,402][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] Reverting to snapshot 'empty'
[2021-04-13T15:28:55,443][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Upgrade mode noop
[2021-04-13T15:28:55,468][INFO ][o.e.c.m.MetadataMappingService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] update_mapping [_doc]
[2021-04-13T15:28:55,504][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-04-13T15:28:55,636][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening records after reverting model: deleting results after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,635][WARN ][o.e.x.t.a.TransportStopTransformAction] [yamlRestTest-0] Could not refresh state, state information might be outdated
<snip>...</snip>
[2021-04-13T15:28:55,643][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening annotations after reverting model: deleting annotations after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,646][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-04-13T15:28:55,660][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-04-13T15:28:55,661][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-04-13T15:28:55,694][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-04-13T15:28:55,695][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-04-13T15:28:55,748][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-04-13T15:28:55,749][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing
[2021-04-13T15:28:56,357][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60227] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/qLJ6HGeBTNitPnpB6IBhzw] deleting index

We attempt to kill the job while it is awaiting assignment, THEN we call close, which reports that the job has already been closed or is closing.

But still, see line

[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]

The task STILL gets assigned and opens 🤦. So, if the task was only recently assigned (while reverting the snapshot) and reset mode is enabled, it ignores the stop/kill requests and finishes opening...
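As an illustration only (this is not the real Elasticsearch code — the class, the `communicator` field, and the interleaving mechanics here are hypothetical simplifications), the race described above can be modeled as: a kill that arrives while the process is still starting finds nothing to terminate, leaves no marker behind, and the start path then happily finishes and sets the job state to opened.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the pre-fix race: kill() before the
// communicator exists is a no-op, so the start path still ends OPENED.
public class KillOpenRace {
    enum State { OPENING, OPENED, CLOSING, CLOSED }

    static class JobTask {
        final AtomicReference<State> state = new AtomicReference<>(State.OPENING);
        volatile Object communicator; // created only late in the start path

        void kill() {
            // Simplified pre-fix behaviour: nothing to terminate yet, so the
            // kill silently does nothing and leaves no marker for the start
            // path to observe.
            if (communicator != null) {
                communicator = null;
                state.set(State.CLOSED);
            }
        }

        void finishStarting() {
            // The start path knows nothing of the earlier kill and proceeds.
            communicator = new Object();
            state.set(State.OPENED);
        }
    }

    static State simulateRace() {
        JobTask task = new JobTask();
        CountDownLatch killDone = new CountDownLatch(1);
        Thread killer = new Thread(() -> { task.kill(); killDone.countDown(); });
        killer.start();
        try {
            killDone.await();      // force the bad interleaving: kill first...
            task.finishStarting(); // ...then the start path completes anyway
            killer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return task.state.get();
    }

    public static void main(String[] args) {
        System.out.println("final state: " + simulateRace()); // prints "final state: OPENED"
    }
}
```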

@droberts195
Contributor Author

> are these all related?

When ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled fails during cleanup it sometimes makes the next few tests fail too, because the leftover task is still running during those tests' cleanup. If the next test is an ML test it will likely clean up successfully; if it's not an ML test it won't, because it won't be trying to clean up ML jobs.

The characteristics to look for to say whether a test failure is collateral damage of ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled are:

  1. The test failed shortly after ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled
  2. The failure reason was 3 leftover tasks including feature reset and an ML job, e.g.:
java.lang.AssertionError: 3 active tasks found:
xpack/ml/job[c]                  ocDmIN43QwizJeOU63iCKA:113985 cluster:158                   persistent 1618341176133 19:12:56 2.1m        127.0.0.1 yamlRestTest-0 job-set-upgrade-mode-job
cluster:admin/features/reset     ocDmIN43QwizJeOU63iCKA:114037 -                             transport  1618341176156 19:12:56 2.1m        127.0.0.1 yamlRestTest-0
cluster:admin/xpack/ml/job/close ocDmIN43QwizJeOU63iCKA:114110 -                             transport  1618341176288 19:12:56 2.1m        127.0.0.1 yamlRestTest-0
expected:<0> but was:<3>

@probakowski
Contributor

Are those the same failure? There are a lot of similar failures on Windows:
https://gradle-enterprise.elastic.co/s/akzs7pnfk7auk
https://gradle-enterprise.elastic.co/s/2rkbl6osae4d2

benwtrent added a commit that referenced this issue Apr 14, 2021
If a machine learning job is killed while it is attempting to open, there is a race condition that may cause it to not close.

This is most evident during the `reset_feature` API call. The reset feature API will kill the jobs, then call close quickly to wait for the persistent tasks to complete. 

But, if this is called while a job is attempting to be assigned to a node, there is a window where the process continues to start even though we attempted to kill and close it.

This commit locks the process context on `kill`, and sets the job to `closing`. This way if the process context is already locked (to start), we won't try to kill it until it is fully started.

Setting the job to `closing` allows the starting process to exit early if the `kill` command has already been completed (before the communicator was created).

closes #71646
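The fix described in the commit message can be sketched as follows (this is a hypothetical simplification, not the real `ProcessContext` from Elasticsearch): kill takes the same lock the start path holds, so it can no longer run in the middle of a start, and it records a CLOSING marker that lets a start which has not yet created its communicator exit early instead of finishing the open.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the fix: lock the process context on kill and mark
// the job CLOSING so a not-yet-complete start bails out instead of opening.
public class KillClosesJob {
    enum State { OPENING, OPENED, CLOSING, CLOSED }

    static class ProcessContext {
        private final ReentrantLock lock = new ReentrantLock();
        private volatile State state = State.OPENING;
        private volatile Object communicator;

        void kill() {
            lock.lock(); // waits for any in-flight start step to release the lock
            try {
                state = State.CLOSING; // marker the start path checks
                if (communicator != null) {
                    communicator = null;
                    state = State.CLOSED;
                }
            } finally {
                lock.unlock();
            }
        }

        void finishStarting() {
            lock.lock();
            try {
                if (state == State.CLOSING || state == State.CLOSED) {
                    state = State.CLOSED; // kill already happened: exit early
                    return;
                }
                communicator = new Object();
                state = State.OPENED;
            } finally {
                lock.unlock();
            }
        }

        State state() { return state; }
    }

    public static void main(String[] args) {
        // Same interleaving as the failing test: kill before start completes.
        ProcessContext ctx = new ProcessContext();
        ctx.kill();
        ctx.finishStarting();
        System.out.println("final state: " + ctx.state()); // prints "final state: CLOSED"
    }
}
```

With the marker in place, the order of kill and start no longer matters: either the kill sees a live communicator and terminates it, or the start sees CLOSING and never opens.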
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 15, 2021

closes elastic#71646
benwtrent added a commit that referenced this issue Apr 15, 2021

closes #71646