
[CI] Failure of {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} #71646

Closed
droberts195 opened this issue Apr 13, 2021 · 5 comments · Fixed by #71656
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/7fst6frfszn2q

Repro line:

./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=F8B7E3049DC90A4A -Dtests.security.manager=true -Dtests.locale=ar-IQ -Dtests.timezone=Africa/Accra -Druntime.java=11 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

Reproduces locally?:

No

Applicable branches:

master

Failure history:

Just once for this particular failure - it's a rare side effect of #71552 which I merged today.

Failure excerpt:

  1> [2021-04-13T15:28:53,679][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] before test
  1> [2021-04-13T15:29:55,671][WARN ][o.e.c.RestClient         ] [test] request [DELETE http://[::1]:40919/*,-.ds-ilm-history-*?expand_wildcards=open%2Cclosed%2Chidden] returned 1 warnings: [299 Elasticsearch-8.0.0-SNAPSHOT-46efa6ad04e2ab98044f9d99cca39f7f8ba36ffe "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"]
  1> [2021-04-13T15:29:55,867][INFO ][o.e.x.t.r.XPackRestIT    ] [test] There are still tasks running after this test that might break subsequent tests [cluster:admin/features/reset, cluster:admin/xpack/ml/job/close, xpack/ml/job[c]].
  1> [2021-04-13T15:29:55,867][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] after test
  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=F8B7E3049DC90A4A -Dtests.security.manager=true -Dtests.locale=ar-IQ -Dtests.timezone=Africa/Accra -Druntime.java=11 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*
  2> java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([F8B7E3049DC90A4A:70E3DCDE333567B2]:0)
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:869)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:283)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
        at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.clearMlMetadata(MlRestTestStateCleaner.java:36)
        at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.clearMlState(AbstractXPackRestTest.java:222)
        at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.cleanup(AbstractXPackRestTest.java:206)
        at jdk.internal.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
        at java.base/java.lang.Thread.run(Thread.java:834)

        Caused by:
        java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]
            at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            ... 1 more

Also, from the server-side logs:

[2021-04-13T15:28:54,816][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60062] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:54,857][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:28:55,028][INFO ][o.e.x.m.d.DatafeedJob    ] [yamlRestTest-0] [set-upgrade-mode-job] Datafeed started (from: 1970-01-01T00:00:00.000Z to: real-time) with frequency [600000ms]
[2021-04-13T15:28:55,041][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [true] from [false]
[2021-04-13T15:28:55,041][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-04-13T15:28:55,059][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Flushing job set-upgrade-mode-job
[2021-04-13T15:28:55,078][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-04-13T15:28:55,078][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Enabling upgrade mode, must isolate datafeeds
[2021-04-13T15:28:55,078][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolating datafeeds: [set-upgrade-mode-job-datafeed]
[2021-04-13T15:28:55,092][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolated the datafeeds
[2021-04-13T15:28:55,092][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Un-assigning persistent tasks : [ datafeed-set-upgrade-mode-job-datafeed, job-set-upgrade-mode-job ]
[2021-04-13T15:28:55,117][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task cannot be assigned while upgrade mode is enabled.]
[2021-04-13T15:28:55,123][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] [33]
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...
[2021-04-13T15:28:55,124][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] stopping datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job], acquired [true]...
[2021-04-13T15:28:55,127][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job] has been stopped
[2021-04-13T15:28:55,177][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [false]; reason = [task has been removed, cancelling locally]
[2021-04-13T15:28:55,177][INFO ][o.e.x.m.j.p.a.ProcessContext] [yamlRestTest-0] Killing job [set-upgrade-mode-job], because [task has been removed, cancelling locally]
[2021-04-13T15:28:55,179][INFO ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] 0 buckets parsed from autodetect output
[2021-04-13T15:28:55,191][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [controller/21903] [CDetachedProcessSpawner.cc@176] Child process with PID 60062 was terminated by signal 15
[2021-04-13T15:28:55,281][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Waited for tasks to be unassigned
[2021-04-13T15:28:55,281][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,337][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true]
[2021-04-13T15:28:55,338][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-04-13T15:28:55,364][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-04-13T15:28:55,364][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-04-13T15:28:55,385][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,395][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; reverting to current snapshot
[2021-04-13T15:28:55,402][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] Reverting to snapshot 'empty'
[2021-04-13T15:28:55,443][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Upgrade mode noop
[2021-04-13T15:28:55,468][INFO ][o.e.c.m.MetadataMappingService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] update_mapping [_doc]
[2021-04-13T15:28:55,504][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
...
[2021-04-13T15:28:55,643][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening annotations after reverting model: deleting annotations after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,646][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-04-13T15:28:55,660][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-04-13T15:28:55,661][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-04-13T15:28:55,694][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-04-13T15:28:55,695][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-04-13T15:28:55,748][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-04-13T15:28:55,749][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing
[2021-04-13T15:28:56,357][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60227] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/qLJ6HGeBTNitPnpB6IBhzw] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/FYdpPMdyQwCZi8gdTF1pcQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000001/5SIcjCfjSJOJUjQI9lXOqQ] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/M-RwVbuvQH6CblvPnSlU1A] deleting index
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/rCAN-r-ZQkukuOJuRPatUA] deleting index

When we close immediately after killing the process, it seems that we might report the close as successful while the persistent task still exists.

@droberts195 droberts195 added >test-failure Triaged test failures from CI :ml Machine learning labels Apr 13, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member

I see the race condition 🤦

take a look:

[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-04-13T15:28:55,390][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-04-13T15:28:55,395][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; reverting to current snapshot
[2021-04-13T15:28:55,402][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] Reverting to snapshot 'empty'
[2021-04-13T15:28:55,443][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Upgrade mode noop
[2021-04-13T15:28:55,468][INFO ][o.e.c.m.MetadataMappingService] [yamlRestTest-0] [.ml-anomalies-shared/rgowJq8gTg-4EAMu7UXFjQ] update_mapping [_doc]
[2021-04-13T15:28:55,504][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-04-13T15:28:55,636][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening records after reverting model: deleting results after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,635][WARN ][o.e.x.t.a.TransportStopTransformAction] [yamlRestTest-0] Could not refresh state, state information might be outdated
<snip>...</snip>
[2021-04-13T15:28:55,643][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening annotations after reverting model: deleting annotations after [1970-01-01T00:00:00.000+0000]
[2021-04-13T15:28:55,646][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-04-13T15:28:55,660][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-04-13T15:28:55,661][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-04-13T15:28:55,694][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-04-13T15:28:55,695][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-04-13T15:28:55,748][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-04-13T15:28:55,749][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing
[2021-04-13T15:28:56,357][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/60227] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-04-13T15:29:55,608][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/qLJ6HGeBTNitPnpB6IBhzw] deleting index

We attempt to kill the job while it is awaiting assignment, THEN we call close, which reports that the job has already been closed or is closing.

But still, see line

[2021-04-13T15:28:56,375][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]

The task STILL gets assigned and opens 🤦. So, if the task was only recently assigned (while reverting the snapshot) and reset mode is enabled, it ignores the stop/kill requests and finishes opening...
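As an illustration only (this is not the real Elasticsearch code — the class, the `communicator` field, and the interleaving mechanics here are hypothetical simplifications), the race described above can be modeled as: a kill that arrives while the process is still starting finds nothing to terminate, leaves no marker behind, and the start path then happily finishes and sets the job state to opened.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the pre-fix race: kill() before the
// communicator exists is a no-op, so the start path still ends OPENED.
public class KillOpenRace {
    enum State { OPENING, OPENED, CLOSING, CLOSED }

    static class JobTask {
        final AtomicReference<State> state = new AtomicReference<>(State.OPENING);
        volatile Object communicator; // created only late in the start path

        void kill() {
            // Simplified pre-fix behaviour: nothing to terminate yet, so the
            // kill silently does nothing and leaves no marker for the start
            // path to observe.
            if (communicator != null) {
                communicator = null;
                state.set(State.CLOSED);
            }
        }

        void finishStarting() {
            // The start path knows nothing of the earlier kill and proceeds.
            communicator = new Object();
            state.set(State.OPENED);
        }
    }

    static State simulateRace() {
        JobTask task = new JobTask();
        CountDownLatch killDone = new CountDownLatch(1);
        Thread killer = new Thread(() -> { task.kill(); killDone.countDown(); });
        killer.start();
        try {
            killDone.await();      // force the bad interleaving: kill first...
            task.finishStarting(); // ...then the start path completes anyway
            killer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return task.state.get();
    }

    public static void main(String[] args) {
        System.out.println("final state: " + simulateRace()); // prints "final state: OPENED"
    }
}
```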

@droberts195
Contributor Author

> are these all related?

When ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled fails during cleanup it sometimes makes the next few tests fail too, because the leftover task is still running during those tests' cleanup. If the next test is an ML test it will likely clean up successfully; if it's not an ML test it won't, because it won't be trying to clean up ML jobs.

The characteristics to look for to say whether a test failure is collateral damage of ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled are:

  1. The test failed shortly after ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled
  2. The failure reason was 3 leftover tasks including feature reset and an ML job, e.g.:
java.lang.AssertionError: 3 active tasks found:
xpack/ml/job[c]                  ocDmIN43QwizJeOU63iCKA:113985 cluster:158                   persistent 1618341176133 19:12:56 2.1m        127.0.0.1 yamlRestTest-0 job-set-upgrade-mode-job
cluster:admin/features/reset     ocDmIN43QwizJeOU63iCKA:114037 -                             transport  1618341176156 19:12:56 2.1m        127.0.0.1 yamlRestTest-0
cluster:admin/xpack/ml/job/close ocDmIN43QwizJeOU63iCKA:114110 -                             transport  1618341176288 19:12:56 2.1m        127.0.0.1 yamlRestTest-0
expected:<0> but was:<3>

@probakowski
Contributor

Are those the same failure? There are a lot of similar failures on Windows:
https://gradle-enterprise.elastic.co/s/akzs7pnfk7auk
https://gradle-enterprise.elastic.co/s/2rkbl6osae4d2

benwtrent added a commit that referenced this issue Apr 14, 2021
If a machine learning job is killed while it is attempting to open, there is a race condition that may cause it to not close.

This is most evident during the `reset_feature` API call. The reset feature API will kill the jobs, then call close quickly to wait for the persistent tasks to complete. 

But, if this is called while a job is attempting to be assigned to a node, there is a window where the process continues to start even though we attempted to kill and close it.

This commit locks the process context on `kill`, and sets the job to `closing`. This way if the process context is already locked (to start), we won't try to kill it until it is fully started.

Setting the job to `closing` allows the starting process to exit early if the `kill` command has already been completed (before the communicator was created).

closes #71646
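The fix described in the commit message can be sketched as follows (this is a hypothetical simplification, not the real `ProcessContext` from Elasticsearch): kill takes the same lock the start path holds, so it can no longer run in the middle of a start, and it records a CLOSING marker that lets a start which has not yet created its communicator exit early instead of finishing the open.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the fix: lock the process context on kill and mark
// the job CLOSING so a not-yet-complete start bails out instead of opening.
public class KillClosesJob {
    enum State { OPENING, OPENED, CLOSING, CLOSED }

    static class ProcessContext {
        private final ReentrantLock lock = new ReentrantLock();
        private volatile State state = State.OPENING;
        private volatile Object communicator;

        void kill() {
            lock.lock(); // waits for any in-flight start step to release the lock
            try {
                state = State.CLOSING; // marker the start path checks
                if (communicator != null) {
                    communicator = null;
                    state = State.CLOSED;
                }
            } finally {
                lock.unlock();
            }
        }

        void finishStarting() {
            lock.lock();
            try {
                if (state == State.CLOSING || state == State.CLOSED) {
                    state = State.CLOSED; // kill already happened: exit early
                    return;
                }
                communicator = new Object();
                state = State.OPENED;
            } finally {
                lock.unlock();
            }
        }

        State state() { return state; }
    }

    public static void main(String[] args) {
        // Same interleaving as the failing test: kill before start completes.
        ProcessContext ctx = new ProcessContext();
        ctx.kill();
        ctx.finishStarting();
        System.out.println("final state: " + ctx.state()); // prints "final state: CLOSED"
    }
}
```

With the marker in place, the order of kill and start no longer matters: either the kill sees a live communicator and terminates it, or the start sees CLOSING and never opens.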
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 15, 2021

closes elastic#71646
benwtrent added a commit that referenced this issue Apr 15, 2021

closes #71646