
ML: Adds set_upgrade_mode API endpoint #37837

Merged
merged 10 commits into elastic:master from feature/ml-upgrade-mode
Jan 28, 2019

Conversation

benwtrent
Member

This adds the ability for the cluster to enter "upgrade mode" for ML jobs and datafeeds.

This entails the following:

  • Setting a field in the cluster state (in ml_metadata) whose boolean value indicates whether upgrade_mode is enabled or disabled (see the sketch after this list)
  • Isolating Datafeeds so that they stop pushing data to the ML jobs
  • Unassigning Datafeed and Job persistent tasks so that they stop executing and can later be restarted
    • This is done with the tasks staying in the cluster state so that they can be reassigned to an appropriate node when upgrade_mode: false
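
For illustration, a minimal sketch of consulting that flag; the accessor names are assumptions inferred from the MlMetadata.upgrade_mode naming, not exact code from this PR:

// Sketch only: how other ML code might consult the new upgrade_mode flag.
MlMetadata mlMetadata = MlMetadata.getMlMetadata(clusterState);
if (mlMetadata.isUpgradeMode()) {
    // While in upgrade mode, do not assign ML persistent tasks,
    // open jobs, or start datafeeds.
}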

The API is synchronous and only allows one caller to hit it at a time. When the API call returns, the guarantees are:

  • If upgrade_mode: true, all tasks are unassigned and stopped, meaning that no .ml* indices are being written to by any internal processes
  • If upgrade_mode: false, all tasks have been re-assigned to appropriate nodes and are executing again.

Note: When jobs are restarted, they restart from some time in the past and load their previous state (if available) from a snapshot.
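
For reference, a minimal sketch of driving the new endpoint from the low-level Java REST client; the _ml/set_upgrade_mode path and the client setup are assumptions for illustration, not an excerpt from this PR:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SetUpgradeModeExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Enter upgrade mode: datafeeds are isolated and ML persistent tasks are unassigned.
            Request enable = new Request("POST", "/_ml/set_upgrade_mode");
            enable.addParameter("enabled", "true");
            Response response = client.performRequest(enable);
            System.out.println(response.getStatusLine());

            // ... upgrade or reindex the .ml* indices here ...

            // Leave upgrade mode: tasks are re-assigned and jobs resume from their last snapshot.
            Request disable = new Request("POST", "/_ml/set_upgrade_mode");
            disable.addParameter("enabled", "false");
            client.performRequest(disable);
        }
    }
}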

Still need to add docs; the rest should be good to review.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

droberts195 left a comment


Overall this looks very good, but I don't think it's acceptable to wait for every ML persistent task to be assigned to a node when switching out of upgrade mode. There may have been reasons why some tasks couldn't be assigned to nodes irrespective of upgrade mode. So we probably need to look more closely at what was going wrong when trying to enable upgrade mode in the presence of ML tasks not assigned to nodes and fix those problems instead.

If we are enabling the option, we need to isolate the datafeeds so we can unassign the ML Jobs
If we are disabling the option, we need to wait to make sure all the jobs get reallocated to an appropriate node
Contributor


I don't think "get reallocated to an appropriate node" is the right condition here. The condition we wait for should be that they have an assignment other than AWAITING_UPGRADE, i.e. we've rechecked their allocations at least once after changing the ML cluster state flag.

We cannot guarantee that every job will be assigned to a node before returning, as there could be a reason other than upgrade mode that is stopping a job being assigned to a node.
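
For illustration, a rough sketch of that wait condition; names such as AWAITING_UPGRADE and the surrounding variables are assumptions taken from this discussion, not quoted from the PR diff:

// Sketch only: job tasks still blocked purely by upgrade mode. The wait ends when this
// collection is empty; a task may still be unassigned afterwards for unrelated reasons.
Collection<PersistentTasksCustomMetaData.PersistentTask<?>> stillAwaitingUpgrade =
    persistentTasksCustomMetaData.findTasks(MlTasks.JOB_TASK_NAME,
        task -> task.getAssignment().equals(AWAITING_UPGRADE));
boolean jobsReady = stillAwaitingUpgrade.isEmpty();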

Member Author


@droberts195 yeah, I have been thinking about this. The conditions on return have been a constant internal debate.

Waiting for a getAssignmentExplanation().equals(AWAITING_UPGRADE.explanation) == false check could work.

Member Author


@droberts195 that simple check works for the job tasks. However, for the datafeed tasks, I had to make an additional check to verify that the job assignmentId has converged.

This SHOULD happen, because when the new assignment and the old one are equivalent it is a no-op in the reassignment logic:

// Datafeed tasks that have not yet converged: still awaiting upgrade, or with an
// assignment explanation that still refers to stale job state.
persistentTasksCustomMetaData.findTasks(DATAFEED_TASK_NAME,
    (t) ->
        t.getAssignment().equals(AWAITING_UPGRADE) ||
        t.getAssignment().getExplanation().contains("state is stale"))
Contributor


I think this "state is stale" condition is a sign that there will be situations where the endpoint could fail in production use. A datafeed could be unassigned and have "state is stale" in its assignment reason if a node died shortly before set_upgrade_mode?enabled=true was called. Presumably then this situation would have the same "issues" that calling set_upgrade_mode?enabled=false immediately followed by set_upgrade_mode?enabled=true has.

If this is a really hard problem to solve then we could merge this PR as-is so that the Kibana team have something to test against, but keep working on a followup to do whatever is required to remove this "state is stale" test.

Member Author


@droberts195 If the node died, the job would attempt to re-assign, right?

  • Could initially fail due to resource constraints (assignmentId increments as the Assignment reason changes)
  • Then attempts again during upgrade (assignmentId increments as the explanation reason is now due to the upgrade)
  • Upgrade finishes, tries again, and can either join a node or fail due to resource constraints; either way it should stabilize to a specific assignment?

I am not sure which components update that internal task state so that it eventually comes in sync with what the datafeed sees.

@benwtrent
Member Author

run elasticsearch-ci/oss-distro-docs tests

Contributor

droberts195 left a comment


LGTM

I think this is good enough to give the Kibana team something to start testing with.

I'll look into the "state is stale" condition myself before 6.7 feature freeze.

@benwtrent benwtrent merged commit 7e4c0e6 into elastic:master Jan 28, 2019
@benwtrent benwtrent deleted the feature/ml-upgrade-mode branch January 28, 2019 15:07
benwtrent added a commit that referenced this pull request Jan 28, 2019
* ML: Add MlMetadata.upgrade_mode and API

* Adding tests

* Adding wait conditionals for the upgrade_mode call to return

* Adding tests

* adjusting format and tests

* Adjusting wait conditions for api return and msgs

* adjusting doc tests

* adding upgrade mode tests to black list