ML: Add upgrade mode docs, hlrc, and fix bug #37942
Conversation
Pinging @elastic/ml-core
LGTM
Before indices related to {ml} jobs and {dfeeds} can be locked and upgraded, all
currently running jobs and {dfeeds} should be paused. When you set the `enabled`
parameter to `true`, the API pauses all job and {dfeed} tasks, which facilitates
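For context, a minimal sketch of enabling upgrade mode through the high-level REST client that this PR adds. The class and method names (`SetUpgradeModeRequest`, `machineLearning().setUpgradeMode`) are assumed from the usual HLRC conventions and should be checked against the merged code.

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.AcknowledgedResponse;
import org.elasticsearch.client.ml.SetUpgradeModeRequest;

public class EnableUpgradeMode {

    // Pause all ML job and datafeed tasks before locking and upgrading the
    // ML internal indices. Assumes `client` is an already-configured HLRC.
    static void enableUpgradeMode(RestHighLevelClient client) throws Exception {
        SetUpgradeModeRequest request = new SetUpgradeModeRequest(true);
        AcknowledgedResponse response =
            client.machineLearning().setUpgradeMode(request, RequestOptions.DEFAULT);
        // true once all ML tasks have been paused and new ones are blocked
        System.out.println("Upgrade mode enabled: " + response.isAcknowledged());
    }
}
```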
The upgrade docs say "Stop any X-Pack machine learning jobs that are running before starting the upgrade process." Can we therefore change this introductory sentence to something like this?:
"Before you start the upgrade process, you must stop all jobs and datafeeds. This API simplifies that task."
Short answer
I don't think we should use the word "stop" because we have a "stop datafeed" API and this endpoint does something different.
We could change the introductory sentence to something like this:
Before you start an upgrade that requires reindexing of ML internal indices, you must halt all jobs and {dfeeds}. This API simplifies that task. It can also be used during other upgrades to prevent {ml} jobs from shifting between nodes multiple times as the upgrade progresses.
Long answer
We should probably expand the section in the upgrade docs to match current reality more closely. In the beginning (5.4/5.5) it was not safe to leave ML jobs running during a rolling upgrade, and the current advice dates back to then.
Subsequently we've worked to make rolling upgrades with open ML jobs safe, at least in terms of things not completely breaking. However, if practical, there is still a benefit to gracefully closing all ML jobs before upgrading and reopening them after the upgrade is complete: a graceful close persists model state at the moment of closure, so jobs restart with the exact same model they had before the upgrade. The "if practical" part is important though, because gracefully closing a large number of jobs, or jobs with very large model state, can take a long time, and such a delay is not always acceptable. We persist model state in the background at the end of lookback and every 3-4 hours during real-time operation, so a model is unlikely to have changed dramatically between the last background persistence and a forceful close.
Therefore what to do on upgrade is very much an "it depends" situation.
From an ML perspective there are two types of upgrade:
- An upgrade where nodes have to be restarted but indices do not have to be reindexed
- An upgrade where nodes have to be restarted and indices have to be reindexed
Type 2 here is much rarer than type 1. It only occurs on a major version upgrade where ML was used in the previous major version, for example an upgrade from 6.x to 7.x where ML was first used in 5.x.
A type 2 upgrade has the hard requirement that jobs are not running during the reindexing portion of the upgrade, and there are two options to achieve this:
- Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
- Enable upgrade mode, do the upgrade, disable upgrade mode
A type 1 upgrade does not have such a hard requirement. There are 3 options:
- Do the upgrade without closing ML jobs - ML jobs will shift around the available ML nodes as nodes are stopped and started, leading to the highest availability of active ML jobs but also the highest cluster load, as ML jobs repeatedly start up and restore their model state
- Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening (see the sketch after this comment)
- Enable upgrade mode, do the upgrade, disable upgrade mode
So in the type 1 case, upgrade mode avoids some of the cluster load from jobs potentially shifting nodes multiple times and loading their model state multiple times as the upgrade progresses. But it does not provide the benefit of persisting the absolute latest model state immediately before the upgrade and restoring it on restart.
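To make the bookkeeping point in the close/reopen option concrete, here is a rough HLRC sketch that remembers which jobs were open before closing them. The stats accessors (`jobStats()`, `getState()`) and the `JobState` location are assumptions based on the HLRC of the time and should be verified.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.ml.CloseJobRequest;
import org.elasticsearch.client.ml.GetJobStatsRequest;
import org.elasticsearch.client.ml.OpenJobRequest;
import org.elasticsearch.client.ml.job.config.JobState;
import org.elasticsearch.client.ml.job.stats.JobStats;

public class CloseReopenJobs {

    // Record which jobs are open, close them gracefully, and return the ids
    // so they can be reopened after the upgrade completes.
    static List<String> closeOpenJobs(RestHighLevelClient client) throws Exception {
        List<String> openJobIds = client.machineLearning()
            .getJobStats(new GetJobStatsRequest("_all"), RequestOptions.DEFAULT)
            .jobStats().stream()
            .filter(stats -> stats.getState() == JobState.OPENED)
            .map(JobStats::getJobId)
            .collect(Collectors.toList());
        if (openJobIds.isEmpty() == false) {
            // A graceful close persists model state at the moment of closure.
            client.machineLearning().closeJob(
                new CloseJobRequest(openJobIds.toArray(new String[0])), RequestOptions.DEFAULT);
        }
        return openJobIds;
    }

    // After the upgrade, reopen exactly the jobs that were closed; this list
    // is the state an external person or process has to carry across the upgrade.
    static void reopenJobs(RestHighLevelClient client, List<String> jobIds) throws Exception {
        for (String jobId : jobIds) {
            client.machineLearning().openJob(new OpenJobRequest(jobId), RequestOptions.DEFAULT);
        }
    }
}
```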
@lcawl this is not stopping either. At least it is not stopping them in the sense we use elsewhere in the documentation.
It is stopping the current tasks from executing and disallowing new ones from being started. I will try to fix the wording.
I'm not a big fan of "halt", since that seems to just be a synonym of "stop". If it's doing something different (i.e. stopping tasks within the job or datafeed without actually stopping the job or datafeed), then we either need a different word or to be clearer about exactly what it's stopping.
How about something like this:
When you upgrade your cluster, in some circumstances you must restart your nodes and reindex your {ml} indices. In those circumstances, there must be no {ml} jobs running. You can close the {ml} jobs, do the upgrade, then open all the jobs again. Alternatively, you can use this API to temporarily pause the jobs [or whatever works best here... stop the tasks associated with the jobs?] and prevent new jobs from opening. You can also use this API during upgrades that do not require you to reindex your {ml} indices, though stopping jobs is not a requirement in that case. For more information, see {stack-ref}/upgrading-elastic-stack.html[Upgrading the {stack}].
Note: I've created a task to improve the upgrade docs here: elastic/stack-docs#192
I left a few comments, one of which is large and partly relates to the ML upgrade docs. The ML upgrade docs can be changed in a followup PR to keep this one moving, but I added my thoughts here to give the background for other suggestions I made.
…lasticsearch into feature/ml-upgrade-mode-docs
This looks good to me now.
@lcawl would you like to suggest any further changes to wording before this is merged?
prohibits new job and {dfeed} tasks from starting.

Subsequently, you can call the API with the `enabled` parameter set to `false`,
which causes {ml} jobs and {dfeeds} to return to their desired states.
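And the reverse call once the upgrade is complete, again as a sketch against the HLRC additions in this PR (same assumed class and method names as above):

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.AcknowledgedResponse;
import org.elasticsearch.client.ml.SetUpgradeModeRequest;

public class DisableUpgradeMode {

    // Setting enabled=false lifts upgrade mode; jobs and datafeeds return to
    // the states they were in when upgrade mode was enabled.
    static void disableUpgradeMode(RestHighLevelClient client) throws Exception {
        AcknowledgedResponse response = client.machineLearning()
            .setUpgradeMode(new SetUpgradeModeRequest(false), RequestOptions.DEFAULT);
        System.out.println("Upgrade mode disabled: " + response.isAcknowledged());
    }
}
```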
nit: there's a space at the beginning of the line
* ML: Add upgrade mode docs, hlrc, and fix bug
* [DOCS] Fixes build error and edits text
* adjusting docs
* Update docs/reference/ml/apis/set-upgrade-mode.asciidoc (Co-Authored-By: benwtrent <[email protected]>)
* Update set-upgrade-mode.asciidoc
* Update set-upgrade-mode.asciidoc
```diff
@@ -224,7 +224,7 @@ public MlMetadataDiff(StreamInput in) throws IOException {
     public void writeTo(StreamOutput out) throws IOException {
         jobs.writeTo(out);
         datafeeds.writeTo(out);
-        if (out.getVersion().onOrAfter(Version.V_7_0_0)) {
+        if (out.getVersion().onOrAfter(Version.V_6_7_0)) {
```
would it be possible to fix this separately next time, and add unit tests for serialization that fail with the wrong version conditionals?
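A sketch of what such a test might look like, using the standard wire round-trip pattern from the Elasticsearch test framework. `createTestDiffWithUpgradeMode()` is a hypothetical helper (left as a stub here) for building a diff with the new flag set; the equality assertion assumes `MlMetadataDiff` is comparable in some way and may need adjusting.

```java
import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xpack.core.ml.MlMetadata;

public class MlMetadataDiffBwcTests extends ESTestCase {

    public void testUpgradeModeSurvivesBwcRoundTrip() throws Exception {
        MlMetadata.MlMetadataDiff original = createTestDiffWithUpgradeMode();

        // Round-trip at the 6.7 wire version. With the conditional wrongly
        // written against V_7_0_0, the flag is silently dropped at this
        // version and the assertion below fails.
        Version wireVersion = Version.V_6_7_0;
        try (BytesStreamOutput out = new BytesStreamOutput()) {
            out.setVersion(wireVersion);
            original.writeTo(out);
            try (StreamInput in = out.bytes().streamInput()) {
                in.setVersion(wireVersion);
                MlMetadata.MlMetadataDiff roundTripped = new MlMetadata.MlMetadataDiff(in);
                assertEquals(original, roundTripped);
            }
        }
    }

    // Hypothetical helper: build a diff whose upgrade_mode flag is set.
    private static MlMetadata.MlMetadataDiff createTestDiffWithUpgradeMode() {
        throw new UnsupportedOperationException("construct a random MlMetadataDiff here");
    }
}
```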
This adds reference docs, the HLRC side of things, and fixes a bug that occurred when the option was set while no ML tasks existed.