ML: Add upgrade mode docs, hlrc, and fix bug #37942
Conversation
Pinging @elastic/ml-core
LGTM
Before indices related to {ml} jobs and {dfeeds} can be locked and upgraded, all
currently running jobs and {dfeeds} should be paused. When you set the `enabled`
parameter to `true`, the API pauses all job and {dfeed} tasks, which facilitates
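For context, a minimal sketch of enabling upgrade mode through the high-level REST client that this PR adds. The class and method names (`SetUpgradeModeRequest`, `machineLearning().setUpgradeMode`) are assumed from the usual HLRC conventions and should be checked against the merged code.

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.AcknowledgedResponse;
import org.elasticsearch.client.ml.SetUpgradeModeRequest;

public class EnableUpgradeMode {

    // Pause all ML job and datafeed tasks before locking and upgrading the
    // ML internal indices. Assumes `client` is an already-configured HLRC.
    static void enableUpgradeMode(RestHighLevelClient client) throws Exception {
        SetUpgradeModeRequest request = new SetUpgradeModeRequest(true);
        AcknowledgedResponse response =
            client.machineLearning().setUpgradeMode(request, RequestOptions.DEFAULT);
        // true once all ML tasks have been paused and new ones are blocked
        System.out.println("Upgrade mode enabled: " + response.isAcknowledged());
    }
}
```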
The upgrade docs say "Stop any X-Pack machine learning jobs that are running before starting the upgrade process." Can we therefore change this introductory sentence to something like this?:
"Before you start the upgrade process, you must stop all jobs and datafeeds. This API simplifies that task."
Short answer
I don't think we should use the word "stop" because we have a "stop datafeed" API and this endpoint does something different.
We could change the introductory sentence to something like this:
Before you start an upgrade that requires reindexing of ML internal indices, you must halt all jobs and {dfeeds}. This API simplifies that task. It can also be used during other upgrades to prevent {ml} jobs from shifting between nodes multiple times as the upgrade progresses.
Long answer
We should probably expand the section in the upgrade docs to match current reality more closely. In the beginning (5.4/5.5) it was not safe to leave ML jobs running during a rolling upgrade, and the current advice dates back to then.
Subsequently we've worked to make rolling upgrades with open ML jobs safe, at least in terms of things not completely breaking. However, if practical, there is still a benefit to gracefully closing all ML jobs before upgrading and reopening them after the upgrade is complete: a graceful close persists model state at the moment of closure, so jobs restart with the exact same model they had before the upgrade. The "if practical" part is important though, because gracefully closing a large number of jobs, or jobs with very large model state, can take a long time, and such a delay is not always acceptable. We persist model state in the background at the end of lookback and every 3-4 hours during real-time operation, so a model is unlikely to have changed dramatically between the last background persistence and a forceful close.
Therefore what to do on upgrade is very much an "it depends" situation.
From an ML perspective there are two types of upgrade:
- An upgrade where nodes have to be restarted but indices do not have to be reindexed
- An upgrade where nodes have to be restarted and indices have to be reindexed
Type 2 here is much rarer than type 1. It only occurs on a major version upgrade where ML was used in the previous major version, for example an upgrade from 6.x to 7.x where ML was first used in 5.x.
A type 2 upgrade has the hard requirement that jobs are not running during the reindexing portion of the upgrade, and there are two options to achieve this:
- Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
- Enable upgrade mode, do the upgrade, disable upgrade mode
A type 1 upgrade does not have such a hard requirement. There are 3 options:
- Do the upgrade without closing ML jobs - ML jobs will shift around the available ML nodes as nodes are stopped and started, leading to the highest availability of active ML jobs but also the highest cluster load, as ML jobs repeatedly start up and restore their model state
- Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening (see the sketch after this comment)
- Enable upgrade mode, do the upgrade, disable upgrade mode
So in the type 1 case, upgrade mode avoids some of the cluster load from jobs potentially shifting nodes multiple times and loading their model state multiple times as the upgrade progresses. But it does not provide the benefit of persisting the absolute latest model state immediately before the upgrade and restoring it on restart.
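To make the bookkeeping point in the close/reopen option concrete, here is a rough HLRC sketch that remembers which jobs were open before closing them. The stats accessors (`jobStats()`, `getState()`) and the `JobState` location are assumptions based on the HLRC of the time and should be verified.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.ml.CloseJobRequest;
import org.elasticsearch.client.ml.GetJobStatsRequest;
import org.elasticsearch.client.ml.OpenJobRequest;
import org.elasticsearch.client.ml.job.config.JobState;
import org.elasticsearch.client.ml.job.stats.JobStats;

public class CloseReopenJobs {

    // Record which jobs are open, close them gracefully, and return the ids
    // so they can be reopened after the upgrade completes.
    static List<String> closeOpenJobs(RestHighLevelClient client) throws Exception {
        List<String> openJobIds = client.machineLearning()
            .getJobStats(new GetJobStatsRequest("_all"), RequestOptions.DEFAULT)
            .jobStats().stream()
            .filter(stats -> stats.getState() == JobState.OPENED)
            .map(JobStats::getJobId)
            .collect(Collectors.toList());
        if (openJobIds.isEmpty() == false) {
            // A graceful close persists model state at the moment of closure.
            client.machineLearning().closeJob(
                new CloseJobRequest(openJobIds.toArray(new String[0])), RequestOptions.DEFAULT);
        }
        return openJobIds;
    }

    // After the upgrade, reopen exactly the jobs that were closed; this list
    // is the state an external person or process has to carry across the upgrade.
    static void reopenJobs(RestHighLevelClient client, List<String> jobIds) throws Exception {
        for (String jobId : jobIds) {
            client.machineLearning().openJob(new OpenJobRequest(jobId), RequestOptions.DEFAULT);
        }
    }
}
```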
@lcawl this is not stopping either. At least it is not stopping them in the sense we use elsewhere in the documentation.
It is stopping the current tasks from executing and disallowing new ones from being started. I will try to fix the wording.
I'm not a big fan of "halt", since that seems to just be a synonym of "stop". If it's doing something different (i.e. stopping tasks within the job or datafeed without actually stopping the job or datafeed), then we either need a different word or to be clearer about exactly what it's stopping.
How about something like this:
When you upgrade your cluster, in some circumstances you must restart your nodes and reindex your {ml} indices. In those circumstances, there must be no {ml} jobs running. You can close the {ml} jobs, do the upgrade, then open all the jobs again. Alternatively, you can use this API to temporarily pause the jobs [or whatever works best here... stop the tasks associated with the jobs?] and prevent new jobs from opening. You can also use this API during upgrades that do not require you to reindex your {ml} indices, though stopping jobs is not a requirement in that case. For more information, see {stack-ref}/upgrading-elastic-stack.html[Upgrading the {stack}].
Note: I've created a task to improve the upgrade docs here: elastic/stack-docs#192
I left a few comments, one of which is large and partly relates to the ML upgrade docs. The ML upgrade docs can be changed in a followup PR to keep this one moving, but I added my thoughts here to give the background for other suggestions I made.
…lasticsearch into feature/ml-upgrade-mode-docs
This looks good to me now.
@lcawl would you like to suggest any further changes to wording before this is merged?
prohibits new job and {dfeed} tasks from starting.

Subsequently, you can call the API with the `enabled` parameter set to `false`,
which causes {ml} jobs and {dfeeds} to return to their desired states.
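And the reverse call once the upgrade is complete, again as a sketch against the HLRC additions in this PR (same assumed class and method names as above):

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.AcknowledgedResponse;
import org.elasticsearch.client.ml.SetUpgradeModeRequest;

public class DisableUpgradeMode {

    // Setting enabled=false lifts upgrade mode; jobs and datafeeds return to
    // the states they were in when upgrade mode was enabled.
    static void disableUpgradeMode(RestHighLevelClient client) throws Exception {
        AcknowledgedResponse response = client.machineLearning()
            .setUpgradeMode(new SetUpgradeModeRequest(false), RequestOptions.DEFAULT);
        System.out.println("Upgrade mode disabled: " + response.isAcknowledged());
    }
}
```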
nit: there's a space at the beginning of the line
* ML: Add upgrade mode docs, hlrc, and fix bug
* [DOCS] Fixes build error and edits text
* adjusting docs
* Update docs/reference/ml/apis/set-upgrade-mode.asciidoc (Co-Authored-By: benwtrent <[email protected]>)
* Update set-upgrade-mode.asciidoc
* Update set-upgrade-mode.asciidoc
```diff
@@ -224,7 +224,7 @@ public MlMetadataDiff(StreamInput in) throws IOException {
     public void writeTo(StreamOutput out) throws IOException {
         jobs.writeTo(out);
         datafeeds.writeTo(out);
-        if (out.getVersion().onOrAfter(Version.V_7_0_0)) {
+        if (out.getVersion().onOrAfter(Version.V_6_7_0)) {
```
would it be possible to fix this separately next time, and add unit tests for serialization that fail with the wrong version conditionals?
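A sketch of what such a test might look like, using the standard wire round-trip pattern from the Elasticsearch test framework. `createTestDiffWithUpgradeMode()` is a hypothetical helper (left as a stub here) for building a diff with the new flag set; the equality assertion assumes `MlMetadataDiff` is comparable in some way and may need adjusting.

```java
import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xpack.core.ml.MlMetadata;

public class MlMetadataDiffBwcTests extends ESTestCase {

    public void testUpgradeModeSurvivesBwcRoundTrip() throws Exception {
        MlMetadata.MlMetadataDiff original = createTestDiffWithUpgradeMode();

        // Round-trip at the 6.7 wire version. With the conditional wrongly
        // written against V_7_0_0, the flag is silently dropped at this
        // version and the assertion below fails.
        Version wireVersion = Version.V_6_7_0;
        try (BytesStreamOutput out = new BytesStreamOutput()) {
            out.setVersion(wireVersion);
            original.writeTo(out);
            try (StreamInput in = out.bytes().streamInput()) {
                in.setVersion(wireVersion);
                MlMetadata.MlMetadataDiff roundTripped = new MlMetadata.MlMetadataDiff(in);
                assertEquals(original, roundTripped);
            }
        }
    }

    // Hypothetical helper: build a diff whose upgrade_mode flag is set.
    private static MlMetadata.MlMetadataDiff createTestDiffWithUpgradeMode() {
        throw new UnsupportedOperationException("construct a random MlMetadataDiff here");
    }
}
```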
This adds reference docs, the HLRC side of things, and fixes a bug that occurred when the option was set while no ML tasks existed.