ML: Adds set_upgrade_mode API endpoint #37837
Conversation
Pinging @elastic/ml-core
Overall this looks very good, but I don't think it's acceptable to wait for every ML persistent task to be assigned to a node when switching out of upgrade mode. There may have been reasons why some tasks couldn't be assigned to nodes irrespective of upgrade mode. So we probably need to look more closely at what was going wrong when trying to enable upgrade mode in the presence of ML tasks not assigned to nodes and fix those problems instead.
Review comments on:
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/XPackClientPlugin.java
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/MlTasks.java
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportOpenJobAction.java
If we are enabling the option, we need to isolate the datafeeds so we can unassign the ML Jobs
If we are disabling the option, we need to wait to make sure all the jobs get reallocated to an appropriate node
I don't think "get reallocated to an appropriate node" is the right condition here. The condition we wait for should be that they have an assignment other than AWAITING_UPGRADE, i.e. we've rechecked their allocations at least once after changing the ML cluster state flag.
We cannot guarantee that every job will be assigned to a node before returning, as there could be a reason other than upgrade mode that is stopping a job being assigned to a node.
@droberts195 yeah, I have been thinking about this. The conditions on return have been a constant internal debate.
Waiting for a getAssignmentExplanation().equals(AWAITING_UPGRADE.explanation) == false check could work.
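To make the condition being discussed concrete, here is a minimal sketch of such a predicate. It assumes AWAITING_UPGRADE is the Assignment constant set while upgrade mode is active (its home in MlTasks is an assumption; the method names follow the snippets quoted in this thread), and the actual check in the PR may differ:

```java
import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;

import static org.elasticsearch.xpack.core.ml.MlTasks.AWAITING_UPGRADE; // assumed location

final class UpgradeModePredicates {

    // A persistent task is considered "settled" once its assignment explanation is
    // no longer the AWAITING_UPGRADE one applied while upgrade mode is enabled.
    static boolean noLongerAwaitingUpgrade(PersistentTask<?> task) {
        return task.getAssignment()
                   .getExplanation()
                   .equals(AWAITING_UPGRADE.getExplanation()) == false;
    }
}
```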
@droberts195 that simple check works for the job tasks. However, for the datafeed tasks, I had to make an additional check to verify that the job's assignmentId has converged.
This SHOULD happen because, when the new assignment and the old one are equivalent, the reassignment logic treats it as a no-op.
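Purely as an illustration of the extra datafeed check described above, a sketch might look like the following. MlTasks.getJobTask, PersistentTask.isAssigned and AWAITING_UPGRADE are assumptions based on this thread, and the PR's actual convergence test may be stricter:

```java
import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;
import org.elasticsearch.xpack.core.ml.MlTasks;

import static org.elasticsearch.xpack.core.ml.MlTasks.AWAITING_UPGRADE; // assumed location

final class DatafeedConvergenceCheck {

    // A datafeed is only treated as settled once the job task it targets is assigned
    // to a node and no longer carries the upgrade-mode assignment explanation.
    static boolean datafeedJobHasConverged(String jobId, PersistentTasksCustomMetaData tasks) {
        PersistentTask<?> jobTask = MlTasks.getJobTask(jobId, tasks);
        return jobTask != null
                && jobTask.isAssigned()
                && jobTask.getAssignment()
                          .getExplanation()
                          .equals(AWAITING_UPGRADE.getExplanation()) == false;
    }
}
```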
Further review comments on:
...plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportSetUpgradeModeAction.java
persistentTasksCustomMetaData.findTasks(DATAFEED_TASK_NAME,
    (t) ->
        t.getAssignment().equals(AWAITING_UPGRADE) ||
        t.getAssignment().getExplanation().contains("state is stale"))
I think this "state is stale" condition is a sign that there will be situations where the endpoint could fail in production use. A datafeed could be unassigned and have "state is stale" in its assignment reason if a node died shortly before set_upgrade_mode?enabled=true was called. Presumably then this situation would have the same "issues" that calling set_upgrade_mode?enabled=false immediately followed by set_upgrade_mode?enabled=true has.
If this is a really hard problem to solve then we could merge this PR as-is so that the Kibana team have something to test against, but keep working on a followup to do whatever is required to remove this "state is stale" test.
@droberts195 If the node died, the job would attempt to re-assign, right?
- It could initially fail due to resource constraints (the assignmentId increments as the Assignment reason changes)
- It then attempts again during upgrade mode (the assignmentId increments as the explanation is now due to the upgrade)
- Once the upgrade finishes, it tries again and can either join a node or fail due to resource constraints; either way it should stabilize to a specific assignment?
I am not sure what all updates that internal task state so that it would eventually come in sync with what the datafeed sees.
run elasticsearch-ci/oss-distro-docs tests
LGTM
I think this is good enough to give the Kibana team something to start testing with.
I'll look into the "state is stale" condition myself before 6.7 feature freeze.
* ML: Add MlMetadata.upgrade_mode and API
* Adding tests
* Adding wait conditionals for the upgrade_mode call to return
* Adding tests
* adjusting format and tests
* Adjusting wait conditions for api return and msgs
* adjusting doc tests
* adding upgrade mode tests to black list
This adds the ability for the cluster to enter "upgrade mode" for ML jobs and datafeeds.
This entails the following:
- upgrade_mode is enabled/disabled according to its boolean value (the cluster starts out with upgrade_mode: false)
The API is synchronous and only allows one caller to hit it at a time. When the API returns, the guarantees are:
- With upgrade_mode: true, all tasks are unassigned and stopped, meaning that the .ml* indices are no longer being written to by any internal processes
- With upgrade_mode: false, all tasks have been re-assigned to appropriate nodes and are executing again
Note: When jobs are restarted, they restart from some time in the past and load their previous state (if available) from a snapshot.
Still need to add docs, the rest should be good to review
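For anyone wanting to test against this before the docs land, a minimal usage sketch with the low-level Java REST client is shown below; the exact URL path (/_ml/set_upgrade_mode) is an assumption based on the endpoint name discussed in this PR:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SetUpgradeModeExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Enable upgrade mode: per the description above, the call returns once ML tasks
            // are unassigned and the .ml* indices are no longer written to.
            // The path is an assumption based on the endpoint name in this PR.
            Request enable = new Request("POST", "/_ml/set_upgrade_mode");
            enable.addParameter("enabled", "true");
            Response enableResponse = client.performRequest(enable);
            System.out.println(enableResponse.getStatusLine());

            // Disable it again once the upgrade is done; the call blocks until tasks are
            // no longer marked with the upgrade-mode assignment explanation.
            Request disable = new Request("POST", "/_ml/set_upgrade_mode");
            disable.addParameter("enabled", "false");
            System.out.println(client.performRequest(disable).getStatusLine());
        }
    }
}
```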