
ML: Adds set_upgrade_mode API endpoint #37837

Merged
merged 10 commits into elastic:master from feature/ml-upgrade-mode
Jan 28, 2019

Conversation

benwtrent
Member

This adds the ability for the cluster to enter "upgrade mode" for ML jobs and datafeeds.

This entails the following:

  • Setting a field in the cluster state (in ml_metadata) whose boolean value indicates whether upgrade_mode is enabled or disabled (see the sketch after this list)
  • Isolating Datafeeds so that they stop pushing data to the ML jobs
  • Unassigning Datafeed and Job persistent tasks so that they stop executing and can later be restarted
    • This is done with the tasks staying in the cluster state so that they can be reassigned to an appropriate node when upgrade_mode: false
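
For illustration, a minimal sketch of consulting that flag; the accessor names are assumptions inferred from the MlMetadata.upgrade_mode naming, not exact code from this PR:

// Sketch only: how other ML code might consult the new upgrade_mode flag.
MlMetadata mlMetadata = MlMetadata.getMlMetadata(clusterState);
if (mlMetadata.isUpgradeMode()) {
    // While in upgrade mode, do not assign ML persistent tasks,
    // open jobs, or start datafeeds.
}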

The API is synchronous and only allows one caller to hit it at a time. When the API call returns, the guarantees are:

  • If upgrade_mode: true, all tasks are unassigned and stopped, meaning that no .ml* indices are being written to by any internal processes
  • If upgrade_mode: false, all tasks have been re-assigned to appropriate nodes and are executing again.

Note: When jobs are restarted, they restart from some time in the past and load their previous state (if available) from a snapshot.
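
For reference, a minimal sketch of driving the new endpoint from the low-level Java REST client; the _ml/set_upgrade_mode path and the client setup are assumptions for illustration, not an excerpt from this PR:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SetUpgradeModeExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Enter upgrade mode: datafeeds are isolated and ML persistent tasks are unassigned.
            Request enable = new Request("POST", "/_ml/set_upgrade_mode");
            enable.addParameter("enabled", "true");
            Response response = client.performRequest(enable);
            System.out.println(response.getStatusLine());

            // ... upgrade or reindex the .ml* indices here ...

            // Leave upgrade mode: tasks are re-assigned and jobs resume from their last snapshot.
            Request disable = new Request("POST", "/_ml/set_upgrade_mode");
            disable.addParameter("enabled", "false");
            client.performRequest(disable);
        }
    }
}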

Still need to add docs; the rest should be good to review.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

droberts195 left a comment


Overall this looks very good, but I don't think it's acceptable to wait for every ML persistent task to be assigned to a node when switching out of upgrade mode. There may have been reasons why some tasks couldn't be assigned to nodes irrespective of upgrade mode. So we probably need to look more closely at what was going wrong when trying to enable upgrade mode in the presence of ML tasks not assigned to nodes and fix those problems instead.

If we are enabling the option, we need to isolate the datafeeds so we can unassign the ML Jobs
If we are disabling the option, we need to wait to make sure all the jobs get reallocated to an appropriate node
Contributor


I don't think "get reallocated to an appropriate node" is the right condition here. The condition we wait for should be that they have an assignment other than AWAITING_UPGRADE, i.e. we've rechecked their allocations at least once after changing the ML cluster state flag.

We cannot guarantee that every job will be assigned to a node before returning, as there could be a reason other than upgrade mode that is stopping a job being assigned to a node.
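
For illustration, a rough sketch of that wait condition; names such as AWAITING_UPGRADE and the surrounding variables are assumptions taken from this discussion, not quoted from the PR diff:

// Sketch only: job tasks still blocked purely by upgrade mode. The wait ends when this
// collection is empty; a task may still be unassigned afterwards for unrelated reasons.
Collection<PersistentTasksCustomMetaData.PersistentTask<?>> stillAwaitingUpgrade =
    persistentTasksCustomMetaData.findTasks(MlTasks.JOB_TASK_NAME,
        task -> task.getAssignment().equals(AWAITING_UPGRADE));
boolean jobsReady = stillAwaitingUpgrade.isEmpty();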

Member Author


@droberts195 yeah, I have been thinking about this. The conditions on return have been a constant internal debate.

Waiting for a getAssignmentExplanation().equals(AWAITING_UPGRADE.explanation) == false check could work.

Member Author


@droberts195 that simple check works for the job tasks. However, for the datafeed tasks, I had to make an additional check to verify that the job assignmentId has converged.

This SHOULD happen, because when the new assignment and the old one are equivalent it is a no-op in the reassignment logic:

// Datafeed tasks that have not yet converged: still awaiting upgrade, or with an
// assignment explanation that still refers to stale job state.
persistentTasksCustomMetaData.findTasks(DATAFEED_TASK_NAME,
    (t) ->
        t.getAssignment().equals(AWAITING_UPGRADE) ||
        t.getAssignment().getExplanation().contains("state is stale"))
Contributor


I think this "state is stale" condition is a sign that there will be situations where the endpoint could fail in production use. A datafeed could be unassigned and have "state is stale" in its assignment reason if a node died shortly before set_upgrade_mode?enabled=true was called. Presumably then this situation would have the same "issues" that calling set_upgrade_mode?enabled=false immediately followed by set_upgrade_mode?enabled=true has.

If this is a really hard problem to solve then we could merge this PR as-is so that the Kibana team have something to test against, but keep working on a followup to do whatever is required to remove this "state is stale" test.

Member Author


@droberts195 If the node died, the job would attempt to re-assign, right?

  • Could initially fail due to resource constraints (assignmentId increments as the Assignment reason changes)
  • Then attempts again during upgrade (assignmentId increments as the explanation reason is now due to the upgrade)
  • Upgrade finishes, tries again, and can either join a node or fail due to resource constraints; either way it should stabilize to a specific assignment?

I am not sure which components update that internal task state so that it eventually comes in sync with what the datafeed sees.

@benwtrent
Member Author

run elasticsearch-ci/oss-distro-docs tests

Contributor

droberts195 left a comment


LGTM

I think this is good enough to give the Kibana team something to start testing with.

I'll look into the "state is stale" condition myself before 6.7 feature freeze.

@benwtrent benwtrent merged commit 7e4c0e6 into elastic:master Jan 28, 2019
@benwtrent benwtrent deleted the feature/ml-upgrade-mode branch January 28, 2019 15:07
benwtrent added a commit that referenced this pull request Jan 28, 2019
* ML: Add MlMetadata.upgrade_mode and API

* Adding tests

* Adding wait conditionals for the upgrade_mode call to return

* Adding tests

* adjusting format and tests

* Adjusting wait conditions for api return and msgs

* adjusting doc tests

* adding upgrade mode tests to black list