
[ML] Datafeed does not start when allow_lazy_open is enabled #53763

Closed · sophiec20 opened this issue Mar 18, 2020 · 4 comments · Fixed by #53918
Labels: >bug, :ml Machine learning

sophiec20 (Contributor) commented Mar 18, 2020

Found in 7.7.0-SNAPSHOT ("build_hash" : "2f0aca992bb8c91c17603050807891cad2e41483", "build_date" : "2020-03-16T02:52:34.086738Z").

  • 3 node cluster, all nodes acting as data, master and ml
  • All nodes are co-located on the same 16GB VM
  • "xpack.ml.max_machine_memory_percent" : 16

I have a script that creates 16 jobs in succession. Each job requires 2GB model memory.
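
For context on the numbers: 16% of the 16GB machine is roughly 2.5GB of ML memory per node, so each of the 3 nodes can hold exactly one 2GB job, which is consistent with only the first 3 jobs opening immediately. A minimal sketch of the calls such a script would make for one job, with hypothetical job, datafeed and index names (lazy-test-4, datafeed-lazy-test-4, my-metrics):

```
PUT _ml/anomaly_detectors/lazy-test-4
{
  "allow_lazy_open": true,
  "analysis_limits": { "model_memory_limit": "2048mb" },
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}

PUT _ml/datafeeds/datafeed-lazy-test-4
{
  "job_id": "lazy-test-4",
  "indices": [ "my-metrics" ]
}

POST _ml/anomaly_detectors/lazy-test-4/_open

POST _ml/datafeeds/datafeed-lazy-test-4/_start
```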

The first 3 jobs open and the datafeeds start.
The 4th job returns opened:false and the datafeed fails to start with the following:

open job        {"opened":false}
start datafeed  {"error":{"root_cause":[{"type":"status_exception","reason":"Could not start datafeed, allocation explanation []"}],"type":"status_exception","reason":"Could not ...

In the job list, the job state is opening and the datafeed state is stopped. No errors are visible.

When one of the first 3 jobs completes, one of the opening jobs transitions to opened. However, its datafeed remains stopped.

These are the job messages for a job that was lazy opening.
[screenshot of job messages omitted]

Expected behavior would be for the datafeed to be starting and for it to start once resource became available (which would happen when one of the other jobs closed, in this scenario).

Once jobs have completed, I can manually start the datafeed on one of the opened jobs and it will complete without on-screen errors. (I cannot start the datafeed for one of the still-opening jobs, which is to be expected.)

sophiec20 added the >bug and :ml Machine learning labels Mar 18, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (:ml)

droberts195 self-assigned this Mar 19, 2020
droberts195 (Contributor) commented Mar 20, 2020

> Expected behavior would be for the datafeed to be starting and for it to start once resource became available

There are two bits to this. Obviously it's a bug that the datafeed doesn't start once resource is available and that's relatively easy to fix.

Having the state be starting rather than stopped while it's waiting is a bigger problem. Although the code has a starting state, at the moment it's never used. These comments relate to that:

// TODO: report (task != null && task.getState() == null) as STARTING in version 8, and fix side effects

// The STARTING state is not used anywhere at present, so this should never happen.
// At present datafeeds that have a persistent task that hasn't yet been assigned
// a state are reported as STOPPED (which is not great). It could be considered a
// breaking change to introduce the STARTING state though, so let's aim to do it in
// version 8. Also consider treating STARTING like STARTED for stop API behaviour.

So:

  • The plan was to make starting an externally visible state in 8.0 to avoid screwing up any programmatic integration that doesn't expect that state in a minor version upgrade.
  • We could do this sooner if we think the effects of not having it are bad enough - stopped is certainly a confusing state for a datafeed that's waiting for node assignment.
  • The comment "Also consider treating STARTING like STARTED for stop API behaviour." alludes to further complications with making this change. We would need to do that if we make starting a first-class citizen; otherwise you wouldn't be able to stop a datafeed that was waiting for node assignment, which in turn would prevent you from closing the job, and we'd be back to the horrible deadlock situation we had 6 months ago. But the bigger problem is whether there would be any unforeseen side effects of introducing starting.
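
To make the confusion concrete, here is an illustrative stats request and response for a datafeed whose persistent task exists but has not yet been assigned a state (a sketch using the placeholder datafeed id from above, not output captured from this cluster):

```
GET _ml/datafeeds/datafeed-lazy-test-4/_stats

{
  "count": 1,
  "datafeeds": [
    {
      "datafeed_id": "datafeed-lazy-test-4",
      "state": "stopped"
    }
  ]
}
```

Even though the task is waiting for assignment, the unset task state falls through to stopped in the report.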

We should probably discuss offline whether introducing starting to the outside world in 7.7 would count as a breaking change or a non-breaking enhancement.

sophiec20 (Contributor, Author) commented:

I have no desire to introduce starting if this causes excessive implementation complications. We have a tight coupling in the job/datafeed relationship, so complexity in the multi-state relationship is also something we should avoid in our future roadmap. I hadn't thought that through enough when writing the expected behaviour paragraph above.

To clarify the scope and intention of this ticket: it is to ensure that jobs can lazy open. So, let's fix the bug.

droberts195 (Contributor) commented:

I think we might need to make starting externally visible in the datafeed stats. After fixing just the lazy start side, it confuses the UI if the externally visible state of the datafeed is stopped. The UI has no way to know that it should attempt to stop that datafeed.

I will make the change on Monday.
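
For contrast, once starting is externally visible, the same illustrative call would report the waiting datafeed like this (again a sketch with a placeholder id):

```
GET _ml/datafeeds/datafeed-lazy-test-4/_stats

{
  "datafeeds": [
    { "datafeed_id": "datafeed-lazy-test-4", "state": "starting" }
  ]
}
```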

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Mar 21, 2020

droberts195 added a commit that referenced this issue Mar 24, 2020
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true.  Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started.  The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Fixes #53763
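
As the commit message says, a datafeed in the starting state can be stopped with or without force; a sketch of both forms of the stop call, using the placeholder id from earlier:

```
POST _ml/datafeeds/datafeed-lazy-test-4/_stop

POST _ml/datafeeds/datafeed-lazy-test-4/_stop
{
  "force": true
}
```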