[ML] Writing Results Retries Continue After Analytics Job Stopped #53687

Closed
blaklaybul opened this issue Mar 17, 2020 · 2 comments · Fixed by #53725

blaklaybul commented Mar 17, 2020

This was found on a recent 7.7 build. I created a classification analysis that became stuck in the writing_results phase.

[2020-03-17T12:58:57,528][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [openml-kr-vs-kp-classifier-0] [data_frame_analyzer/8755] [CBoostedTreeImpl.cc@241] Training finished after 18 iterations. Time per iteration in ms mean: 1287.84 std. dev:  2697.19
[2020-03-17T12:58:57,626][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [reba.lan] [openml-kr-vs-kp-classifier-0] Started writing results
[2020-03-17T12:58:57,882][INFO ][o.e.c.m.MetaDataMappingService] [reba.lan] [openml-kr-vs-kp-classified-0/h_cT5mm3QSWfTv6b3ZBlTQ] update_mapping [_doc]
[2020-03-17T12:58:58,149][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [1] attempts. Will attempt again in [50ms].
[2020-03-17T12:58:58,361][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [2] attempts. Will attempt again in [75ms].
[2020-03-17T12:58:58,570][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [3] attempts. Will attempt again in [276ms].
...lots more retries here...
[2020-03-17T13:16:19,080][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [15] attempts. Will attempt again in [846433ms]

Stopping the job via the UI removed it from the jobs list, but the job remains in a stopping state. The retries continue even after the job is stopped:

[2020-03-17T13:44:24,643][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [reba.lan] [openml-kr-vs-kp-classifier-0] Stopping task with force [true]
[2020-03-17T13:44:24,668][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [reba.lan] [openml-kr-vs-kp-classifier-0] Stopping task with force [true]
[2020-03-17T13:44:24,669][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CDetachedProcessSpawner.cc@177] Child process with PID 8755 was terminated by signal 15
[2020-03-17T13:44:24,670][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CDetachedProcessSpawner.cc@99] Will not attempt to kill process 8755: not a child process
[2020-03-17T13:44:24,670][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CCommandProcessor.cc@96] Failed to kill process with PID 8755
[2020-03-17T13:45:13,300][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [17] attempts. Will attempt again in [884782ms]
.
.
.
[2020-03-17T14:14:48,971][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [19] attempts. Will attempt again in [850734ms]
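
One way to confirm from logs like these that retries kept going after the stop request is to extract the attempt count and delay from each warning and compare timestamps. Below is a minimal Python sketch of that check. The helper names `parse_retries` and `retries_after` are hypothetical, and the regular expression assumes the exact log layout quoted above.

```python
import re
from datetime import datetime

# Matches the ResultsPersisterService retry warnings in the layout quoted above.
# This pattern is an assumption based on those lines, not an official log format.
RETRY_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[WARN\s*\]\[o\.e\.x\.m\.u\.p\.ResultsPersisterService\]"
    r".*failed to index after \[(?P<attempts>\d+)\] attempts\. "
    r"Will attempt again in \[(?P<delay_ms>\d+)ms\]"
)


def parse_retries(lines):
    """Return a list of (timestamp, attempts, delay_ms) for each retry warning."""
    retries = []
    for line in lines:
        match = RETRY_LINE.search(line)
        if match is None:
            continue  # not a retry warning
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
        retries.append(
            (timestamp, int(match.group("attempts")), int(match.group("delay_ms")))
        )
    return retries


def retries_after(lines, stop_time):
    """Return only the retry warnings logged strictly after stop_time."""
    return [retry for retry in parse_retries(lines) if retry[0] > stop_time]
```

Passing the node log lines and the timestamp of the first "Stopping task with force [true]" entry to `retries_after` returns the retries that happened after the stop request; a non-empty result reproduces the behavior described here.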

job config:

{
  "id" : "openml-kr-vs-kp-classifier-0",
  "source" : {
    "index" : [
      "openml-kr-vs-kp"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "openml-kr-vs-kp-classified-0",
    "results_field" : "ml"
  },
  "analysis" : {
    "classification" : {
      "dependent_variable" : "class",
      "class_assignment_objective" : "maximize_accuracy",
      "num_top_classes" : 2,
      "prediction_field_name" : "class_prediction",
      "training_percent" : 90.0,
      "randomize_seed" : 7077816937788972687
    }
  },
  "model_memory_limit" : "512mb",
  "create_time" : 1584464253331,
  "version" : "7.7.0",
  "allow_lazy_start" : false
}
@blaklaybul blaklaybul added >bug :ml Machine learning labels Mar 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

blaklaybul changed the title from "[ML] Writing Results Retries Continues After Analytics Stopped" to "[ML] Writing Results Retries Continue After Analytics Job Stopped" Mar 17, 2020
@benwtrent benwtrent self-assigned this Mar 17, 2020
@benwtrent
Member

I think this shows two problems:

  • Stopping the job does not cancel the pending retries.
  • We are retrying on ALL failures, but certain failures will probably not be recoverable (e.g. mapping failures).
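
To illustrate the second point, here is a minimal Python sketch of separating failures that are worth retrying from those that are not, keyed on the HTTP status of the failed request. The function name `is_retryable_status` and the set of statuses are hypothetical; the list Elasticsearch actually uses is not given in this issue. The sketch assumes that a mapping failure surfaces as a 400 response.

```python
from http import HTTPStatus

# Hypothetical set of statuses treated as permanent failures. Retrying these
# cannot succeed unless the request or the cluster configuration changes.
NON_RETRYABLE_STATUSES = frozenset(
    {
        HTTPStatus.BAD_REQUEST,  # assumed to cover a doc that conflicts with the mapping
        HTTPStatus.UNAUTHORIZED,
        HTTPStatus.FORBIDDEN,
        HTTPStatus.NOT_FOUND,
    }
)


def is_retryable_status(status: int) -> bool:
    """Return True when a failure with this HTTP status may be transient."""
    return status not in NON_RETRYABLE_STATUSES
```

With a check like this, a rejection such as 429 or an unavailable shard such as 503 would still be retried, while a 400 would fail immediately instead of backing off for minutes at a time.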

benwtrent added a commit that referenced this issue Mar 19, 2020
… and stop retrying when analytics job is stopping (#53725)

This fixes two issues:

- The results persister would retry actions even when the failure was not intermittent. An example of a persistent failure is a doc mapping problem.
- Data frame analytics would continue retrying to persist results even after the job was stopped.
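
The two behaviors described in this commit message can be sketched as a single retry loop. The following Python is an illustration under stated assumptions, not the code from the referenced commit: `persist_with_retry`, `should_retry`, and the delay parameters are hypothetical names and values. The loop re-raises a failure that `should_retry` rejects, and it waits on a `threading.Event` rather than sleeping so that a stop request ends a long backoff early.

```python
import threading


def persist_with_retry(action, should_retry, stop_event, max_attempts=20,
                       base_delay_s=0.05, max_delay_s=900.0):
    """Run action() until it succeeds, fails permanently, or a stop is requested.

    Returns True on success and False if stop_event was set first.
    Re-raises the failure when should_retry(exc) is False or attempts run out.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        if stop_event.is_set():
            return False  # the job is stopping: do not start another attempt
        try:
            action()
            return True
        except Exception as exc:
            # Permanent failures (or the final attempt) are surfaced immediately.
            if not should_retry(exc) or attempt == max_attempts:
                raise
        # Event.wait returns True as soon as the event is set, so a stop
        # request interrupts even a very long backoff.
        if stop_event.wait(timeout=delay):
            return False
        delay = min(delay * 2, max_delay_s)
    return False  # only reached when max_attempts is less than 1
```

A caller would set `stop_event` from the stop handler of the job. Because the wait is interruptible, a backoff of several hundred seconds, like the ones in the logs above, would no longer outlive the job.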

closes #53687
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Mar 19, 2020
… and stop retrying when analytics job is stopping (elastic#53725)

closes elastic#53687
benwtrent added a commit that referenced this issue Mar 19, 2020
…ittent and stop retrying when analytics job is stopping (#53725) (#53808)

* [ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping (#53725)

closes #53687