[ML] Allow asynchronous job deletion #34058
Conversation
Pinging @elastic/ml-core
I labelled this with "WIP" only because I have a failing test that I need to resolve. The code is pretty stable, so it can be reviewed. Also, could @imotov please have a look at the usage of the tasks framework?
I have now resolved the test failure and I've tested in a cluster with multiple nodes. Ready for review!
```diff
@@ -287,7 +287,7 @@ public Builder deleteJob(String jobId, PersistentTasksCustomMetaData tasks) {
         if (job == null) {
             throw new ResourceNotFoundException("job [" + jobId + "] does not exist");
         }
-        if (job.isDeleted() == false) {
+        if (job.isDeleting() == false) {
```
So, what happens when `deleteJob` gets called twice in a row while a job is in the `deleting` status? On the second call, would the error not be `ResourceNotFoundException("job [" + jobId + "] does not exist")`? Or is the `private final SortedMap<String, Job> jobs;` repopulated somewhere?
We now protect against subsequent deletes using the task framework. You can see the implementation in `TransportDeleteJobAction.masterOperation`.
👍
Left some task manager-related comments.
```diff
@@ -76,7 +76,7 @@ public TaskResult(TaskInfo task, Exception error) throws IOException {
      * Construct a {@linkplain TaskResult} for a task that completed successfully.
      */
     public TaskResult(TaskInfo task, ToXContent response) throws IOException {
-        this(true, task, null, toXContent(response));
+        this(true, task, null, XContentHelper.toXContent(response, Requests.INDEX_CONTENT_TYPE, true));
```
I don't think we store it in the human readable format at the moment. Do we need to make it human readable?
Good point. I went with human readable as I think these responses will be consumed by users, so it might be best to store the human readable form (if any). But I don't feel strongly about it. Happy to do as you suggest.
I am ok with human readable; my only concern is that it is somewhat of a breaking change, since it changes the format (by adding new fields). @nik9000 what do you think about this change?
I believe this only changes the stuff in the `task` part, which should be fine because we don't map that anyway. And we kind of expect users to be ok with new fields in the HTTP responses, and this is fairly similar conceptually. So I'm +1 on it.
```java
    throw new UnsupportedOperationException("the Task parameter is required");

    private Optional<Task> findExistingStartedTask(Task currentTask) {
        return taskManager.getTasks().values().stream().filter(filteredTask ->
            currentTask.getDescription().equals(filteredTask.getDescription()) &&
```
This comparison seems a bit fragile. If we want to do it with the task manager (see my other comment above), I would suggest adding a filter for the action name; then, if the action matches JobDeletion, we can cast it to `JobDeletionTask` and get the job id from the task instead of the description.
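The suggestion above can be sketched roughly as follows. This is an illustrative standalone example, not the Elasticsearch code: `SimpleTask` and the action string are stand-ins for the real `Task`/`JobDeletionTask` classes, but the shape of the stream pipeline (filter by concrete task type rather than comparing descriptions, then compare the structured job id) is the point being made.

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Stand-in for the task base class; the real one carries much more state.
class SimpleTask {
    private final String action;
    SimpleTask(String action) { this.action = action; }
    String getAction() { return action; }
}

// Stand-in for JobDeletionTask: exposes the job id as structured data.
class JobDeletionTask extends SimpleTask {
    private final String jobId;
    JobDeletionTask(String jobId) {
        super("cluster:admin/xpack/ml/job/delete"); // hypothetical action name
        this.jobId = jobId;
    }
    String getJobId() { return jobId; }
}

public class FindTask {
    // Filter by task type/action, cast, then compare job ids directly
    // instead of comparing free-form task descriptions.
    static Optional<JobDeletionTask> findExistingDeletion(Collection<SimpleTask> tasks, String jobId) {
        return tasks.stream()
            .filter(t -> t instanceof JobDeletionTask)
            .map(t -> (JobDeletionTask) t)
            .filter(t -> jobId.equals(t.getJobId()))
            .findFirst();
    }

    public static void main(String[] args) {
        List<SimpleTask> tasks = List.of(new SimpleTask("other"), new JobDeletionTask("job-1"));
        System.out.println(findExistingDeletion(tasks, "job-1").isPresent());
        System.out.println(findExistingDeletion(tasks, "job-2").isPresent());
    }
}
```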
```java
    ParentTaskAssigningClient parentTaskClient = new ParentTaskAssigningClient(client, taskId);

    // Check if there is a deletion task for this job already and if yes wait for it to complete
    Optional<Task> existingStartedTask;
```
I think with this change we no longer have the race condition where two tasks could end up waiting for each other, as we had before, but it is still very complicated (especially the waitForExistingTaskToComplete part). What would you think about having a static map of arrays of listeners here? On start, in a critical section, you check if there is already an element with the same job ID present in the map; if not, you add your job ID with an empty array and start the delete process; if the job ID exists, you just add yourself to the array of listeners. On completion, again in a critical section, you remove the array with your job ID from the map and execute all listeners that were registered there.
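A minimal sketch of that listener-map idea, detached from the Elasticsearch code (the class and method names here are invented for illustration, and `Consumer<String>` stands in for the real `ActionListener`): the first caller for a job ID starts the delete; later callers for the same ID only register a listener, and everyone is notified once when the delete completes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the suggested per-job-ID listener registry.
class DeletionRegistry {
    private final Map<String, List<Consumer<String>>> listenersByJobId = new HashMap<>();

    /** Returns true if the caller should start the delete; false if one is already in flight. */
    synchronized boolean register(String jobId, Consumer<String> listener) {
        List<Consumer<String>> existing = listenersByJobId.get(jobId);
        if (existing == null) {
            // First request for this job: record the listener and tell the caller to start deleting.
            List<Consumer<String>> fresh = new ArrayList<>();
            fresh.add(listener);
            listenersByJobId.put(jobId, fresh);
            return true;
        }
        // A delete is already running: just wait for its result.
        existing.add(listener);
        return false;
    }

    /** Called when the delete finishes; notifies every waiting listener exactly once. */
    void onComplete(String jobId, String result) {
        List<Consumer<String>> toNotify;
        synchronized (this) {
            toNotify = listenersByJobId.remove(jobId);
        }
        if (toNotify != null) {
            toNotify.forEach(l -> l.accept(result));
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        DeletionRegistry registry = new DeletionRegistry();
        List<String> notified = new ArrayList<>();
        boolean first = registry.register("job-1", r -> notified.add("a:" + r));
        boolean second = registry.register("job-1", r -> notified.add("b:" + r));
        registry.onComplete("job-1", "deleted");
        System.out.println(first + " " + second + " " + notified);
    }
}
```

Note that the listeners are invoked outside the critical section, so a slow listener cannot block new registrations for other jobs.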
Happy to try that! It sounds less complex indeed. Why do you suggest the listener-map should be static? Given the action is a singleton object, it could just be a final member, right?
> Why do you suggest the listener-map should be static?

Well, not enough coffee, I guess. Yes, it should be just a member of the action, since the action is a singleton.
I have pushed a commit to refactor using listeners. It did simplify things a lot. Could you have another look please?
Looks good. I just saw one thing that needs to be confirmed to ensure the rename won't cause problems.
```diff
@@ -559,6 +573,11 @@ public Builder setResultsIndexName(String resultsIndexName) {
             return this;
         }

+        public Builder setDeleting(Boolean deleting) {
```
I wonder if this should not be `public`, because is there ever a case when the end user would sensibly change the value of this?
Good point. I'll reduce its visibility.
```diff
@@ -77,7 +77,7 @@
     public static final ParseField MODEL_SNAPSHOT_ID = new ParseField("model_snapshot_id");
     public static final ParseField MODEL_SNAPSHOT_MIN_VERSION = new ParseField("model_snapshot_min_version");
     public static final ParseField RESULTS_INDEX_NAME = new ParseField("results_index_name");
-    public static final ParseField DELETED = new ParseField("deleted");
+    public static final ParseField DELETING = new ParseField("deleting");
```
Please can you double check that we have lenient parsing of job configs from cluster state in 6.0.
The reason is that a 6.0 node may end up parsing a 6.5 job that is being deleted if a full cluster restart of a mixed version cluster happens for some reason. Assuming we have lenient parsing for stored job configs in 6.0 this should work as before - the fact that the job was being deleted when the cluster was shut down will be forgotten, but it can be deleted again - same as it always was in 6.0.
I checked and there is lenient parsing in 6.0, so we're good on this front.
LGTM
This changes the delete job API by adding the choice to delete a job asynchronously. The commit adds a `wait_for_completion` parameter to the delete job request. When set to `false`, the action returns immediately and the response contains the task id. This also changes the handling of subsequent delete requests for a job that is already being deleted. It now uses the task framework, instead of the cluster state, to check whether the job is being deleted. This is beneficial because it will keep working once the job configs are moved out of the cluster state and into an index. Also, force-delete requests that are waiting for the job to be deleted will not proceed with the deletion if the first task fails; this prevents overloading the cluster. Instead, the failure is communicated via notifications so that the user may retry. Finally, this makes the `deleting` property of the job visible (it was also renamed from `deleted`). This allows a client to render a deleting job differently. Closes elastic#32836
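For reference, the new parameter might be exercised like this (a sketch only; the exact 6.x endpoint path and the response shape should be confirmed against the ML API docs of the target version):

```
DELETE _xpack/ml/anomaly_detectors/my-job?wait_for_completion=false
```

With `wait_for_completion=false`, the response returns immediately and is expected to carry the id of the deletion task, which can then be polled via the tasks API.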
Force-pushed from efcbf1e to 2cfddfc.
@imotov has given me the green light on this as well!
* master:
  - Rename CCR stats implementation (elastic#34300)
  - Add max_children limit to nested sort (elastic#33587)
  - MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  - Rename clsuterformation -> testclusters (elastic#34299)
  - [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  - BWC Build: Read CI properties to determine java version (elastic#34295)
  - [DOCS] Fix typo and add [float]
  - Allow User/Password realms to disable authc (elastic#34033)
  - Enable security automaton caching (elastic#34028)
  - Preserve thread context during authentication. (elastic#34290)
  - [ML] Allow asynchronous job deletion (elastic#34058)
* master: (63 commits)
  - [Build] randomizedtesting: Allow property values to be closures (elastic#34319)
  - Feature/hlrc ml docs cleanup (elastic#34316)
  - Docs: DRY up CRUD docs (elastic#34203)
  - Minor corrections in geo-queries.asciidoc (elastic#34314)
  - [DOCS] Remove beta label from normalizers (elastic#34326)
  - Adjust size of BigArrays in circuit breaker test
  - Adapt bwc version after backport
  - Follow stats structure (elastic#34301)
  - Rename CCR stats implementation (elastic#34300)
  - Add max_children limit to nested sort (elastic#33587)
  - MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  - Rename clsuterformation -> testclusters (elastic#34299)
  - [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  - BWC Build: Read CI properties to determine java version (elastic#34295)
  - [DOCS] Fix typo and add [float]
  - Allow User/Password realms to disable authc (elastic#34033)
  - Enable security automaton caching (elastic#34028)
  - Preserve thread context during authentication. (elastic#34290)
  - [ML] Allow asynchronous job deletion (elastic#34058)
  - HLRC: ML Adding get datafeed stats API (elastic#34271)
  - ...