[ML] Allow asynchronous job deletion #34058
Conversation
Pinging @elastic/ml-core
I labelled this with "WIP" only because I have a failing test that I need to resolve. The code is pretty stable, so it can be reviewed. Also, could @imotov please have a look at the usage of the tasks framework?
I have now resolved the test failure and I've tested in a cluster with multiple nodes. Ready for review!
```diff
@@ -287,7 +287,7 @@ public Builder deleteJob(String jobId, PersistentTasksCustomMetaData tasks) {
         if (job == null) {
             throw new ResourceNotFoundException("job [" + jobId + "] does not exist");
         }
-        if (job.isDeleted() == false) {
+        if (job.isDeleting() == false) {
```
So, what happens when `deleteJob` gets called twice in a row while a job is in the `deleting` status? On the second call, would the error not be `ResourceNotFoundException("job [" + jobId + "] does not exist")`? Or is the `private final SortedMap<String, Job> jobs;` repopulated somewhere?
We now protect against subsequent deletes using the task framework. You can see the implementation in `TransportDeleteJobAction.masterOperation`.
👍
Left some task manager-related comments.
```diff
@@ -76,7 +76,7 @@ public TaskResult(TaskInfo task, Exception error) throws IOException {
      * Construct a {@linkplain TaskResult} for a task that completed successfully.
      */
     public TaskResult(TaskInfo task, ToXContent response) throws IOException {
-        this(true, task, null, toXContent(response));
+        this(true, task, null, XContentHelper.toXContent(response, Requests.INDEX_CONTENT_TYPE, true));
```
I don't think we store it in the human readable format at the moment. Do we need to make it human readable?
Good point. I went with human readable as I think these responses will be consumed by users, so it might be best to store the human readable form (if any). But I don't feel strongly about it. Happy to do as you suggest.
I am ok with human readable; my only concern is that it is somewhat of a breaking change, since it changes the format (by adding new fields). @nik9000 what do you think about this change?
I believe this only changes the stuff in the `task` part, which should be fine because we don't map that anyway. And we kind of expect users to be ok with new fields in the HTTP responses, and this is fairly similar conceptually. So I'm +1 on it.
```java
    throw new UnsupportedOperationException("the Task parameter is required");

    private Optional<Task> findExistingStartedTask(Task currentTask) {
        return taskManager.getTasks().values().stream().filter(filteredTask ->
            currentTask.getDescription().equals(filteredTask.getDescription()) &&
```
This comparison seems a bit fragile. If we want to do it with the task manager (see my other comment above), I would suggest adding a filter for the action name; then, if the action matches JobDeletion, we can cast it to `JobDeletionTask` and get the job id from the task instead of the description.
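The suggestion above can be sketched roughly as follows. This is an illustrative standalone example, not the Elasticsearch code: `SimpleTask` and the action string are stand-ins for the real `Task`/`JobDeletionTask` classes, but the shape of the stream pipeline (filter by concrete task type rather than comparing descriptions, then compare the structured job id) is the point being made.

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Stand-in for the task base class; the real one carries much more state.
class SimpleTask {
    private final String action;
    SimpleTask(String action) { this.action = action; }
    String getAction() { return action; }
}

// Stand-in for JobDeletionTask: exposes the job id as structured data.
class JobDeletionTask extends SimpleTask {
    private final String jobId;
    JobDeletionTask(String jobId) {
        super("cluster:admin/xpack/ml/job/delete"); // hypothetical action name
        this.jobId = jobId;
    }
    String getJobId() { return jobId; }
}

public class FindTask {
    // Filter by task type/action, cast, then compare job ids directly
    // instead of comparing free-form task descriptions.
    static Optional<JobDeletionTask> findExistingDeletion(Collection<SimpleTask> tasks, String jobId) {
        return tasks.stream()
            .filter(t -> t instanceof JobDeletionTask)
            .map(t -> (JobDeletionTask) t)
            .filter(t -> jobId.equals(t.getJobId()))
            .findFirst();
    }

    public static void main(String[] args) {
        List<SimpleTask> tasks = List.of(new SimpleTask("other"), new JobDeletionTask("job-1"));
        System.out.println(findExistingDeletion(tasks, "job-1").isPresent());
        System.out.println(findExistingDeletion(tasks, "job-2").isPresent());
    }
}
```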
```java
    ParentTaskAssigningClient parentTaskClient = new ParentTaskAssigningClient(client, taskId);

    // Check if there is a deletion task for this job already and if yes wait for it to complete
    Optional<Task> existingStartedTask;
```
I think with this change we no longer have the race condition where two tasks could end up waiting for each other, as we had before, but it is still very complicated (especially the waitForExistingTaskToComplete part). What would you think about having a static map of arrays of listeners here? On start, in a critical section, you check if there is already an element with the same job ID present in the map; if not, you add your job ID with an empty array and start the delete process; if the job ID exists, you just add yourself to the array of listeners. On completion, again in a critical section, you remove the array with your job ID from the map and execute all listeners that were registered there.
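A minimal sketch of that listener-map idea, detached from the Elasticsearch code (the class and method names here are invented for illustration, and `Consumer<String>` stands in for the real `ActionListener`): the first caller for a job ID starts the delete; later callers for the same ID only register a listener, and everyone is notified once when the delete completes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the suggested per-job-ID listener registry.
class DeletionRegistry {
    private final Map<String, List<Consumer<String>>> listenersByJobId = new HashMap<>();

    /** Returns true if the caller should start the delete; false if one is already in flight. */
    synchronized boolean register(String jobId, Consumer<String> listener) {
        List<Consumer<String>> existing = listenersByJobId.get(jobId);
        if (existing == null) {
            // First request for this job: record the listener and tell the caller to start deleting.
            List<Consumer<String>> fresh = new ArrayList<>();
            fresh.add(listener);
            listenersByJobId.put(jobId, fresh);
            return true;
        }
        // A delete is already running: just wait for its result.
        existing.add(listener);
        return false;
    }

    /** Called when the delete finishes; notifies every waiting listener exactly once. */
    void onComplete(String jobId, String result) {
        List<Consumer<String>> toNotify;
        synchronized (this) {
            toNotify = listenersByJobId.remove(jobId);
        }
        if (toNotify != null) {
            toNotify.forEach(l -> l.accept(result));
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        DeletionRegistry registry = new DeletionRegistry();
        List<String> notified = new ArrayList<>();
        boolean first = registry.register("job-1", r -> notified.add("a:" + r));
        boolean second = registry.register("job-1", r -> notified.add("b:" + r));
        registry.onComplete("job-1", "deleted");
        System.out.println(first + " " + second + " " + notified);
    }
}
```

Note that the listeners are invoked outside the critical section, so a slow listener cannot block new registrations for other jobs.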
Happy to try that! It sounds less complex indeed. Why do you suggest the listener-map should be static? Given the action is a singleton object, it could just be a final member, right?
> Why do you suggest the listener-map should be static?

Well, not enough coffee, I guess. Yes, it should be just a member of the action, since the action is a singleton.
I have pushed a commit to refactor using listeners. It did simplify things a lot. Could you have another look please?
Looks good. I just saw one thing that needs to be confirmed to ensure the rename won't cause problems.
```diff
@@ -559,6 +573,11 @@ public Builder setResultsIndexName(String resultsIndexName) {
             return this;
         }

+        public Builder setDeleting(Boolean deleting) {
```
I wonder if this should not be `public`, because is there ever a case when the end user would sensibly change the value of this?
Good point. I'll reduce its visibility.
```diff
@@ -77,7 +77,7 @@
     public static final ParseField MODEL_SNAPSHOT_ID = new ParseField("model_snapshot_id");
     public static final ParseField MODEL_SNAPSHOT_MIN_VERSION = new ParseField("model_snapshot_min_version");
     public static final ParseField RESULTS_INDEX_NAME = new ParseField("results_index_name");
-    public static final ParseField DELETED = new ParseField("deleted");
+    public static final ParseField DELETING = new ParseField("deleting");
```
Please can you double check that we have lenient parsing of job configs from cluster state in 6.0.
The reason is that a 6.0 node may end up parsing a 6.5 job that is being deleted if a full cluster restart of a mixed version cluster happens for some reason. Assuming we have lenient parsing for stored job configs in 6.0 this should work as before - the fact that the job was being deleted when the cluster was shut down will be forgotten, but it can be deleted again - same as it always was in 6.0.
I checked and there is lenient parsing in 6.0, so we're good on this front.
LGTM
This changes the delete job API by adding the choice to delete a job asynchronously. The commit adds a `wait_for_completion` parameter to the delete job request. When set to `false`, the action returns immediately and the response contains the task id. This also changes the handling of subsequent delete requests for a job that is already being deleted. It now uses the task framework, instead of the cluster state, to check whether the job is being deleted. This is beneficial because it will keep working once the job configs are moved out of the cluster state and into an index. Also, force-delete requests that are waiting for the job to be deleted will not proceed with the deletion if the first task fails; this prevents overloading the cluster. Instead, the failure is communicated via notifications so that the user may retry. Finally, this makes the `deleting` property of the job visible (it was also renamed from `deleted`). This allows a client to render a deleting job differently. Closes elastic#32836
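For reference, the new parameter might be exercised like this (a sketch only; the exact 6.x endpoint path and the response shape should be confirmed against the ML API docs of the target version):

```
DELETE _xpack/ml/anomaly_detectors/my-job?wait_for_completion=false
```

With `wait_for_completion=false`, the response returns immediately and is expected to carry the id of the deletion task, which can then be polled via the tasks API.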
Force-pushed from efcbf1e to 2cfddfc.
@imotov has given me the green light on this as well!
* master:
  - Rename CCR stats implementation (elastic#34300)
  - Add max_children limit to nested sort (elastic#33587)
  - MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  - Rename clsuterformation -> testclusters (elastic#34299)
  - [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  - BWC Build: Read CI properties to determine java version (elastic#34295)
  - [DOCS] Fix typo and add [float]
  - Allow User/Password realms to disable authc (elastic#34033)
  - Enable security automaton caching (elastic#34028)
  - Preserve thread context during authentication. (elastic#34290)
  - [ML] Allow asynchronous job deletion (elastic#34058)
* master: (63 commits)
  - [Build] randomizedtesting: Allow property values to be closures (elastic#34319)
  - Feature/hlrc ml docs cleanup (elastic#34316)
  - Docs: DRY up CRUD docs (elastic#34203)
  - Minor corrections in geo-queries.asciidoc (elastic#34314)
  - [DOCS] Remove beta label from normalizers (elastic#34326)
  - Adjust size of BigArrays in circuit breaker test
  - Adapt bwc version after backport
  - Follow stats structure (elastic#34301)
  - Rename CCR stats implementation (elastic#34300)
  - Add max_children limit to nested sort (elastic#33587)
  - MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  - Rename clsuterformation -> testclusters (elastic#34299)
  - [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  - BWC Build: Read CI properties to determine java version (elastic#34295)
  - [DOCS] Fix typo and add [float]
  - Allow User/Password realms to disable authc (elastic#34033)
  - Enable security automaton caching (elastic#34028)
  - Preserve thread context during authentication. (elastic#34290)
  - [ML] Allow asynchronous job deletion (elastic#34058)
  - HLRC: ML Adding get datafeed stats API (elastic#34271)
  - ...