
[Rollup] Job deletion should be invoked on the allocated task #34574

Merged

Conversation

polyfractal
Contributor

We should delete a job by talking directly to the allocated task and telling it to shut down. Today we shut down a job via the persistent task framework. This is not ideal because, while the job has been removed from the persistent task cluster state (CS), the allocated task continues to live until it receives the shutdown message.

This means a user can delete a job, immediately delete the rollup index, and then see new documents appear in the just-deleted index. This happens because the indexer in the allocated task is still running and indexes a few more documents before receiving the shutdown command.

In this PR, the transport action is changed to a TransportTasksAction, and we invoke onCancelled() directly on the matching job. The race condition still exists after this PR (albeit with a smaller window), but this was a precursor to fixing the issue and a self-contained chunk of code. I'll follow up with a second PR to fix the race itself.

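A minimal sketch of the direct-cancellation idea described above (class, enum, and method names here are illustrative stand-ins, not the actual Rollup code): invoking onCancelled() on the allocated task moves it to an aborting state, and a repeated cancellation is harmless.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of cancelling the allocated task directly instead of waiting for
// the persistent-task framework to deliver a shutdown message.
// All names are hypothetical, not the actual Rollup classes.
class AllocatedRollupTask {
    enum State { STARTED, INDEXING, ABORTING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTED);

    // Invoked directly by the delete transport action on the matching task.
    void onCancelled() {
        if (state.compareAndSet(State.STARTED, State.ABORTING)) {
            shutdown(); // indexer was idle: shut down immediately
        } else {
            // INDEXING -> ABORTING: the indexer calls shutdown() once its
            // in-flight bulk finishes. Already ABORTING or STOPPED: a
            // second delete request is a no-op.
            state.compareAndSet(State.INDEXING, State.ABORTING);
        }
    }

    void shutdown() { state.set(State.STOPPED); }

    State state() { return state.get(); }
}
```

Because the only transitions into ABORTING go through compareAndSet, a second delete request cannot trigger shutdown() a second time.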
@polyfractal polyfractal added the v7.0.0, >refactoring, :StorageEngine/Rollup, and v6.5.0 labels Oct 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

Contributor

@colings86 colings86 left a comment

@polyfractal I left a comment, but I also wonder whether it's worth asking someone more familiar with the persistent tasks framework to look at this as well?

// If there are more, in production it should be ok as long as they acknowledge shutting down.
// But in testing we'd like to know if there was more than one, hence the assert
assert tasks.size() == 1;
boolean cancelled = tasks.stream().anyMatch(DeleteRollupJobAction.Response::isDeleted);
Contributor

Given the comment above should this not look for all tasks in the stream being deleted rather than just one?
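The reviewer's suggestion could be sketched as follows (Response here is a stand-in for DeleteRollupJobAction.Response, not the real class): require every task in the response to report deletion with allMatch, rather than any single one with anyMatch.

```java
import java.util.List;

// Sketch of the suggested check: all tasks must report deletion.
// Response is a hypothetical stand-in for DeleteRollupJobAction.Response.
class DeleteCheck {
    record Response(boolean isDeleted) {}

    static boolean allCancelled(List<Response> tasks) {
        // allMatch is vacuously true on an empty stream, so guard against it.
        return tasks.isEmpty() == false && tasks.stream().allMatch(Response::isDeleted);
    }
}
```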

@polyfractal
Contributor Author

Oops, yes, I meant to ping @martijnvg as well but forgot. Mvg would you mind taking a look at the persistent-task changes here too?

To simplify the synchronization, we decided that only a STOPPED
job can be deleted.  This commit enforces that check at the task level.

The rest of the changes are just to facilitate percolating this
information back to the user:

- Common XContent output for failed task/node is moved out of
  ListTasksResponse and into the superclass for general use
- The Rest layer is adjusted to throw a 5xx error if there are failures
- Tweak to the acknowledged logic to account for task failures

Also updated docs and tests
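The interplay between acknowledgement and failures described in the commit message might look roughly like this (an illustrative sketch only; the class and method names are hypothetical, not the actual Rest handler):

```java
// Sketch of the described behaviour: if the tasks response carries any
// node- or task-level failures, the REST layer surfaces a 5xx instead of
// reporting a clean acknowledgement. Names are hypothetical.
class DeleteJobRestStatus {
    static int httpStatus(boolean acknowledged, int taskFailures, int nodeFailures) {
        if (taskFailures > 0 || nodeFailures > 0) {
            return 500; // failures percolate back to the user as a server error
        }
        return acknowledged ? 200 : 500;
    }
}
```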
@polyfractal
Contributor Author

Chatted with @jimczi this morning, and we came up with a slightly different plan to help keep the synchronization simple:

  1. DeleteJob talks to the allocated task directly, but only allows deletion of STOPPED jobs
  2. StopJob is modified to include a wait_for_stopped flag which blocks the call until the indexer actually stops, instead of setting the flag and returning
  3. @jimczi is working on a state refactor which should simplify a lot of the synchronization complexity we have right now. This can proceed at its leisure because 1) and 2) fix the immediate issue.

I just pushed a commit to accomplish 1), and 2) will be done in a followup PR.

The workflow is a little more irritating now (a user must stop a job and then delete it). But it felt more natural for a "stop" operation to block than a "delete" operation, and it keeps life simple internally.
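Step 2) above, the blocking stop, could be sketched like this (a simplified, hypothetical class; wait_for_stopped is the planned flag, and the real implementation lives in the allocated task, not a standalone object):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the planned wait_for_stopped semantics: StopJob sets the
// stopping flag, then optionally blocks until the indexer has actually
// stopped, so a following DeleteJob cannot race with in-flight bulks.
class StoppableIndexer {
    private volatile boolean stopping = false;
    private final CountDownLatch stopped = new CountDownLatch(1);

    // Called by the stop action.
    boolean stop(boolean waitForStopped, long timeoutMillis) {
        stopping = true;
        if (waitForStopped == false) {
            return true; // previous behaviour: set the flag and return immediately
        }
        try {
            return stopped.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Called by the indexer thread when it finishes its current bulk.
    void onIndexerFinished() {
        if (stopping) {
            stopped.countDown();
        }
    }
}
```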

final DiscoveryNodes nodes = state.nodes();

if (nodes.isLocalNodeElectedMaster()) {
    PersistentTasksCustomMetaData pTasksMeta = state.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
Member

This is better than checking the cluster state on a non-elected master node, but it is still possible that this cluster state is stale if, for example, multiple delete job requests for the same job id are executed concurrently. I think reading the cluster state from the cluster state update thread would fix this properly (ClusterService#submitStateUpdateTask(...) without updating the cluster state). Maybe this approach is not worth it, considering that invoking onCancelled() twice has no negative impact.

Contributor Author

Ah, that's good to know for future reference.

I think we're ok here because, as you said, cancelling twice is fine. onCancelled() will set the state to ABORTING... if the indexer was idle during the first request, shutdown() will be called directly. If not, it will defer until the indexer finishes.

And in both cases, the second request will leave the state at ABORTING and won't call shutdown().

I think we'd still have stale CS issues anyhow, since we shut the task down first and then alert the master. The task could shut down and send the markAsCompleted() request, and a second delete could arrive at the master, check the state, and see the task as still alive before the markAsCompleted() shows up.

Since the rollup cleanup code was moved to ESRestTestCase,
this code is redundant (and throws errors since it is out of date)
@polyfractal
Contributor Author

@jimczi @colings86 would you mind taking another look with the current changes?

I'll kick off a CI run once GitHub/CI come back to life.

Contributor

@jimczi jimczi left a comment

I left some minor comments, LGTM otherwise

if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
DeleteRollupJobAction.Response response = (DeleteRollupJobAction.Response) o;
return acknowledged == response.acknowledged;
Contributor

nit: shouldn't we do super.equals first (and have an equals impl in BaseTasksResponse)?

Contributor Author

Good catch! Thanks


@Override
public int hashCode() {
    return Objects.hash(acknowledged);
Contributor

Same here?

@polyfractal polyfractal merged commit 4dbf498 into elastic:master Oct 23, 2018
polyfractal added a commit that referenced this pull request Oct 23, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 23, 2018
* master: (24 commits)
  ingest: better support for conditionals with simulate?verbose (elastic#34155)
  [Rollup] Job deletion should be invoked on the allocated task (elastic#34574)
  [DOCS] .Security index is never auto created (elastic#34589)
  CCR: Requires soft-deletes on the follower (elastic#34725)
  re-enable bwc tests (elastic#34743)
  Empty GetAliases authorization fix (elastic#34444)
  INGEST: Document Processor Conditional (elastic#33388)
  [CCR] Add total fetch time leader stat (elastic#34577)
  SQL: Support pattern against compatible indices (elastic#34718)
  [CCR] Auto follow pattern APIs adjustments (elastic#34518)
  [Test] Remove dead code from ExceptionSerializationTests (elastic#34713)
  A small typo in migration-assistance doc (elastic#34704)
  ingest: processor stats (elastic#34724)
  SQL: Implement IN(value1, value2, ...) expression. (elastic#34581)
  Tests: Add checks to GeoDistanceQueryBuilderTests (elastic#34273)
  INGEST: Rename Pipeline Processor Param. (elastic#34733)
  Core: Move IndexNameExpressionResolver to java time (elastic#34507)
  [DOCS] Force Merge: clarify execution and storage requirements (elastic#33882)
  TESTING.asciidoc fix examples using forbidden annotation (elastic#34515)
  SQL: Implement `CONVERT`, an alternative to `CAST` (elastic#34660)
  ...
kcm pushed a commit that referenced this pull request Oct 30, 2018
@colings86 colings86 removed the v7.0.0 label Feb 7, 2019