Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult #108094

Merged
merged 17 commits into from
May 29, 2024

Conversation

arteam
Copy link
Contributor

@arteam arteam commented Apr 30, 2024

It seems that the failure (the missed index) has always existed in the test scenario and it's supposed to be handled by TransportGetTaskAction.java. We catch IndexNotFoundException here and convert it to ResourceNotFoundException. Then we catch ResourceNotFoundException here and return a snapshot of a task as a response.

In the stack trace, getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion due to a race between creating a get request and unblocking request which are sent asynchronously. I've changed the waitForCompletionTestCase test method to unblock the task only after the request started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch when task is running.

The part about the missed index seems to irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of running task which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.

Resolves #107823

Make sure the `.tasks` index is created before we starting testing task completion
without storing its result. To achieve that, we store a fake task before we start
`waitForCompletionTestCase`.

Resolves #107823
@arteam arteam added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.15.0 labels Apr 30, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@arteam
Copy link
Contributor Author

arteam commented May 2, 2024

@elasticmachine update branch

@arteam arteam requested review from idegtiarenko, volodk85 and DaveCTurner and removed request for idegtiarenko and DaveCTurner May 2, 2024 07:26
@arteam
Copy link
Contributor Author

arteam commented May 7, 2024

@elasticmachine update branch

@arteam arteam requested review from idegtiarenko, DaveCTurner, volodk85 and a team and removed request for idegtiarenko, volodk85 and DaveCTurner May 8, 2024 07:56
@henningandersen
Copy link
Contributor

The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead?

@arteam
Copy link
Contributor Author

arteam commented May 13, 2024

@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the a completion of a task and the tasks only completes, because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling doesn't can't figure out that the root cause was IndexNotFoundException which should be converted to ResourceNotFoundException which is silently ignored.

I believe we shoud just explicitly create the index, because testGetTaskWaitForCompletionWithoutStoringResult is supposed to test task completion, not the error handling for missed indexes which is done in testGetTaskNotFound and testTasksGetWaitForNoTask.

@henningandersen
Copy link
Contributor

@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? Since it if it is within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).

@DaveCTurner
Copy link
Contributor

DaveCTurner commented May 15, 2024 via email

@arteam
Copy link
Contributor Author

arteam commented May 21, 2024

@elasticmachine update branch

@arteam
Copy link
Contributor Author

arteam commented May 21, 2024

@henningandersen That was a very good catch! getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion. There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously. I've changed waitForCompletionTestCase to unblock the task only after the request started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch when task is running.

The part about the missed index seems to irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of running task which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.

@arteam arteam requested a review from henningandersen May 21, 2024 07:55
@arteam
Copy link
Contributor Author

arteam commented May 24, 2024

@henningandersen Any chance you would be able to get a look at the changes in the PR?

@henningandersen
Copy link
Contributor

There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously

Did you manage to reproduce this by putting in a sleep somewhere? I'd like to fully understand the situation.

@arteam
Copy link
Contributor Author

arteam commented May 27, 2024

@henningandersen Yes, the error is reproduced trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

 // Unblock the request so the wait for completion request can finish
client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get();
Thread.sleep(1000);
// Spin up a request to wait for the test task to finish
waitResponseFuture = wait.apply(taskId);

@henningandersen
Copy link
Contributor

trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

@arteam
Copy link
Contributor Author

arteam commented May 27, 2024

I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

@henningandersen I believe the issue is that order is undefined since both operations are run asynchronously. We do not check that the request clusterAdmin().prepareGetTask(id).setWaitForCompletion(true) is finished before unblocking the task. We just get a ActionFuture and immediately call client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get().

So, depending on a race which of one these requests will be processed first, we will get a different result. That's why taskManager.getTask(request.getTaskId().getId()) in TransportGetTaskAction can return null which happens if the unlock request manages to win the race.

@henningandersen
Copy link
Contributor

Thanks, that makes sense. I would have hoped we could put in a simple sleep somewhere to provoke it but I were not successful on that yet.

@henningandersen
Copy link
Contributor

I can reproduce this using -Dtests.seed=F52B12BE60A068C8 and a sleep at the beginning of getRunningTaskFromNode.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thanks for the extra iterations, this version looks good (have a few smaller comments only).

@Override
public void onRemovedTaskListenerRegistered(RemovedTaskListener removedTaskListener) {
// Unblock the request only after it started waiting for task completion
if (removedTaskListener.toString().startsWith("Completing running task Task{id=" + taskId.getId())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit strange, I think it works without it too, since there should be no other wait for completions going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henningandersen There seems to be a bug in TestTaskPlugin#TransportTestTaskAction. It checks whether a task is blocked by running waitUntil for 10 seconds, but doesn't check whether waitUntil finished successfully.

…de/tasks/get/TransportGetTaskAction.java

Co-authored-by: Henning Andersen <[email protected]>
@arteam
Copy link
Contributor Author

arteam commented May 28, 2024

@elasticmachine update branch

@arteam arteam merged commit e622101 into main May 29, 2024
16 checks passed
@arteam arteam deleted the save-fake-tasks-to-create-task-index branch May 29, 2024 07:15
@arteam
Copy link
Contributor Author

arteam commented May 29, 2024

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] TasksIT testGetTaskWaitForCompletionWithoutStoringResult failing
5 participants