Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult #108094
Make sure the `.tasks` index is created before we start testing task completion without storing its result. To achieve that, we store a fake task before we start `waitForCompletionTestCase`. Resolves #107823
Pinging @elastic/es-distributed (Team:Distributed) |
@elasticmachine update branch |
@elasticmachine update branch |
The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead? |
@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the completion of a task, and the task only completes because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling can't figure out that the root cause was [...]. I believe we should just explicitly create the index, because [...]
@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? If it is within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).
This reverts commit bf3b27d.
@elasticmachine update branch |
@henningandersen That was a very good catch! The part about the missed index seems to be irrelevant, since [...]
@henningandersen Any chance you would be able to take a look at the changes in the PR?
Did you manage to reproduce this by putting in a sleep somewhere? I'd like to fully understand the situation. |
@henningandersen Yes, the error is reproduced trivially if you unblock the request first and add a small delay before issuing the wait request:
// Unblock the request so the wait for completion request can finish
client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get();
Thread.sleep(1000);
// Spin up a request to wait for the test task to finish
waitResponseFuture = wait.apply(taskId);
I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail? |
@henningandersen I believe the issue is that the order is undefined, since both operations run asynchronously. We do not check that the request [...]. So, depending on which of these requests is processed first, we get a different result. That's why [...]
Thanks, that makes sense. I would have hoped we could put in a simple sleep somewhere to provoke it, but I have not been successful with that yet.
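To make the ordering problem concrete, here is a minimal sketch outside Elasticsearch. Everything in it is a hypothetical stand-in (`OrderedUnblockSketch`, `getTask`, the returned strings); it is not the test or `TransportGetTaskAction` code. It only illustrates that the request takes a different branch depending on whether the unblock is processed first, and that gating the unblock on a "waiter registered" signal fixes the order.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test scenario; none of these names are Elasticsearch APIs.
final class OrderedUnblockSketch {

    // Simulates the "get task" request: it takes a different branch depending on
    // whether the task is still running when the request is processed.
    static String getTask(CountDownLatch taskDone, Runnable onWaiterRegistered) {
        if (taskDone.getCount() == 0) {
            return "task already finished";
        }
        onWaiterRegistered.run(); // the request is now waiting for completion
        try {
            if (!taskDone.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("task was never unblocked");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "waited for completion";
    }

    // Unblock wins the race: the request never exercises the waiting branch.
    static String unblockFirst() {
        CountDownLatch taskDone = new CountDownLatch(1);
        taskDone.countDown();
        return getTask(taskDone, () -> {});
    }

    // Unblock is gated on the waiter being registered, so the order is fixed.
    static String unblockAfterRegistration() {
        CountDownLatch taskDone = new CountDownLatch(1);
        CountDownLatch registered = new CountDownLatch(1);
        Thread unblocker = new Thread(() -> {
            try {
                registered.await();
                taskDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        unblocker.start();
        return getTask(taskDone, registered::countDown);
    }

    public static void main(String[] args) {
        System.out.println(unblockFirst());
        System.out.println(unblockAfterRegistration());
    }
}
```

In the gated variant the unblocker cannot run until the request has passed its "is the task still running" check, so the waiting branch is always the one exercised.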
I can reproduce this using |
LGTM.
Thanks for the extra iterations, this version looks good (have a few smaller comments only).
.../main/java/org/elasticsearch/action/admin/cluster/node/tasks/get/TransportGetTaskAction.java
@Override
public void onRemovedTaskListenerRegistered(RemovedTaskListener removedTaskListener) {
    // Unblock the request only after it started waiting for task completion
    if (removedTaskListener.toString().startsWith("Completing running task Task{id=" + taskId.getId())) {
This seems a bit strange, I think it works without it too, since there should be no other wait for completions going on.
@henningandersen There seems to be a bug in `TestTaskPlugin#TransportTestTaskAction`. It checks whether a task is blocked by running `waitUntil` for 10 seconds, but doesn't check whether `waitUntil` finished successfully.
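As a rough illustration of the pattern being described, the sketch below defines its own polling helper that reports whether the condition was met, plus a wrapper that fails when it was not. Both `waitUntil` here and `awaitOrFail` are hypothetical local helpers written for this example. They are not the helper used by `TestTaskPlugin`, whose exact signature is not shown in this conversation.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helpers for illustration only; not the Elasticsearch test utilities.
final class WaitUntilSketch {

    // Polls the condition until it holds or the timeout elapses.
    // Returns true only if the condition was observed to hold.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    // Checks the result instead of silently continuing after a timeout.
    static void awaitOrFail(BooleanSupplier condition, long timeoutMillis) {
        if (!waitUntil(condition, timeoutMillis)) {
            throw new IllegalStateException("condition not met within " + timeoutMillis + " ms");
        }
    }
}
```

Ignoring the boolean result means a timed-out wait is indistinguishable from a successful one, which is the kind of silent failure the comment above points at.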
…de/tasks/get/TransportGetTaskAction.java Co-authored-by: Henning Andersen <[email protected]>
@elasticmachine update branch |
This reverts commit f235b87.
Thank you! |
It seems that the failure (the missed index) has always existed in the test scenario and it's supposed to be handled by TransportGetTaskAction.java. We catch `IndexNotFoundException` here and convert it to `ResourceNotFoundException`. Then we catch `ResourceNotFoundException` here and return a snapshot of a task as a response.

In the stack trace, `getFinishedTaskFromIndex` was called from `getRunningTaskFromNode`, not from `waitedForCompletion`, due to a race between creating a get request and an unblocking request, which are sent asynchronously. I've changed the `waitForCompletionTestCase` test method to unblock the task only after the request started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch when the task is running.

The part about the missed index seems to be irrelevant, since `waitedForCompletion` is able to suppress the error and return a snapshot of the running task, which is not possible if `getFinishedTaskFromIndex` gets called directly from `getRunningTaskFromNode`.

Resolves #107823
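The exception handling described above can be sketched roughly as follows. This is a simplified model under assumptions, not the code in `TransportGetTaskAction`: the two exception classes are local stand-ins for `IndexNotFoundException` and `ResourceNotFoundException`, results are plain strings, and the method bodies are hypothetical. Only the method names mirror the ones mentioned in the description.

```java
// Simplified, hypothetical model of the two code paths; not the Elasticsearch implementation.
final class MissingIndexFallbackSketch {

    // Local stand-ins for IndexNotFoundException and ResourceNotFoundException.
    static final class IndexNotFound extends RuntimeException {
        IndexNotFound(String message) { super(message); }
    }

    static final class ResourceNotFound extends RuntimeException {
        ResourceNotFound(String message, Throwable cause) { super(message, cause); }
    }

    // Looks up the stored result; a missing index is converted to "resource not found".
    static String getFinishedTaskFromIndex(boolean indexExists) {
        try {
            if (!indexExists) {
                throw new IndexNotFound("no such index [.tasks]");
            }
            return "stored result";
        } catch (IndexNotFound e) {
            throw new ResourceNotFound("task result is not stored", e);
        }
    }

    // The "wait for completion" path suppresses the error and falls back to the snapshot.
    static String waitedForCompletion(String snapshotOfRunningTask, boolean indexExists) {
        try {
            return getFinishedTaskFromIndex(indexExists);
        } catch (ResourceNotFound e) {
            return snapshotOfRunningTask;
        }
    }
}
```

In this model a direct call to `getFinishedTaskFromIndex` with a missing index surfaces the error, while going through `waitedForCompletion` returns the snapshot, which matches the difference between the two branches described above.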