AsyncOperator#isFinished must never return true on failure #104029

Merged Jan 8, 2024 (3 commits)

@dnhatn dnhatn commented Jan 8, 2024

I have had several CI instances running ESQL tests over the past week. They found cases where the enrich IT tests return OK with some results missing, instead of a failure, when the enrich lookup hits circuit breakers. The cause is a race between isFinished and onFailure in AsyncOperator. When an async lookup fails, onFailure sets the exception and then discards pages. Unfortunately, isFinished performs its checks in the same order: it checks for a failure first and for outstanding pages second. If there is a long pause between those two steps, isFinished can miss the failure (not yet set at the first check) and then see no outstanding pages (already discarded by the time of the second check), so it returns true despite the failure. This change swaps the order of the checks: isFinished now checks for outstanding pages first and for a failure second, so a failure that caused the pages to be discarded is always observed.
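The interleaving described above can be reproduced deterministically in a small model. This is only a sketch under stated assumptions, not the Elasticsearch source: the class `AsyncOperatorSketch`, its fields, the `checkFailure` helper, and the `pauseBetweenChecks` hook (which stands in for the long pause between the two checks) are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the race; not the actual AsyncOperator code.
final class AsyncOperatorSketch {
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    private final AtomicInteger outstandingPages = new AtomicInteger();

    void addOutstandingPage() {
        outstandingPages.incrementAndGet();
    }

    // Failure path as described in the PR: set the exception, then discard pages.
    void onFailure(Exception e) {
        failure.compareAndSet(null, e);
        outstandingPages.set(0);
    }

    private void checkFailure() {
        Exception e = failure.get();
        if (e != null) {
            throw new IllegalStateException("async lookup failed", e);
        }
    }

    // Old order: failure first, pages second. The hook simulates a long pause.
    boolean isFinishedOldOrder(Runnable pauseBetweenChecks) {
        checkFailure();
        pauseBetweenChecks.run();
        return outstandingPages.get() == 0;
    }

    // Swapped order: pages first, failure second.
    boolean isFinishedSwappedOrder(Runnable pauseBetweenChecks) {
        boolean noOutstandingPages = outstandingPages.get() == 0;
        pauseBetweenChecks.run();
        if (noOutstandingPages == false) {
            return false;
        }
        checkFailure();
        return true;
    }

    public static void main(String[] args) {
        AsyncOperatorSketch old = new AsyncOperatorSketch();
        old.addOutstandingPage();
        // The failure lands between the two checks.
        boolean finished = old.isFinishedOldOrder(() -> old.onFailure(new RuntimeException("circuit breaker")));
        System.out.println("old order reports finished: " + finished);

        AsyncOperatorSketch swapped = new AsyncOperatorSketch();
        swapped.addOutstandingPage();
        boolean first = swapped.isFinishedSwappedOrder(() -> swapped.onFailure(new RuntimeException("circuit breaker")));
        System.out.println("swapped order, first call: " + first);
        try {
            swapped.isFinishedSwappedOrder(() -> {});
        } catch (IllegalStateException e) {
            System.out.println("swapped order, next call throws: " + e.getMessage());
        }
    }
}
```

In this model the old order reports the operator as finished even though a failure was recorded. With the swapped order, a caller either sees outstanding pages (and gets false) or sees none and then runs the failure check. Since the writer sets the exception before discarding pages, reading in the opposite order cannot miss it.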

@elasticsearchmachine
Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn requested review from ChrisHegarty and nik9000 January 8, 2024 06:40
@dnhatn dnhatn marked this pull request as ready for review January 8, 2024 06:41
@elasticsearchmachine
Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine added the Team:Analytics label Jan 8, 2024
@ChrisHegarty ChrisHegarty left a comment

LGTM

dnhatn commented Jan 8, 2024

@ChrisHegarty @nik9000 Thanks for reviewing.

@dnhatn dnhatn merged commit 22934b8 into elastic:main Jan 8, 2024
15 checks passed
@dnhatn dnhatn deleted the fix-async-operator branch January 8, 2024 16:52
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 8, 2024
…04029)
@elasticsearchmachine
elasticsearchmachine commented Jan 8, 2024

Status Branch Result
8.12
8.11

elasticsearchmachine pushed a commit that referenced this pull request Jan 8, 2024
…104070)
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 8, 2024
…04029)
elasticsearchmachine pushed a commit that referenced this pull request Jan 8, 2024
…104079)
Labels
:Analytics/ES|QL, >bug, Team:Analytics, v8.11.4, v8.12.1, v8.13.0