Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReplicationOperation should fail gracefully #115341

Merged

Conversation

kingherc
Copy link
Contributor

@kingherc kingherc commented Oct 22, 2024

Problem:
finishAsFailed could be called before, or even asynchronously in the middle of, operations like runPostReplicationActions which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener which releases the index shard primary operation permit. This means that runPostReplicationActions may try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding pendingActions and the resultListener, by replacing them with a RefCountingListener. This way, if there are failures, the result listener is called once, after all pending actions are done.

For the specific error we got in issue #97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or mark it as stale) will not immediately release the primary operation permit and the assertion in the translog sync will be honored.

Fixes #97183

Problem:
finishAsFailed could be called asynchronously in the
middle of operations like runPostReplicationActions which try to
sync the translog. finishAsFailed immediately triggers the failure
of the resultListener which releases the index shard primary
operation permit. This means that runPostReplicationActions may
try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding
pendingActions and the resultListener, by replacing them with a
RefCountingListener. This way, if there are async failures, they
are aggregated, and the result listener is called once, after all
mid-way operations are done.

For the specific error we got in issue elastic#97183, this means that
a call to onNoLongerPrimary (which can happen if we fail to fail
a replica shard or mark it as stale) will not immediately release
the primary operation permit and the assertion in the translog sync
will be honored.

Fixes elastic#97183
@kingherc kingherc added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >test-failure Triaged test failures from CI Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Oct 22, 2024
@kingherc kingherc self-assigned this Oct 22, 2024
@kingherc kingherc marked this pull request as ready for review October 22, 2024 17:25
@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Oct 22, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@Tim-Brooks
Copy link
Contributor

Tim-Brooks commented Oct 28, 2024

I should be able to dig into review this tomorrow.

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. All the comments are optional. Just things around listener methods.

Such as moving RefCountingListener to try with resources, and
adding ActionListener.run in a couple of places.
@kingherc
Copy link
Contributor Author

Thanks @Tim-Brooks ! Incorporated your feedback, and also made some nit changes to ensure no exception escapes through (by making use of try-with-resources for the ActionListenerRef and some ActionListener.runs in a couple of places). Feel free to give another quick look if you'd like.

@Tim-Brooks
Copy link
Contributor

Feel free to give another quick look if you'd like.

LGTM

@kingherc kingherc merged commit fc1d9d0 into elastic:main Nov 1, 2024
16 checks passed
@kingherc kingherc deleted the test-failure/97183-translog-shutdown branch November 1, 2024 06:10
kingherc added a commit to kingherc/elasticsearch that referenced this pull request Nov 1, 2024
We introduce ActionListener.run() in order to ensure the
RefCountingListener introduced by PR elastic#115341 , is the single point
that is failed upon exceptions, and no exception escapes through
the ReplicationOperation.execute() method.

Fixes elastic#116071
kingherc added a commit that referenced this pull request Nov 1, 2024
We introduce ActionListener.run() in order to ensure the
RefCountingListener introduced by PR #115341 , is the single point
that is failed upon exceptions, and no exception escapes through
the ReplicationOperation.execute() method.

Fixes #116071
Fixes #116081
Fixes #116073
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
Problem:
finishAsFailed could be called asynchronously in the
middle of operations like runPostReplicationActions which try to
sync the translog. finishAsFailed immediately triggers the failure
of the resultListener which releases the index shard primary
operation permit. This means that runPostReplicationActions may
try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding
pendingActions and the resultListener, by replacing them with a
RefCountingListener. This way, if there are async failures, they
are aggregated, and the result listener is called once, after all
mid-way operations are done.

For the specific error we got in issue elastic#97183, this means that
a call to onNoLongerPrimary (which can happen if we fail to fail
a replica shard or mark it as stale) will not immediately release
the primary operation permit and the assertion in the translog sync
will be honored.

Fixes elastic#97183
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
We introduce ActionListener.run() in order to ensure the
RefCountingListener introduced by PR elastic#115341 , is the single point
that is failed upon exceptions, and no exception escapes through
the ReplicationOperation.execute() method.

Fixes elastic#116071
Fixes elastic#116081
Fixes elastic#116073
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndexRecoveryIT testPeerRecoveryTrimsLocalTranslog failing
3 participants