ReplicationOperation should fail gracefully #115341

kingherc · 2024-10-22T15:23:22Z

Problem:
finishAsFailed could be called before, or even asynchronously in the middle of, operations like runPostReplicationActions which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener which releases the index shard primary operation permit. This means that runPostReplicationActions may try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding pendingActions and the resultListener, by replacing them with a RefCountingListener. This way, if there are failures, the result listener is called once, after all pending actions are done.

For the specific error we got in issue #97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or mark it as stale) will not immediately release the primary operation permit and the assertion in the translog sync will be honored.
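To illustrate the idea, here is a minimal standalone sketch of a ref-counting listener. It is not the Elasticsearch `RefCountingListener`: the class `RefCountingSketch` and its methods `acquire`, `fail`, `release`, and `demo` are hypothetical names invented for this example. It only shows the ordering this PR aims for: a failure recorded mid-way does not complete the result listener (and so would not release the permit) until every pending action has released its reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical, simplified stand-in for a ref-counting listener (assumption, not the real class). */
final class RefCountingSketch {
    // Starts at 1: the creator holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    // Receives null on success, otherwise the first recorded failure.
    private final Consumer<Exception> onDone;

    RefCountingSketch(Consumer<Exception> onDone) {
        this.onDone = onDone;
    }

    /** Acquires a reference for one pending action; run the returned Runnable when it finishes. */
    Runnable acquire() {
        refs.incrementAndGet();
        return this::release;
    }

    /** Records a failure without completing; later failures are attached as suppressed. */
    void fail(Exception e) {
        if (!failure.compareAndSet(null, e)) {
            Exception first = failure.get();
            if (first != e) {
                first.addSuppressed(e);
            }
        }
    }

    /** Releases one reference; the last release completes the delegate exactly once. */
    void release() {
        if (refs.decrementAndGet() == 0) {
            onDone.accept(failure.get());
        }
    }

    /** Simulates a failure arriving while a post-replication action is still pending. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        RefCountingSketch listener = new RefCountingSketch(
            e -> events.add(e == null ? "permit released: success" : "permit released: " + e.getMessage()));
        Runnable syncDone = listener.acquire();                      // pending translog sync
        listener.fail(new IllegalStateException("no longer primary")); // failure arrives mid-way
        events.add("failure recorded");
        events.add("translog synced");                               // still runs under the permit
        syncDone.run();                                              // pending action finished
        listener.release();                                          // creator's initial reference
        return events;
    }
}
```

In this sketch, `demo()` records the failure first, yet the "permit released" event is only appended after the simulated translog sync has finished.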

Fixes #97183

Problem: finishAsFailed could be called asynchronously in the middle of operations like runPostReplicationActions which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener which releases the index shard primary operation permit. This means that runPostReplicationActions may try to sync the translog without an operation permit. Solution: We refactor the infrastructure of ReplicationOperation regarding pendingActions and the resultListener, by replacing them with a RefCountingListener. This way, if there are async failures, they are aggregated, and the result listener is called once, after all mid-way operations are done. For the specific error we got in issue elastic#97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or mark it as stale) will not immediately release the primary operation permit and the assertion in the translog sync will be honored. Fixes elastic#97183

elasticsearchmachine · 2024-10-22T17:26:10Z

Pinging @elastic/es-distributed (Team:Distributed)


Tim-Brooks · 2024-10-28T23:19:26Z

I should be able to dig into reviewing this tomorrow.


Tim-Brooks

LGTM. All the comments are optional. Just things around listener methods.

server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java


kingherc · 2024-10-31T18:14:00Z

Thanks @Tim-Brooks! I incorporated your feedback and also made some small changes to ensure no exception escapes (by using try-with-resources for the ActionListenerRef and ActionListener.run in a couple of places). Feel free to give another quick look if you'd like.
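As a rough illustration of the try-with-resources point, the sketch below uses a hypothetical `ClosingRefsSketch` class (an assumption for this example, not a class from the PR). Closing the resource releases the creator's initial reference, so the completion callback still fires once even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical closeable ref counter; closing it releases the creator's initial reference. */
final class ClosingRefsSketch implements AutoCloseable {
    private final AtomicInteger refs = new AtomicInteger(1);
    private final Runnable onAllReleased;

    ClosingRefsSketch(Runnable onAllReleased) {
        this.onAllReleased = onAllReleased;
    }

    /** Acquires a reference for a pending action; run the returned Runnable to release it. */
    Runnable acquire() {
        refs.incrementAndGet();
        return this::release;
    }

    private void release() {
        if (refs.decrementAndGet() == 0) {
            onAllReleased.run();
        }
    }

    @Override
    public void close() {
        release();
    }

    /** Shows that the callback still runs when the body of the try block throws. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        try (ClosingRefsSketch refs = new ClosingRefsSketch(() -> events.add("completed"))) {
            Runnable pending = refs.acquire();
            events.add("started");
            pending.run();                           // the pending action finishes
            throw new IllegalStateException("boom"); // the body fails afterwards
        } catch (IllegalStateException e) {
            // Resources are closed before the catch clause runs.
            events.add("caught: " + e.getMessage());
        }
        return events;
    }
}
```

Because the resource is closed before the `catch` clause executes, the initial reference cannot be leaked by an early exit from the block.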

Tim-Brooks · 2024-10-31T23:17:13Z

> Feel free to give another quick look if you'd like.

LGTM


We introduce ActionListener.run() to ensure that the RefCountingListener introduced by PR #115341 is the single point that is failed upon exceptions, and that no exception escapes through the ReplicationOperation.execute() method. Fixes #116071 Fixes #116081 Fixes #116073
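The pattern described here can be sketched in isolation as follows. This is a simplified stand-in under stated assumptions, not the Elasticsearch `ActionListener` API: the `Listener` and `CheckedBody` interfaces and the `RunSketch.run` helper are hypothetical names defined only for this example. The helper runs a body and routes anything it throws to the listener's failure path instead of letting it propagate to the caller.

```java
import java.util.ArrayList;
import java.util.List;

final class RunSketch {
    /** Minimal listener interface for this sketch only (assumption). */
    interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    /** A body that may throw while working with the listener. */
    interface CheckedBody<T> {
        void accept(Listener<T> listener) throws Exception;
    }

    /** Runs the body; any exception is delivered to the listener rather than the caller. */
    static <T> void run(Listener<T> listener, CheckedBody<T> body) {
        try {
            body.accept(listener);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }

    /** Runs one successful body and one throwing body against the same listener. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        Listener<String> listener = new Listener<String>() {
            @Override
            public void onResponse(String value) {
                events.add("ok: " + value);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("failed: " + e.getMessage());
            }
        };
        run(listener, l -> l.onResponse("replicated"));
        run(listener, l -> {
            throw new IllegalStateException("shard closed");
        });
        return events;
    }
}
```

Neither call in `demo()` throws to its caller; the second failure surfaces only through `onFailure`.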



kingherc self-assigned this Oct 22, 2024

elasticsearchmachine added the v9.0.0 label Oct 22, 2024

kingherc marked this pull request as ready for review October 22, 2024 17:25

elasticsearchmachine added the needs:risk label (requires assignment of a risk label: low, medium, blocker) Oct 22, 2024

kingherc requested review from DaveCTurner and Tim-Brooks October 22, 2024 17:26

kingherc added 2 commits October 23, 2024 18:27

Merge remote-tracking branch 'origin/main' into test-failure/97183-tr…

9dccacb

…anslog-shutdown

Merge remote-tracking branch 'origin/main' into test-failure/97183-tr…

ea7f5d2

…anslog-shutdown

Merge remote-tracking branch 'origin/main' into test-failure/97183-tr…

3afecab

…anslog-shutdown

Tim-Brooks approved these changes Oct 30, 2024


kingherc added 3 commits October 31, 2024 10:24

Merge remote-tracking branch 'origin/main' into test-failure/97183-tr…

b603a38

…anslog-shutdown

PR feedback and a bit safer listeners

cedab56

Such as moving RefCountingListener to try with resources, and adding ActionListener.run in a couple of places.

Merge remote-tracking branch 'origin/main' into test-failure/97183-tr…

b04f840

…anslog-shutdown

kingherc merged commit fc1d9d0 into elastic:main Nov 1, 2024
16 checks passed

kingherc deleted the test-failure/97183-translog-shutdown branch November 1, 2024 06:10

kingherc mentioned this pull request Nov 1, 2024

SimpleBlocksIT.testAddBlockWhileDeletingIndices failing #116071

Closed

kingherc mentioned this pull request Nov 1, 2024

ReplicationOperation exceptions should not escape #116074

Merged

This was referenced Nov 1, 2024

[CI] CloseIndexIT testCloseWhileDeletingIndices failing #116081

Closed

[CI] CloseIndexIT testConcurrentClose failing #116073

Closed

[CI] ShardFollowTaskReplicationTests testRetryBulkShardOperations failing #116080

Closed
