ReplicationOperation should fail gracefully #115341
Problem:
finishAsFailed could be called asynchronously in the middle of operations like runPostReplicationActions which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener which releases the index shard primary operation permit. This means that runPostReplicationActions may try to sync the translog without an operation permit.
Solution:
We refactor the infrastructure of ReplicationOperation regarding pendingActions and the resultListener, by replacing them with a RefCountingListener. This way, if there are async failures, they are aggregated, and the result listener is called once, after all mid-way operations are done.
For the specific error we got in issue elastic#97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or mark it as stale) will not immediately release the primary operation permit and the assertion in the translog sync will be honored.
Fixes elastic#97183
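To illustrate the idea, here is a minimal sketch of ref-counted completion. It does not use the Elasticsearch classes: SketchRefCountingListener and RefCountingDemo are hypothetical stand-ins written for this example, and the "permit" is only simulated by the order of recorded events. It shows a failure arriving while another action is still pending, with the final callback deferred until the last reference is released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in; NOT the Elasticsearch RefCountingListener.
final class SketchRefCountingListener implements AutoCloseable {
    // Starts at 1: the creator holds the initial reference until close().
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    // Called exactly once: with null on success, or with the aggregated failure.
    private final Consumer<Exception> delegate;

    SketchRefCountingListener(Consumer<Exception> delegate) {
        this.delegate = delegate;
    }

    // Acquire a reference for one pending action. The returned handle must be
    // completed exactly once (null means success).
    Consumer<Exception> acquire() {
        refs.incrementAndGet();
        return e -> {
            if (e != null) {
                recordFailure(e);
            }
            release();
        };
    }

    // The first failure wins; later ones are attached as suppressed exceptions.
    private void recordFailure(Exception e) {
        if (!failure.compareAndSet(null, e)) {
            Exception first = failure.get();
            if (first != e) {
                first.addSuppressed(e);
            }
        }
    }

    private void release() {
        if (refs.decrementAndGet() == 0) {
            delegate.accept(failure.get());
        }
    }

    @Override
    public void close() {
        release(); // drop the creator's initial reference
    }
}

final class RefCountingDemo {
    static List<String> run() {
        List<String> events = new ArrayList<>();
        List<Consumer<Exception>> pending = new ArrayList<>();
        try (SketchRefCountingListener refs = new SketchRefCountingListener(
                e -> events.add(e == null
                        ? "permit released: success"
                        : "permit released: " + e.getMessage()))) {
            pending.add(refs.acquire()); // e.g. a replica-related action
            pending.add(refs.acquire()); // e.g. post-replication actions
        }
        // An async failure arrives first; the final callback is NOT fired yet.
        pending.get(0).accept(new IllegalStateException("no longer primary"));
        events.add("translog sync (permit still held)");
        // The last pending action completes; only now does the delegate run.
        pending.get(1).accept(null);
        return events;
    }
}
```

Calling RefCountingDemo.run() records the simulated translog sync before the simulated permit release, which is the ordering the refactoring aims to guarantee.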
Pinging @elastic/es-distributed (Team:Distributed)
I should be able to dig into reviewing this tomorrow.
LGTM. All the comments are optional. Just things around listener methods.
For example, moving RefCountingListener into a try-with-resources block and adding ActionListener.run in a couple of places.
Thanks @Tim-Brooks! I incorporated your feedback and also made some nit changes to ensure no exception escapes (by using try-with-resources for the ActionListenerRef and ActionListener.run in a couple of places). Feel free to give it another quick look if you'd like.
LGTM
We introduce ActionListener.run() to ensure that the RefCountingListener introduced by PR elastic#115341 is the single point that is failed upon exceptions, and that no exception escapes through the ReplicationOperation.execute() method. Fixes elastic#116071 Fixes elastic#116081 Fixes elastic#116073
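The pattern described here can be sketched roughly as follows. This is an illustration only: SketchListener, ThrowingConsumer, SketchListeners.run and RunDemo are hypothetical names made up for this example and are not the Elasticsearch ActionListener API. The point is that an exception thrown by the action is routed to the listener's failure path instead of escaping to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal listener interface (stand-in, not the Elasticsearch type).
interface SketchListener<T> {
    void onResponse(T value);
    void onFailure(Exception e);
}

// Hypothetical functional interface for an action that may throw.
@FunctionalInterface
interface ThrowingConsumer<L> {
    void accept(L listener) throws Exception;
}

final class SketchListeners {
    // Runs the action with the listener. If the action throws, the exception
    // is delivered to listener.onFailure rather than propagating to the caller.
    static <T> void run(SketchListener<T> listener,
                        ThrowingConsumer<SketchListener<T>> action) {
        try {
            action.accept(listener);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }
}

final class RunDemo {
    static List<String> run() {
        List<String> events = new ArrayList<>();
        SketchListener<String> listener = new SketchListener<String>() {
            @Override
            public void onResponse(String value) {
                events.add("response: " + value);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("failure: " + e.getMessage());
            }
        };
        // The action throws: the exception ends up in onFailure, not at the call site.
        SketchListeners.run(listener, l -> {
            throw new IllegalStateException("shard closed");
        });
        // The action completes normally and notifies the listener itself.
        SketchListeners.run(listener, l -> l.onResponse("ok"));
        return events;
    }
}
```

Combined with a single ref-counted listener, this kind of wrapper means every failure is funnelled into one place, so the caller of an execute-style method never has to handle both a thrown exception and a listener callback.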
Problem:
finishAsFailed could be called before, or even asynchronously in the middle of, operations like runPostReplicationActions which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener which releases the index shard primary operation permit. This means that runPostReplicationActions may try to sync the translog without an operation permit.
Solution:
We refactor the infrastructure of ReplicationOperation regarding pendingActions and the resultListener, by replacing them with a RefCountingListener. This way, if there are failures, the result listener is called once, after all pending actions are done.
For the specific error we got in issue #97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or mark it as stale) will not immediately release the primary operation permit and the assertion in the translog sync will be honored.
Fixes #97183