ReplicationOperation should fail gracefully (#115341)
Problem:
finishAsFailed could be called asynchronously in the
middle of operations such as runPostReplicationActions, which tries
to sync the translog. finishAsFailed immediately triggers the failure
of the resultListener, which releases the index shard primary
operation permit. This means that runPostReplicationActions may
try to sync the translog without holding an operation permit.

Solution:
We refactor how ReplicationOperation tracks pendingActions and
completes the resultListener, replacing both with a
RefCountingListener. This way, if there are async failures, they
are aggregated, and the result listener is called once, after all
in-flight operations are done.
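
As a rough illustration of the ref-counting idea described above, here is a
minimal, self-contained sketch. It is not the Elasticsearch
RefCountingListener and does not reflect the code in this commit: the class
RefCountingSketch and its methods acquire, fail and close are hypothetical
names chosen for the example. It only shows the behavior the message
describes: a failure is recorded rather than delivered immediately, and the
final callback runs once, after every acquired reference has been released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical, simplified stand-in for a ref-counting listener (assumption, not the real class). */
final class RefCountingSketch {
    // Starts at 1: the creator holds one reference until it calls close().
    private final AtomicInteger refs = new AtomicInteger(1);
    private final List<Exception> failures = new ArrayList<>();
    private final Consumer<List<Exception>> onDone;

    RefCountingSketch(Consumer<List<Exception>> onDone) {
        this.onDone = onDone;
    }

    /** Acquire a reference for one in-flight action; run the returned Runnable exactly once to release it. */
    Runnable acquire() {
        refs.incrementAndGet();
        return this::release;
    }

    /** Record a failure without completing; it is reported when the last reference is released. */
    void fail(Exception e) {
        synchronized (failures) {
            failures.add(e);
        }
    }

    /** Release the creator's initial reference. */
    void close() {
        release();
    }

    private void release() {
        // Only the release that drops the count to zero invokes the final callback.
        if (refs.decrementAndGet() == 0) {
            List<Exception> snapshot;
            synchronized (failures) {
                snapshot = List.copyOf(failures);
            }
            onDone.accept(snapshot);
        }
    }
}
```

In this sketch, an action comparable to runPostReplicationActions would hold a
reference from acquire() while it works. A concurrent fail(...) call followed
by close() would not invoke the final callback until that reference is
released, so whatever the callback frees (the operation permit, in the
scenario above) stays held for the duration of the action.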

For the specific error we got in issue #97183, this means that
a call to onNoLongerPrimary (which can happen if we fail to fail
a replica shard or mark it as stale) will not immediately release
the primary operation permit and the assertion in the translog sync
will be honored.

Fixes #97183
kingherc authored Nov 1, 2024
1 parent 8eb4d04 commit fc1d9d0
Showing 2 changed files with 214 additions and 195 deletions.