Update checkpoints after post-replication actions, even on failure

A failed post-write refresh should not prevent the local checkpoint from advancing if the translog operations have been fsynced correctly, so we should update the checkpoints in all situations. If, on the other hand, the fsync failed, the local checkpoint won't advance anyway and the engine will fail during the next indexing operation.
fcofdez · Jun 19, 2024 · 08ced55
1 parent b60d77e
Showing 1 changed file with 9 additions and 3 deletions.
diff --git a/server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java b/server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java
@@ -187,9 +187,15 @@ public void onResponse(Void aVoid) {
             @Override
             public void onFailure(Exception e) {
                 logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
-                // TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
-                // go out of sync with the primary
-                finishAsFailed(e);
+                // We update the checkpoints because a refresh might fail even though the operations were safely persisted. If the
+                // fsync failed instead, the local checkpoint won't advance and the engine will be marked as failed when the next
+                // indexing operation is appended to the translog.
+                updateCheckPoints(
+                    primary.routingEntry(),
+                    primary::localCheckpoint,
+                    primary::globalCheckpoint,
+                    () -> finishAsFailed(e)
+                );
             }
         });
     }
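
The behaviour introduced by this change can be sketched in isolation. The following is a minimal, hypothetical model and not the Elasticsearch implementation: `Tracker`, `Result`, `run`, and the `persistedSeqNo` parameter are names invented for illustration, and using one value for both checkpoints is a simplification. It only shows the idea that the failure path still publishes the checkpoints before reporting the failure.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified model of the pattern in the diff above.
// None of these types are the real Elasticsearch classes.
final class CheckpointOnFailureSketch {

    /** Stand-in for the primary's checkpoint bookkeeping (assumed, not the real API). */
    static final class Tracker {
        long localCheckpoint = -1;
        long globalCheckpoint = -1;

        // Records both checkpoints, then always runs the completion callback.
        void updateCheckpoints(LongSupplier local, LongSupplier global, Runnable onCompletion) {
            try {
                localCheckpoint = local.getAsLong();
                globalCheckpoint = global.getAsLong();
            } finally {
                onCompletion.run();
            }
        }
    }

    /** Outcome of one simulated operation. */
    static final class Result {
        final long localCheckpoint;
        final boolean failed;

        Result(long localCheckpoint, boolean failed) {
            this.localCheckpoint = localCheckpoint;
            this.failed = failed;
        }
    }

    // persistedSeqNo: highest sequence number assumed to be fsynced already.
    // refreshFails: whether the simulated post-replication action fails.
    static Result run(long persistedSeqNo, boolean refreshFails) {
        Tracker tracker = new Tracker();
        boolean[] failed = { false };
        if (refreshFails) {
            // Failure path: still publish the checkpoints, then report the failure.
            tracker.updateCheckpoints(() -> persistedSeqNo, () -> persistedSeqNo, () -> failed[0] = true);
        } else {
            // Success path: publish the checkpoints and finish normally.
            tracker.updateCheckpoints(() -> persistedSeqNo, () -> persistedSeqNo, () -> {});
        }
        return new Result(tracker.localCheckpoint, failed[0]);
    }

    public static void main(String[] args) {
        Result r = run(41, true);
        System.out.println("local checkpoint=" + r.localCheckpoint + ", failed=" + r.failed);
    }
}
```

In this sketch the operation is still reported as failed, but the checkpoint reflects what was persisted, which mirrors the reasoning in the commit message.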