MasterService does not complete all tasks on shutdown #94930
Labels
>bug
:Distributed Coordination/Cluster Coordination
Cluster formation and cluster state publication, including cluster membership and fault detection.
Team:Distributed
Meta label for distributed team (obsolete)
Today when the
MasterService
shuts down, it fails waiting tasks but does not necessarily fail the ongoing batch of tasks. For instance, we just drop the batch on the floor here:elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Lines 209 to 213 in 356e109
and we swallow rejections here:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Lines 398 to 405 in 356e109
This behaviour has existed for a long time (i.e. it was not introduced by recent changes in the area such as #92021 and #94325) but I still think we should improve it. Note however that it does not work simply to fail the ongoing tasks on rejection: today with acked tasks we call (at most) one of
onAllNodesAcked()
,onAckFailure()
,onAckTimeout()
, orClusterStateTaskListener#onFailure()
, and implementations rely on this fact, but we may experience a rejection exception after acking has completed. I think that means we have to delay the acking until the end of the publication, because the alternative would be to suppressonFailure()
calls for acked tasks which seems like a confusing API choice that will lead to bugs.The text was updated successfully, but these errors were encountered: