MasterService does not complete all tasks on shutdown #94930

DaveCTurner · 2023-03-31T07:27:43Z

Today when the MasterService shuts down, it fails waiting tasks but does not necessarily fail the ongoing batch of tasks. For instance, we just drop the batch on the floor here:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java

Lines 209 to 213 in 356e109

    
    if (lifecycle.started() == false) {
        logger.debug("processing [{}]: ignoring, master service not started", summary);
        listener.onResponse(null);
        return;
    }

and we swallow rejections here:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java

Lines 398 to 405 in 356e109

    
    assert publicationMayFail() || (exception instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown())
        : exception;
    clusterStateUpdateStatsTracker.onPublicationFailure(
        threadPool.rawRelativeTimeInMillis(),
        clusterStatePublicationEvent,
        0L
    );
    handleException(summary, publicationStartTime, newClusterState, exception);

This behaviour has existed for a long time (i.e. it was not introduced by recent changes in the area such as #92021 and #94325), but I still think we should improve it. Note however that simply failing the ongoing tasks on rejection does not work. Today, for acked tasks, we call at most one of onAllNodesAcked(), onAckFailure(), onAckTimeout(), or ClusterStateTaskListener#onFailure(), and implementations rely on this fact, but we may experience a rejection exception after acking has completed. I think that means we have to delay the acking until the end of the publication. The alternative would be to suppress onFailure() calls for acked tasks, which seems like a confusing API choice that will lead to bugs.
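To make the "delay the acking" idea concrete, here is a minimal sketch in a simplified model. It is not the MasterService implementation: `AckedListener`, `DeferredAcks` and all of their methods are hypothetical names invented for this illustration, and `onAckTimeout()` is left out for brevity. The ack outcome is buffered and only delivered when the publication finishes, so a late rejection results in a single `onFailure()` call.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredAckSketch {

    /** Hypothetical, cut-down listener; the real interfaces have more methods. */
    interface AckedListener {
        void onAllNodesAcked();
        void onAckFailure(Exception e);
        void onFailure(Exception e);
    }

    /** Buffers the ack outcome and delivers it only when the publication ends. */
    static final class DeferredAcks {
        private final AckedListener delegate;
        private Runnable pendingAck; // first ack outcome seen, not yet delivered
        private boolean completed;   // true once any callback has been delivered

        DeferredAcks(AckedListener delegate) {
            this.delegate = delegate;
        }

        synchronized void allNodesAcked() {
            if (pendingAck == null) pendingAck = delegate::onAllNodesAcked;
        }

        synchronized void ackFailed(Exception e) {
            if (pendingAck == null) pendingAck = () -> delegate.onAckFailure(e);
        }

        /** Publication finished normally: deliver the buffered ack outcome, if any. */
        synchronized void publicationCompleted() {
            if (completed) return;
            completed = true;
            if (pendingAck != null) pendingAck.run();
        }

        /** Publication failed (e.g. rejected on shutdown): report only the failure. */
        synchronized void publicationFailed(Exception e) {
            if (completed) return;
            completed = true;
            delegate.onFailure(e);
        }
    }

    /** Records which callbacks fire when acking completes before the publication ends. */
    static List<String> demo(boolean rejectAfterAck) {
        List<String> calls = new ArrayList<>();
        DeferredAcks acks = new DeferredAcks(new AckedListener() {
            @Override public void onAllNodesAcked() { calls.add("onAllNodesAcked"); }
            @Override public void onAckFailure(Exception e) { calls.add("onAckFailure"); }
            @Override public void onFailure(Exception e) { calls.add("onFailure"); }
        });
        acks.allNodesAcked();
        if (rejectAfterAck) {
            acks.publicationFailed(new IllegalStateException("executor shut down"));
        } else {
            acks.publicationCompleted();
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // [onFailure]
        System.out.println(demo(false)); // [onAllNodesAcked]
    }
}
```

In this model the listener sees exactly one callback in both cases, which is the property implementations rely on. Whether buffering of this kind fits the real publication code path is an open question for the actual change.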


elasticsearchmachine · 2023-03-31T07:28:13Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels Mar 31, 2023

elasticsearchmachine added the Team:Distributed label Mar 31, 2023

DaveCTurner mentioned this issue Mar 31, 2023

Clean up MasterService lifecycle checks #94897

Draft

andreidan mentioned this issue Mar 22, 2024

Fix testDataStreamLifecycleDownsampleRollingRestart #106664

Merged
