Improve handling of CCR threadpool rejections #92449
Labels
>bug
:Distributed Indexing/CCR
Issues around the Cross Cluster State Replication features
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
The CCR threadpool uses a fixed executor with default size 32 and default queue length 100, which means it rejects work if overloaded. However, it does not look like we handle these rejections very gracefully in several spots, even though the overload might be a transient situation:
- `ShardChangesAction.TransportAction#asyncShardOperation` adds a GCP listener to run on the `ccr` pool, which if rejected looks like it might suppress some other notifications and propagate up into the `ReplicationTracker`.
- `ShardFollowTasksExecutor` is a `PersistentTasksExecutor` which executes the task on the `ccr` pool, and on rejection the task is just marked as failed.
- `ShardFollowTasksExecutor#nodeOperation` also just fails the task.
- `ShardFollowNodeTask#scheduleBackgroundRetentionLeaseRenewal` uses `scheduleWithFixedDelay`, which just stops running the scheduled task on rejection.
- `AutoFollower#finalise` looks like it might call itself ad infinitum on rejection?
- `CcrRepository#restoreShard` uses `scheduleWithFixedDelay`, which just stops running the scheduled task on rejection. Possibly this is ok? If the restore fails I expect we will retry.
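
To make the underlying failure mode concrete, here is a minimal sketch in plain JDK Java of a fixed-size pool with a bounded queue rejecting work once both the threads and the queue are full. It is not Elasticsearch code: `RejectionDemo` and `countRejections` are hypothetical names, and the real `ccr` pool is built by Elasticsearch's own thread pool infrastructure, which may differ in detail. The sizes 32 and 100 are the defaults quoted above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    /**
     * Submits {@code tasks} blocking tasks to a fixed pool with a bounded queue
     * and returns how many submissions were rejected. (Hypothetical helper.)
     */
    static int countRejections(int threads, int queueSize, int tasks) throws InterruptedException {
        // core == max gives a fixed-size pool; the default handler (AbortPolicy)
        // throws RejectedExecutionException when threads and queue are both full.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks until released, simulating an overloaded pool.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // A caller that only logs or fails here loses the work for good,
                    // even though the overload may be transient.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // 32 running + 100 queued = 132 accepted, so 8 of 140 are rejected.
        System.out.println(countRejections(32, 100, 140));
    }
}
```

In this sketch the catch block is where each call site listed above has to decide what a rejection means: failing the task, silently dropping a periodic job, or retrying later. Retrying with a delay on a different scheduler is one possible approach, not something this issue prescribes.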