Search Latency Tracking - Coordinator Slow Logs #9642
Labels
enhancement
Enhancement or improvement to existing feature or request
feature
New feature or request
Search
Search query, autocomplete ...etc
v2.12.0
Issues and PRs related to version 2.12.0
Is your feature request related to a problem? Please describe.
As of today, search request latencies are tracked only at the shard level, via node stats. After each query or fetch phase completes on a shard, we record the time taken, accumulate those values, and maintain an overall average that is exposed under stats.
However, we have no mechanism to track search latency at the coordinator node. The coordinator node plays an important role: it fans out requests to individual shards/data nodes, aggregates their responses, and eventually sends the response back to the client. We have seen multiple issues in the past where it was hard or impossible to reason about latency problems because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients of the search API can rely only on the overall took time (present in the search response), which offers little insight into the time taken by the different phases.
Parent RFC: #7334
Describe the solution you'd like
Slow logs at the coordinator level: As of now, slow logs can only be enabled at the shard level for the desired search phase (query and fetch). See this. Setting this threshold is tricky because customers usually see latency spikes at the request level. In addition, shard-level slow logs don't offer a holistic view. So as part of this, we will also add the capability to capture slow logs at the request level, along with the different search phases, from the coordinator node's perspective.
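To make the idea concrete, here is a minimal sketch (in Python, for brevity) of how a coordinator could time each phase of a request and emit one request-level slow log entry. It is not the proposed implementation: `CoordinatorRequestTimer`, `slow_log_line`, and the log line format are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager


class CoordinatorRequestTimer:
    """Hypothetical per-request timer kept on the coordinator node."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()      # request arrival on the coordinator
        self.phase_took_ms = {}    # phase name -> accumulated milliseconds

    @contextmanager
    def phase(self, name):
        """Time one search phase (for example "query" or "fetch")."""
        started = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - started) * 1000.0
            self.phase_took_ms[name] = self.phase_took_ms.get(name, 0.0) + elapsed_ms

    def total_took_ms(self):
        """Time since the request arrived, as seen by the coordinator."""
        return (self._clock() - self._start) * 1000.0


def slow_log_line(timer, threshold_ms):
    """Return a slow log line if the whole request met the threshold, else None."""
    took = timer.total_took_ms()
    if took < threshold_ms:
        return None
    phases = ", ".join(
        f"{name}={ms:.0f}ms" for name, ms in sorted(timer.phase_took_ms.items())
    )
    return f"took={took:.0f}ms phases[{phases}]"
```

A coordinator would wrap the fan-out of each phase in `with timer.phase("query"):` and call `slow_log_line` once the response is ready, so a single entry shows both the request-level took time and the per-phase breakdown.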
Additional context
Coordinator slow logs will be governed by cluster settings. We will offer thresholds for the following 3 intervals:
Query phase
Fetch phase
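As a rough illustration of how per-interval thresholds might be evaluated, the sketch below looks up a threshold for a given interval and returns the most severe level that was crossed. The setting keys are hypothetical (this issue does not name the actual cluster settings), and the warn/info/debug/trace levels are an assumption borrowed from the existing shard-level slow logs.

```python
# Hypothetical setting keys and values, for illustration only.
SETTINGS = {
    "slowlog.query.warn": "2s",
    "slowlog.query.info": "500ms",
    "slowlog.fetch.warn": "1s",
}

# Ordered from most to least severe (assumed to mirror shard-level slow logs).
LEVELS = ("warn", "info", "debug", "trace")


def parse_millis(value):
    """Convert a time value such as "500ms" or "2s" to milliseconds."""
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError(f"unsupported time value: {value!r}")


def slow_log_level(settings, interval, took_ms):
    """Return the most severe level whose threshold was met, or None."""
    for level in LEVELS:
        raw = settings.get(f"slowlog.{interval}.{level}")
        if raw is not None and took_ms >= parse_millis(raw):
            return level
    return None
```

With these example values, a query phase that took 700 ms on the coordinator would be logged at `info`, while a fetch phase of the same duration would not be logged at all because only a 1 s `warn` threshold is configured for it.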