Search Latency Tracking - Per Request Phase Took Time #9650

dzane17 · 2023-08-30T22:49:31Z

Is your feature request related to a problem? Please describe.
As of today, we track search request latencies at the shard level via node stats. After each query or fetch phase completes on a shard, we record the time taken, accumulate those values, and maintain an overall average, which is reported under stats.

However, we have no mechanism for tracking search latency at the coordinator node. The coordinator node plays an important role: it fans out requests to individual shards/data nodes, aggregates their responses, and eventually sends the response back to the client. We have seen multiple issues in the past where it was hard or impossible to reason about latency-related problems because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API can rely only on the overall took time (present as part of the search response), which offers little insight into the time taken by the different phases.

Parent RFC: #7334

Describe the solution you'd like
Per-request tracking: As part of this, we will offer a further breakdown of the existing took time in the search response. To do this, we will introduce a new field (phase_took) in the search response that gives clients more visibility into the time taken by each search phase (query, fetch, canMatch, etc.).

{
  "took" : 92,
  "phase_took" : {  // new field
    "dfs_prequery" : 0,
    "can_match" : 0,
    "query" : 66,
    "fetch" : 4,
    "expand_search" : 0
  },
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
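To illustrate how a client might consume the new field, here is a minimal Python sketch that parses the sample response above and reports the slowest phase and the portion of took not covered by any phase. The helper name summarize_phase_took and the "unattributed" label are hypothetical, and the sketch assumes phase_took values use the same unit (milliseconds) as took, which the proposal does not state explicitly.

```python
import json

# Sample response trimmed from the example above. The "// new field" note is
# dropped because JSON itself has no comment syntax.
SAMPLE_RESPONSE = """
{
  "took": 92,
  "phase_took": {
    "dfs_prequery": 0,
    "can_match": 0,
    "query": 66,
    "fetch": 4,
    "expand_search": 0
  },
  "timed_out": false
}
"""


def summarize_phase_took(response: dict) -> dict:
    """Hypothetical helper: summarize the per-phase breakdown of a search response."""
    # phase_took is absent when the feature is disabled (the proposed default).
    phases = response.get("phase_took", {})
    phase_total = sum(phases.values())
    slowest = max(phases, key=phases.get) if phases else None
    return {
        "slowest_phase": slowest,
        "phase_total_ms": phase_total,
        # Assumption: whatever is left of took was spent outside the listed phases.
        "unattributed_ms": response["took"] - phase_total,
    }


summary = summarize_phase_took(json.loads(SAMPLE_RESPONSE))
print(summary)
```

For the sample above the phases sum to 70, leaving 22 of the 92 took unattributed. That remainder should not be read as pure coordinator overhead, since the proposal does not define what falls outside the listed phases.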

Additional context
Request phase_took times will be disabled by default, since applications will not expect this new response field. Users can enable the feature via a query parameter or a cluster setting. This gives users the flexibility to set it at the cluster level while also turning it on or off as needed for individual requests.

// Query param
GET /_search?phase_took
GET /_search?phase_took=true
GET /_search?phase_took=false

// Cluster setting
"search.phase_took_enabled"
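As a rough sketch of the two switches, the Python below builds the request path for the per-request parameter and the JSON body for the cluster setting. It makes no network calls. The helper names are hypothetical, and it assumes the setting can be updated dynamically through the standard PUT /_cluster/settings endpoint under a "persistent" or "transient" scope, which this proposal does not confirm.

```python
import json
from urllib.parse import urlencode


def search_path(index=None, phase_took=None):
    """Hypothetical helper: build a _search path, optionally toggling phase_took."""
    base = f"/{index}/_search" if index else "/_search"
    if phase_took is None:
        # No parameter sent: the cluster-level setting decides.
        return base
    return f"{base}?{urlencode({'phase_took': str(phase_took).lower()})}"


def cluster_setting_body(enabled, scope="persistent"):
    """Hypothetical helper: body for PUT /_cluster/settings (assumed dynamic setting)."""
    return json.dumps({scope: {"search.phase_took_enabled": enabled}})


print(search_path(phase_took=True))
print(search_path("my-index", phase_took=False))
print(cluster_setting_body(True))
```

Leaving phase_took unset on a request keeps the cluster-level value in effect, matching the intent of setting a default for the cluster and turning it on or off per request.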

msfroh · 2023-09-06T16:23:34Z

@dzane17 -- How does this work with profiling enabled? Does profiling already provide this info (and more)?

Also, if we want to implement this, would it make sense to use the ext field that @austintlee added in #9379 ?

dzane17 · 2023-09-12T20:38:16Z

Hi @msfroh, there are a couple differences:

Shard vs. Coordinator - The Profile API operates at the shard level. The query, fetch, and aggregation stats are for a single nodeId/index/shardId (e.g. [2aE02wS1R8q_QFnYu6vDVQ][my-index-000001][0]). We are calculating took times from the coordinator, which provides a holistic view of each phase. This includes any additional time spent on the coordinator node, as mentioned in "Add coordination time to profiler" (#705).
Performance - The Profile API is extremely verbose and can cause a large performance regression, so it is not meant for everyday use. This feature will be lightweight enough for users to enable for all queries, or for as many as they desire.
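To make the coordinator-side idea concrete, here is a minimal Python sketch of timing each phase around the work the coordinator drives, rather than inside a single shard. It is an illustration only, not the OpenSearch implementation (which is in Java). The PhaseTimer class and its clock parameter are hypothetical names introduced for this example.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical coordinator-side timer: one wall-clock duration per phase."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock      # injectable so the timer can be tested deterministically
        self.phase_took = {}     # phase name -> elapsed milliseconds

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Integer nanoseconds avoid float rounding; truncate to whole milliseconds.
            elapsed_ms = (self._clock() - start) // 1_000_000
            self.phase_took[name] = self.phase_took.get(name, 0) + elapsed_ms


timer = PhaseTimer()
with timer.phase("query"):
    time.sleep(0.01)  # stand-in for fanning out to shards and waiting for all responses
with timer.phase("fetch"):
    pass              # stand-in for fetching the top hits
print(timer.phase_took)
```

Because each measurement wraps the whole phase as seen by the coordinator, it includes fan-out, waiting on the slowest shard, and result reduction. That is the difference from the per-shard numbers the Profile API reports.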

dzane17 added the enhancement and untriaged labels Aug 30, 2023

tlfeng added the Search label Sep 5, 2023

github-project-automation bot added this to Search Project Board Sep 5, 2023

github-project-automation bot moved this to 🆕 New in Search Project Board Sep 5, 2023

msfroh removed the untriaged label Sep 6, 2023

dzane17 mentioned this issue Sep 13, 2023

Request level latency tracking dzane17/OpenSearch#1

Closed

dzane17 mentioned this issue Oct 4, 2023

Request level latency tracking #10351

Merged

7 tasks

msfroh closed this as completed in #10351 Oct 13, 2023

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Oct 13, 2023

kkhatua added and removed the backport 2.x label Oct 17, 2023

dzane17 mentioned this issue Oct 30, 2023

phase_took documentation opensearch-project/documentation-website#5154

Merged

1 task
