Search Latency Tracking - Per Request Phase Took Time #9650

Closed
dzane17 opened this issue Aug 30, 2023 · 2 comments · Fixed by #10351
Labels
enhancement Enhancement or improvement to existing feature or request Search Search query, autocomplete ...etc

Comments

@dzane17
Contributor

dzane17 commented Aug 30, 2023

Is your feature request related to a problem? Please describe.
Today, we track search request latencies at the shard level via node stats. After each query or fetch phase completes on a shard, we record the time taken, accumulate those values, and maintain an overall average that is reported under stats.
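As a rough illustration of the shard-level averages described above, the sketch below derives an average per-phase latency from accumulated counters. It assumes the counters follow the `query_total` / `query_time_in_millis` naming used in the search section of node stats; the sample values are made up for illustration only.

```python
def average_phase_ms(search_stats: dict, phase: str) -> float:
    """Average time per operation for one phase, from accumulated counters."""
    total = search_stats[f"{phase}_total"]            # number of completed operations
    time_ms = search_stats[f"{phase}_time_in_millis"]  # accumulated time in ms
    return time_ms / total if total else 0.0


# Illustrative values only, not taken from a real cluster.
sample_stats = {
    "query_total": 4,
    "query_time_in_millis": 264,
    "fetch_total": 4,
    "fetch_time_in_millis": 16,
}

query_avg = average_phase_ms(sample_stats, "query")
fetch_avg = average_phase_ms(sample_stats, "fetch")
```

These averages are per shard and say nothing about time spent on the coordinator, which is the gap this request is about.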

But we don’t have a mechanism to track search latencies at the coordinator node. The coordinator node plays an important role: it fans out requests to individual shards/data nodes, aggregates their responses, and eventually sends the response back to the client. We have seen multiple issues in the past where it was hard or impossible to reason about latency problems because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API can rely only on the overall took time (present in the search response), which doesn’t offer much insight into the time taken by the different phases.

Parent RFC: #7334

Describe the solution you'd like
Per-request-level tracking: As part of this, we will offer a further breakdown of the existing took time in the search response. To do this, we will introduce a new field (phase_took) in the search response, which will give clients more visibility into the time taken by the different search phases (query, fetch, canMatch, etc.).

{
  "took" : 92,
  "phase_took" : {  // new field
    "dfs_prequery" : 0,
    "can_match" : 0,
    "query" : 66,
    "fetch" : 4,
    "expand_search" : 0
  },
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
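A client could consume the proposed field roughly as follows. This is a minimal sketch, assuming the response shape shown above; `phase_breakdown` and the `unattributed` key are hypothetical names, and treating `took` minus the sum of the phases as "unattributed" time is an interpretation, not something the proposal defines.

```python
import json


def phase_breakdown(response: dict) -> dict:
    """Split the overall took time into per-phase times and a remainder."""
    took = response["took"]
    # phase_took is absent when the feature is disabled, so default to empty.
    phases = response.get("phase_took", {})
    accounted = sum(phases.values())
    return {
        "took": took,
        "phases": dict(phases),
        "unattributed": took - accounted,  # time not assigned to any phase
    }


# Trimmed copy of the example response above.
sample = json.loads("""
{
  "took": 92,
  "phase_took": {
    "dfs_prequery": 0,
    "can_match": 0,
    "query": 66,
    "fetch": 4,
    "expand_search": 0
  },
  "timed_out": false
}
""")

result = phase_breakdown(sample)
```

For the example response, the phases account for 70 ms of the 92 ms took time, leaving 22 ms outside the listed phases.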

Additional context
Request phase_took times will be disabled by default, since applications will not expect this new response field. Users can enable the feature via a query parameter or a cluster setting. This gives users the flexibility to set it at the cluster level while also turning it on or off as needed for individual requests.

// Query param
GET /_search?phase_took
GET /_search?phase_took=true
GET /_search?phase_took=false

// Cluster setting
"search.phase_took_enabled"
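The interaction between the two switches could be sketched as below. This is only an illustration under stated assumptions: that a request-level `phase_took` parameter takes precedence over the cluster setting, and that a bare `?phase_took` is treated as true. The function name `phase_took_enabled` is hypothetical and is not the server implementation.

```python
from urllib.parse import parse_qs, urlsplit


def phase_took_enabled(url: str, cluster_setting: bool = False) -> bool:
    """Resolve whether phase_took should be returned for one request."""
    # keep_blank_values=True keeps a bare "?phase_took" as an empty string.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "phase_took" not in query:
        # No per-request override: fall back to search.phase_took_enabled.
        return cluster_setting
    value = query["phase_took"][-1].lower()
    # Assumption: a bare parameter means true; anything other than "true" means false.
    return value in ("", "true")
```

With the cluster setting left at its default (disabled), only requests that pass `phase_took` or `phase_took=true` would receive the breakdown; with the setting enabled, `phase_took=false` would suppress it for a single request.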
@dzane17 dzane17 added enhancement Enhancement or improvement to existing feature or request untriaged labels Aug 30, 2023
@tlfeng tlfeng added the Search Search query, autocomplete ...etc label Sep 5, 2023
@msfroh
Collaborator

msfroh commented Sep 6, 2023

@dzane17 -- How does this work with profiling enabled? Does profiling already provide this info (and more)?

Also, if we want to implement this, would it make sense to use the ext field that @austintlee added in #9379 ?

@msfroh msfroh removed the untriaged label Sep 6, 2023
@dzane17
Contributor Author

dzane17 commented Sep 12, 2023

Hi @msfroh, there are a couple of differences:

  1. Shard vs. Coordinator - The Profile API operates at the shard level. The query, fetch, and aggregation stats are for a single nodeId/index/shardId (e.g. [2aE02wS1R8q_QFnYu6vDVQ][my-index-000001][0]). We calculate took times from the coordinator, which provides a holistic view of each phase. This includes any additional time spent on the coordinator node, as mentioned in "Add coordination time to profiler" (#705).
  2. Performance - The Profile API is extremely verbose and can cause a large performance regression, so it is not meant for everyday use. This feature will be lightweight enough for users to enable for all queries, or for as many as they desire.
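To illustrate why coordinator-side timing can stay lightweight, here is a conceptual sketch in Python (the actual implementation would live in the Java codebase and is not shown in this thread). `PhaseTookTracker` and its methods are hypothetical names; the point is only that each phase costs two clock reads and one dictionary update.

```python
import time
from contextlib import contextmanager


class PhaseTookTracker:
    """Accumulates wall-clock time per search phase on the coordinator."""

    def __init__(self, clock=time.monotonic_ns):
        self._clock = clock   # injectable so the tracker can be tested
        self._took_ns = {}    # phase name -> accumulated nanoseconds

    @contextmanager
    def phase(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self._took_ns[name] = self._took_ns.get(name, 0) + elapsed

    def phase_took(self) -> dict:
        # Report whole milliseconds, matching the units of the took field.
        return {name: ns // 1_000_000 for name, ns in self._took_ns.items()}


# Usage with a fake clock that replays fixed timestamps (in nanoseconds).
ticks = iter([0, 66_000_000, 66_000_000, 70_000_000])
tracker = PhaseTookTracker(clock=lambda: next(ticks))
with tracker.phase("query"):
    pass  # fan out the query phase to shards and wait for responses
with tracker.phase("fetch"):
    pass  # fetch the documents for the top hits
```

Because the timer wraps the whole phase as seen from the coordinator, it captures fan-out and response-merging time that a per-shard profile would not.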

4 participants