[RFC] Tracking Search latency and stats at Coordinator node level #7334
Comments
@sgup432 thanks for the RFC. I think there is a large overlap with an existing issue #705 to extend the query profiling information with coordinator-level timings. I honestly believe that such details should be captured on demand (with profile = true), otherwise there is a potential to introduce significant overhead (in terms of the different listeners / callbacks / trackers being created and wrapped) on each and every request, even if that is not needed by the end user. That may not be a problem for the majority of queries (or searches, to generalize), but some might be impacted.
@reta Thanks for the comments. The profiler is much more exhaustive and relatively expensive. Also, as I see it, it is only used to debug certain queries or reproduce issues that happened in the past. Meanwhile, we have seen multiple cases where it is impossible to debug latency-related issues (which are hard to reproduce) at a particular point in the past, as we don't have much insight into coordinator-level latencies/stats. And the current took time also doesn't provide much insight to OpenSearch clients.
@sgup432 sure, fair enough. Let me ask this question though: let's say you added these.
How is it helpful? I mean, the fetch phase took more time than the query phase, but why? Where do we go next? With profile, we could and should provide significantly more detail, not only at the phase level but at the phase level on each shard, plus the coordinator; at least it will be clear where the potential problem is. This is just my opinion, let's see what other folks think.
Good point, @reta. What we often come across are cases where an application might deploy a new set of queries in addition to an existing set, and begin to observe end-to-end latency regressions. There is also a value-add in the end user being able to track latencies as percentiles, assuming their workloads are independently consuming the OpenSearch resource.
@reta In addition to took time, it is important to add coordinator/request-level stats as part of node stats (as mentioned in the proposal). As of now, the only way to build metrics/dashboards around search latency is at a shard level (by using the node stats API and doing some calculation on top of the shard-level counters). Along with that, another important aspect is slow logs at a request level. Consider the below scenario:
@sgup432 sure, I am all in for more useful stats (that would be a separate API(s) though); for search in particular, we do have more powerful means to collect them using profiling.
@reta Yes, we do have profiling. But as mentioned, it doesn't provide a historical view of latency data at either the shard or the coordinator level. Node stats does that, but it lacks coordinator-level data.
Closing this RFC as PRs are already merged!
Co-Author: @jainankitk
Problem Statement
As of today, we track search request latencies at a shard level via node stats. After every query/fetch phase is completed on a shard, we record the time taken for each, keep accumulating those values, and maintain an overall average that is tracked under stats.
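For illustration, here is a minimal Java sketch (not the actual OpenSearch implementation) of how per-shard phase timings can be accumulated into the running counters and averages that node stats expose:

```java
// Illustrative sketch of shard-level search stats accumulation.
import java.util.concurrent.atomic.LongAdder;

class ShardSearchStatsSketch {
    private final LongAdder queryCount = new LongAdder();
    private final LongAdder queryTimeNanos = new LongAdder();
    private final LongAdder fetchCount = new LongAdder();
    private final LongAdder fetchTimeNanos = new LongAdder();

    // Called when a query phase finishes on this shard.
    void onQueryPhaseEnd(long tookNanos) {
        queryCount.increment();
        queryTimeNanos.add(tookNanos);
    }

    // Called when a fetch phase finishes on this shard.
    void onFetchPhaseEnd(long tookNanos) {
        fetchCount.increment();
        fetchTimeNanos.add(tookNanos);
    }

    // Average query latency in millis, as it would be surfaced via stats.
    double avgQueryTimeMillis() {
        long count = queryCount.sum();
        return count == 0 ? 0 : (queryTimeNanos.sum() / 1_000_000.0) / count;
    }
}
```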
But we don't have a mechanism to track search latencies at the coordinator node. The coordinator node plays an important role in fanning out requests to individual shards/data nodes, aggregating those responses, and eventually sending the response back to the client. We have seen multiple issues in the past where it becomes hard or impossible to reason about latency-related issues because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API rely only on the overall took time (present as part of the search response), which doesn't offer much insight into the time taken by different phases.
Solution
We want to introduce the below high-level changes to give us more insight into coordinator-level latencies.
As part of this, we will also add the capability to capture slow logs at a request level, along with the different search phases, from the coordinator node's perspective.
We plan to use the new prefix index.search.request to achieve this.
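As an illustration only, the following is a hedged sketch of what request-level slow log thresholds under such a prefix could look like; the exact key names, defaults, and severity levels are assumptions for this sketch, not the merged settings:

```java
// Hypothetical request-level slow log threshold keys under the proposed
// "index.search.request" prefix; names and defaults are illustrative.
import java.util.Map;

class RequestSlowLogSettingsSketch {
    // Thresholds per severity, in milliseconds; -1 disables a level.
    static final Map<String, Long> DEFAULT_THRESHOLDS_MS = Map.of(
        "index.search.request.slowlog.threshold.warn",  10_000L,
        "index.search.request.slowlog.threshold.info",   5_000L,
        "index.search.request.slowlog.threshold.debug",  2_000L,
        "index.search.request.slowlog.threshold.trace",    500L
    );

    // Decide whether a completed request (total coordinator took time)
    // should be logged at the level associated with the given key.
    static boolean shouldLog(String thresholdKey, long tookMillis) {
        long threshold = DEFAULT_THRESHOLDS_MS.getOrDefault(thresholdKey, -1L);
        return threshold >= 0 && tookMillis >= threshold;
    }
}
```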
Approach
We want to design this so that it is simple, extensible, and has negligible performance impact.
Listener based approach
We can utilize a listener-based approach, which is quite similar to the way the current shard-level stats and slow log work. Via this mechanism we can solve all three problems mentioned above.
Here, at the start and end of every phase at the coordinator node level, we will invoke a listener that helps us keep track of the took time of individual search phases (query/fetch/canMatch).
High level Listener interface
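A minimal, illustrative Java sketch of what such a coordinator-level listener could look like; the method names and signatures here are assumptions, not necessarily the merged interface:

```java
// Illustrative coordinator-level search request listener.
import java.util.concurrent.TimeUnit;

interface SearchRequestOperationsListenerSketch {

    // Invoked when a search phase (e.g. can_match, dfs_query, query, fetch)
    // starts on the coordinator node.
    void onPhaseStart(String phaseName);

    // Invoked when that phase completes successfully; tookNanos is the
    // wall-clock time spent in the phase on the coordinator.
    void onPhaseEnd(String phaseName, long tookNanos);

    // Invoked when the phase fails, so failure counts can be tracked too.
    void onPhaseFailure(String phaseName, Exception cause);

    // Convenience conversion for consumers that report in millis.
    default long toMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }
}
```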
The above listener will be invoked from AbstractSearchAsyncAction, which serves as the entry point for a search request landing on the coordinator node. This action has a mechanism to keep track of when any phase starts or ends. See this and this.
The below components will subscribe to the above listener to achieve our goal.
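As an illustration of this subscription model, here is a hedged sketch of a composite listener that fans out phase events to multiple consumers (for example, coordinator-level stats and request slow logs), reusing the illustrative interface sketched above; the class name and error handling are assumptions:

```java
// Illustrative fan-out of phase events to multiple subscribers.
import java.util.List;

class CompositeListenerSketch implements SearchRequestOperationsListenerSketch {
    private final List<SearchRequestOperationsListenerSketch> listeners;

    CompositeListenerSketch(List<SearchRequestOperationsListenerSketch> listeners) {
        this.listeners = List.copyOf(listeners);
    }

    @Override
    public void onPhaseStart(String phaseName) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseStart(phaseName);
            } catch (Exception e) {
                // A misbehaving subscriber must not fail the search request.
            }
        }
    }

    @Override
    public void onPhaseEnd(String phaseName, long tookNanos) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseEnd(phaseName, tookNanos);
            } catch (Exception e) {
                // Swallow listener errors; timing is best-effort.
            }
        }
    }

    @Override
    public void onPhaseFailure(String phaseName, Exception cause) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseFailure(phaseName, cause);
            } catch (Exception e) {
                // Swallow listener errors; failure tracking is best-effort.
            }
        }
    }
}
```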
Scenarios to cover
Search type: We will handle both the queryThenFetch and DFS search types as part of this.
SearchPhase: We will cover all possible search phases that exist as of today. Basically, every search phase extends the base class SearchPhase. This includes:
MultiSearch - A multi-search response contains the individual search request responses. We will accordingly provide a phase-wise breakdown for each.
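As a rough illustration of the phase-wise breakdown, here is a small sketch of a per-request holder that accumulates coordinator took time per phase name; the holder and the shape of its output are illustrative, not the actual response format:

```java
// Illustrative per-request accumulation of coordinator phase timings, so a
// multi-search response could expose a breakdown per sub-request.
import java.util.LinkedHashMap;
import java.util.Map;

class PhaseTookSketch {
    private final Map<String, Long> phaseTookMillis = new LinkedHashMap<>();

    // Record the time spent in one phase of this request.
    void record(String phaseName, long tookMillis) {
        phaseTookMillis.merge(phaseName, tookMillis, Long::sum);
    }

    // e.g. {can_match=3, query=42, fetch=11} for one sub-request.
    Map<String, Long> breakdown() {
        return Map.copyOf(phaseTookMillis);
    }
}
```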