[RFC] Tracking Search latency and stats at Coordinator node level #7334
Comments
@sgup432 thanks for the RFC. I think there is a large overlap with an existing issue #705 to extend the query profiling information with coordinator-level timings. I honestly believe that such details should be captured on demand (with profile = true), otherwise there is a potential to introduce significant overhead (in terms of the different listeners / callbacks / trackers being created and wrapped) on each and every request, even if that is not needed by the end user. That may not be a problem for the majority of queries (or searches, to generalize), but some might be impacted.
@reta Thanks for the comments. The profiler is much more exhaustive and relatively expensive. Also, as I see it, it is only used to debug certain queries or reproduce issues that happened in the past. Meanwhile, we have seen multiple cases where it is impossible to debug latency-related issues (which are hard to reproduce) at a particular point in the past, as we don't have much insight into coordinator-level latencies/stats. And the current took time also doesn't provide much insight to OpenSearch clients.
@sgup432 sure, fair enough. Let me ask this question though: let's say you added these.
How is it helpful? I mean, the fetch phase took more time than the query phase, but why? Where do we go next? With profile, we could and should provide significantly more detail, not only at the phase level but at the phase level on each shard, plus the coordinator; at least it will be clear where the potential problem is. This is just my opinion, let's see what other folks think.
Good point, @reta. What we often come across are cases where an application might deploy a new set of queries in addition to an existing set, and begin to observe end-to-end latency regressions. There is also a value-add in the end user being able to track latencies as percentiles, assuming their workloads are independently consuming the OpenSearch resource.
@reta In addition to took time, it is important to add coordinator/request-level stats as part of node stats (as mentioned in the proposal). As of now, the only way to build metrics/dashboards around search latency is at a shard level (by using the node stats API and doing some calculation on top of the shard-level counters). Along with that, another important aspect is slow logs at a request level. Consider the below scenario:
@sgup432 sure, I am all in for more useful stats (that would be a separate API(s) though); for search in particular, we do have more powerful means to collect them using profiling.
@reta Yes, we do have profiling. But as mentioned, it doesn't provide a historical view of latency data at either the shard or the coordinator level. Node stats does that, but it lacks coordinator-level data.
Closing this RFC as PRs are already merged!
Co-Author: @jainankitk
Problem Statement
As of today, we track search request latencies at a shard level via node stats. After every query/fetch phase is completed on a shard, we record the time taken for each, keep accumulating those values, and maintain an overall average that is tracked under stats.
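For illustration, here is a minimal Java sketch (not the actual OpenSearch implementation) of how per-shard phase timings can be accumulated into the running counters and averages that node stats expose:

```java
// Illustrative sketch of shard-level search stats accumulation.
import java.util.concurrent.atomic.LongAdder;

class ShardSearchStatsSketch {
    private final LongAdder queryCount = new LongAdder();
    private final LongAdder queryTimeNanos = new LongAdder();
    private final LongAdder fetchCount = new LongAdder();
    private final LongAdder fetchTimeNanos = new LongAdder();

    // Called when a query phase finishes on this shard.
    void onQueryPhaseEnd(long tookNanos) {
        queryCount.increment();
        queryTimeNanos.add(tookNanos);
    }

    // Called when a fetch phase finishes on this shard.
    void onFetchPhaseEnd(long tookNanos) {
        fetchCount.increment();
        fetchTimeNanos.add(tookNanos);
    }

    // Average query latency in millis, as it would be surfaced via stats.
    double avgQueryTimeMillis() {
        long count = queryCount.sum();
        return count == 0 ? 0 : (queryTimeNanos.sum() / 1_000_000.0) / count;
    }
}
```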
But we don't have a mechanism to track search latencies at the coordinator node. The coordinator node plays an important role in fanning out requests to individual shards/data nodes, aggregating those responses, and eventually sending the response back to the client. We have seen multiple issues in the past where it becomes hard or impossible to reason about latency-related issues because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API rely only on the overall took time (present as part of the search response), which doesn't offer much insight into the time taken by different phases.
Solution
We want to introduce the below high-level changes to give us more insight into coordinator-level latencies.
As part of this, we will also add the capability to capture slow logs at a request level, along with the different search phases, from the coordinator node's perspective.
We plan to use the new prefix index.search.request to achieve this.
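As an illustration only, the following is a hedged sketch of what request-level slow log thresholds under such a prefix could look like; the exact key names, defaults, and severity levels are assumptions for this sketch, not the merged settings:

```java
// Hypothetical request-level slow log threshold keys under the proposed
// "index.search.request" prefix; names and defaults are illustrative.
import java.util.Map;

class RequestSlowLogSettingsSketch {
    // Thresholds per severity, in milliseconds; -1 disables a level.
    static final Map<String, Long> DEFAULT_THRESHOLDS_MS = Map.of(
        "index.search.request.slowlog.threshold.warn",  10_000L,
        "index.search.request.slowlog.threshold.info",   5_000L,
        "index.search.request.slowlog.threshold.debug",  2_000L,
        "index.search.request.slowlog.threshold.trace",    500L
    );

    // Decide whether a completed request (total coordinator took time)
    // should be logged at the level associated with the given key.
    static boolean shouldLog(String thresholdKey, long tookMillis) {
        long threshold = DEFAULT_THRESHOLDS_MS.getOrDefault(thresholdKey, -1L);
        return threshold >= 0 && tookMillis >= threshold;
    }
}
```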
Approach
We want to design this so that it is simple, extensible, and has negligible performance impact.
Listener based approach
We can utilize a listener-based approach, which is quite similar to the way the current shard-level stats and slow log work. Via this mechanism we can solve all three problems mentioned above.
Here, at the start and end of every phase at the coordinator node level, we will invoke a listener that helps us keep track of the took time of individual search phases (query/fetch/canMatch).
High level Listener interface
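A minimal, illustrative Java sketch of what such a coordinator-level listener could look like; the method names and signatures here are assumptions, not necessarily the merged interface:

```java
// Illustrative coordinator-level search request listener.
import java.util.concurrent.TimeUnit;

interface SearchRequestOperationsListenerSketch {

    // Invoked when a search phase (e.g. can_match, dfs_query, query, fetch)
    // starts on the coordinator node.
    void onPhaseStart(String phaseName);

    // Invoked when that phase completes successfully; tookNanos is the
    // wall-clock time spent in the phase on the coordinator.
    void onPhaseEnd(String phaseName, long tookNanos);

    // Invoked when the phase fails, so failure counts can be tracked too.
    void onPhaseFailure(String phaseName, Exception cause);

    // Convenience conversion for consumers that report in millis.
    default long toMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }
}
```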
The above listener will be invoked from AbstractSearchAsyncAction, which serves as the entry point for a search request landing on the coordinator node. This action has a mechanism to keep track of when any phase starts or ends. See this and this.
The below components will subscribe to the above listener to achieve our goal.
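As an illustration of this subscription model, here is a hedged sketch of a composite listener that fans out phase events to multiple consumers (for example, coordinator-level stats and request slow logs), reusing the illustrative interface sketched above; the class name and error handling are assumptions:

```java
// Illustrative fan-out of phase events to multiple subscribers.
import java.util.List;

class CompositeListenerSketch implements SearchRequestOperationsListenerSketch {
    private final List<SearchRequestOperationsListenerSketch> listeners;

    CompositeListenerSketch(List<SearchRequestOperationsListenerSketch> listeners) {
        this.listeners = List.copyOf(listeners);
    }

    @Override
    public void onPhaseStart(String phaseName) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseStart(phaseName);
            } catch (Exception e) {
                // A misbehaving subscriber must not fail the search request.
            }
        }
    }

    @Override
    public void onPhaseEnd(String phaseName, long tookNanos) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseEnd(phaseName, tookNanos);
            } catch (Exception e) {
                // Swallow listener errors; timing is best-effort.
            }
        }
    }

    @Override
    public void onPhaseFailure(String phaseName, Exception cause) {
        for (SearchRequestOperationsListenerSketch l : listeners) {
            try {
                l.onPhaseFailure(phaseName, cause);
            } catch (Exception e) {
                // Swallow listener errors; failure tracking is best-effort.
            }
        }
    }
}
```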
Scenarios to cover
Search type: We will handle both the queryThenFetch and DFS search types as part of this.
SearchPhase: We will cover all possible search phases that exist as of today. Basically, every search phase extends the base class SearchPhase. This includes:
MultiSearch - A multi-search response contains the individual search request responses. We will accordingly provide a phase-wise breakdown for each.
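As a rough illustration of the phase-wise breakdown, here is a small sketch of a per-request holder that accumulates coordinator took time per phase name; the holder and the shape of its output are illustrative, not the actual response format:

```java
// Illustrative per-request accumulation of coordinator phase timings, so a
// multi-search response could expose a breakdown per sub-request.
import java.util.LinkedHashMap;
import java.util.Map;

class PhaseTookSketch {
    private final Map<String, Long> phaseTookMillis = new LinkedHashMap<>();

    // Record the time spent in one phase of this request.
    void record(String phaseName, long tookMillis) {
        phaseTookMillis.merge(phaseName, tookMillis, Long::sum);
    }

    // e.g. {can_match=3, query=42, fetch=11} for one sub-request.
    Map<String, Long> breakdown() {
        return Map.copyOf(phaseTookMillis);
    }
}
```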