Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Backpressure in Search Path #1329

Open
tushar-kharbanda72 opened this issue Oct 4, 2021 · 22 comments
Open

[RFC] Backpressure in Search Path #1329

tushar-kharbanda72 opened this issue Oct 4, 2021 · 22 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Roadmap:Stability/Availability/Resiliency Project-wide roadmap label v2.8.0 'Issues and PRs related to version v2.8.0'

Comments

@tushar-kharbanda72
Copy link
Contributor

tushar-kharbanda72 commented Oct 4, 2021

Introduction

BackPressure in Search Path aims to enhance the overall resiliency of OpenSearch. The current protection mechanisms on OS nodes such as ThreadPoolQueueSize and CircuitBreakers are not fully efficient to protect the cluster against traffic surge, partial failures, slow node or a single rogue (resource-guzzling) query. Search BackPressure aims to introduce constructs to have fair rejections, minimal wastage of useful work done, search request cost estimation and ability to stabilise the cluster when under duress.

Problem Statement

In OpenSearch, for Search requests there are few gating mechanisms to prevent a node from meltdown when under a duress situation. These are essentially queue rejections and circuit breakers. However, these gating mechanism are all static in limits, take local decisions, and are often too late to act upon due to configuration issues.

Problem with existing constructs

  • ThreadPool queue size: OpenSearch users today can provide a node level settings to fix the number of tasks that can be queued up for search in the Search ThreadPool queue. With this we see 2 major problems:
    • Queue size doesn’t define actual work to be done by the tasks it contains. A query can have a very high cardinality and might consume a lot of system resources. While other times query could be related to smaller indices, touching very few shards and relatively less data. Query type also plays a huge role.
    • Tuning the queue size based on workload adds a major management overhead. It is also hard to keep it consistent with the change in traffic. Since every customer has different query patterns, coming up with defaults which fits for all is not possible.
  • Circuit breakers: Although we have circuit breakers which track real time memory resource utilisation of requests, they act at a blanket level. It provides mechanism to cut off requests if the defined thresholds are breached. However this lacks fairness, as decision is limited to local node performance, and they lack view on the hot shards or request types as well. Also, Circuit breakers lack prioritizations, as they are unable to expedite requests which are more likely to be completed soon. For example, requests in the fetch phase have more chances of finishing early compared to query phase. Arriving at the right thresholds is again a management overhead for users.

Other problems which OS users can run into

  • Often times, even a Single Rogue query can adversely impact data nodes and can create cascading affect of node failures. There is a need to proactively identify, block or terminate such queries, ensuring other requests could still continue, preventing any further duress on the data nodes.
  • During the distributed execution of search requests, a single slow path can impact the overall service quality and availability of the cluster. Coordinator fans out the request to multiple data nodes and if there’s a degraded node in the path, coordinator keeps on waiting and even attempts retries, while even accepting new requests in the meantime. This creates a build up on the coordinator, just because the rate at which Coordinator accepts request is higher compared to the rate at which it is able to complete the request. There is a need to identify such degraded paths, to proactively short circuit or fail fast requests and prevent cascading node failures. These issues sometimes doesn’t get detected with the ThreadPool queue size comparison in ARS just because the the cost of each task in the queue can significantly vary across different nodes and can end up overwhelming a node which is already under duress.

Proposed Solution

Summary

We propose to introduce BackPressure framework in the search path to address the above concerns and improve the overall resiliency of the OpenSearch cluster. For this we’ll introduce resource tracking framework for end to end memory/cpu tracking of Search request, covering different phases of execution across all shard requests. Adaptive replica selection will consider these resource metrics from different nodes, while ranking the nodes to reduce requests on target nodes already running into resource contention issues. Framework will also use the same resource utilisation metrics to take decisions whether to reject or cancel an in-flight request, with added fairness. Task level prioritizations based on current state of request can be achieved, such as preferring requests already in fetch phase over those yet to be picked for query phase. This will help stabilising the node under duress by completing requests faster which has some useful work already done. In addition to this, framework would have a Search request cost-estimation model to support a pro-active form of backpressure, which takes decision on incoming requests/tasks, if a node will be able to serve the query of certain cost or not. We have divided this proposal into 3 milestones as follows:

Milestone 1

Goals

  • Aims to achieve visibility into resource utilisation of Search requests (Both at Coordinator and Data node)
  • Fan-out the search request to nodes having higher chances of completing the request

1.1 Resource Tracking Framework

Build a resource tracking framework for Search requests at both Rest level (Client requests) and Transport level (Shard search requests), which tracks resource consumption on OpenSearch nodes, for various search operations at different level of granularity:

  • Individual Search Request (Rest) - On the Coordinator Node across phases (such as query and fetch phase) for end to end resource tracking from coordinator perspective.

  • Shard Search Requests (Transport) - On the Data Node per phase, for discrete search task tracking.

  • Shard level Aggregated View - Total consumption of resources mapped to every shard for searches on the node.

  • Node level Aggregated View - Total consumption of resources for all search request on the node.

Characteristics:

  • Resources to track for each of the above proposed metrics:
    • Memory Consumption (JVM)
    • CPU Consumption (CPU Time)
  • Frequent checkpointing and footprint update during the progress of a search phase : The resource tracking should continuously be done as the search request progresses, and need not necessarily wait for a particular phase to complete. This is important as a single phase execution can consume significant amount of resources itself. So, tracking resources within a phase itself as it progresses becomes important for these metrics to represent the accurate state on the node.
  • Data Node feedback to build Coordinator state : Have a capability for data node to piggyback on the response and send its current node utilisation state due to search requests to coordinator.

1.2 Adaptive replica Selection - Factor in node resource utilisation

Currently, for Adaptive replica selection the tasks in search thread-pool queue are factored in while coming up with the ranking for the target shards. Due to the fact that cost of each search request can vary significantly the queue count doesn’t accurately tells what more resources the node have to accept and complete more search requests. To be more accurate on this we can factor in the resource utilisation for search requests on these nodes, and take better routing decisions on the coordinator.

  • This will help avoiding the shards under stress giving them time to recover
  • Increase the chances of request landing on the node which can complete it
  • Queueing and Fast-Fail decision can be taken on the Coordinator node itself if required.

Milestone 2

Goals

  • Improved fairness in Search request rejections
  • Reducing chances of node getting overwhelmed due to Search request load
  • Ability to stabilise overloaded nodes by identifying and cancelling resource guzzling queries.
  • Achieving resiliency with reduced dependency on Circuit breaker and Threadpool queue configurations as the accuracy of rejections due to these depends on user input.

2.1 Server Side rejection of in-coming search requests

Currently, Search rejections are solely based on the number of tasks in queue for Search ThreadPool. That doesn’t provide fairness in rejections as multiple smaller queries can exhaust this limit but are not resource intensive and the node can take in much more requests and vice versa. Essentially count is not the reflection of actual work.

Hence based on metrics in point 1.1 above, we want to build a frame which can perform more informed rejections based on point in time resource utilisation. The new model will take the admission decision for search requests on the node. These admission decisions or rejection limits can have different levels to it:

  • Level 1: At this point system has detected overload due to search requests and it’ll prioritise which requests to accept. Example: It’ll accept fetch requests over Query requests as for Fetch phase we have already done some work to reach at this point whereas Query is going to be more resource intensive and have least wastage of work if rejected. Similar logic can be applied for Force search requests as well.
  • Level 2: At this point we’ll start rejecting all search requests beyond capacity to prevent any impact on the availability of node.

This can be further evolved to support Shard level priority model, where user can set priority on an index or every request, so that framework can consume them for taking admission/rejection decisions.

If user has configured partial results to be true, then upon these rejections and Coordinator’s inability to retry the request on another shard on a different node might result in user’s getting partial response.

The above will provide the required isolation of accounting and fairness in the rejections which is currently not there. This is still a reactive back-pressure mechanism as it only focusses on the current consumption and does not estimate the future work which is to be done for these search requests.

2.2 Server side Cancellation of in-flight search requests based on resource consumption

This is the 3rd level which kicks in after we’re cancelling all search request coming to a node. Here, we take decision to
cancel on-going requests, If the resource limits for that shard/node have started breaching the assigned limits (point 2.1), and there is no recovery seen for a certain time threshold. The BackPressure model should support identification of queries which are most resource guzzling with minimal wasteful work. These can then be cancelled for recovering a node under load and continue doing useful work.

Milestone 3

Goals

  • Have Search request cost estimation capability
  • Have improved capability to take request admission decision based on the request cost and available resources (Think Token buckets). (Refer Section 3.2 C3 paper )

3.1 Query resource consumption estimation

Improve the framework added as part of (point 2) to also estimate the cost of new search queries, based on their potential resource utilisation. It can be achieved by looking at the past query patterns, with similar query constructs and the actual data on shard. This will help in building a pro-active back-pressure model for search requests, where estimates will be compared against the available resources during the admission decision for granular control.

@tushar-kharbanda72 tushar-kharbanda72 added the enhancement Enhancement or improvement to existing feature or request label Oct 4, 2021
@reta
Copy link
Collaborator

reta commented Oct 6, 2021

Thanks for bringing this up, @tushar-kharbanda72 , there are not doubts that applying back pressure on the search flows would lead to more stable clusters. There are quite a few successful implementations with respect to the proposal, fe Akka's Adaptive Load Balancing (https://doc.akka.io/docs/akka/current/cluster-metrics.html#adaptive-load-balancing), which to some extent addresses the same problem (could be good case study).

For this we’ll introduce resource tracking framework for end to end memory/cpu tracking of Search request, covering different phases of execution across all shard requests.

I assume the only way to claim that node uses all resources only for search is when it has a single data role (or alike). In many deployments this is not the case, and the same nodes may be involved in ingestion or/and cluster coordination, etc. How would the backpressure / resource tracking framework deal with that?

Also, should the resource tracking framework consult the overall node load (memory/cpu at least)? The co-located deployments sadly are not that rare :(

Task level prioritizations based on current state of request can be achieved, such as preferring requests already in fetch phase over those yet to be picked for query phase.

I think in general it makes a lot of sense. But here is the counter example (seen in the wild): the query phase take a very long time, following the fetch phase. At that moment, the client has already lost any hope to get the response (may be even gone), but the server is still working on it, the live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weight its completion relevance to the client?

Improve the framework added as part of (point 2) to also estimate the cost of new search queries, based on their potential resource utilisation.

The approach(es) you have mentioned make sense. Probably it would be good to invest into full-fledges query planner which will assign costs to each query element and overall query at the end? May not be easy, for sure, but the iterative approach could be taken to refine it along the way.

Thank you.

@asafm
Copy link

asafm commented Jan 25, 2022

I think in general it makes a lot of sense. But here is the counter example (seen in the wild): the query phase take a very long time, following the fetch phase. At that moment, the client has already lost any hope to get the response (may be even gone), but the server is still working on it, the live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weight its completion relevance to the client?

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.

@reta
Copy link
Collaborator

reta commented Jan 25, 2022

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.
@asafm I think something along these lines, for example interpreting client disconnects as tasks cancellation and propagating that across the nodes (since coordinator is aware of client state and tasks).

@dblock
Copy link
Member

dblock commented Feb 2, 2022

This is a good proposal 👍. Resource consumption is great, but it's a lot of work, so I would in parallel consider simpler approaches.

  • hard limit on number of items in the search queue (exists today) cancels existing queued items (if queue is full, flush it, cancel existing requests, cool down)
  • general resource consumption (e.g. cancel all existing operations and cool down if overall CPU is too high over a window)
  • evaluating queries and measuring their cost (heavy queries or queries known to have high execution cost get rejected without being executed)
  • time in queue over a window (new? overall search duration is increasing)

@sruti1312
Copy link
Contributor

Hi @CEHENKLE
This issue requires active collaboration from multiple developers and performance testing before merging to main branch. Can you help with creating a feature branch for this feature/task-resource-tracking?
Related issue:
#1009

@CEHENKLE
Copy link
Member

Done! https://github.com/opensearch-project/OpenSearch/tree/feature/task-resource-tracking

@tushar-kharbanda72
Copy link
Contributor Author

tushar-kharbanda72 commented Feb 18, 2022

Thanks @dblock for going through the proposal. We have the PRs related to task resource tracking in progress and that should complete soon. Targeting march for that.

Soon we'll fast follow that up with initial low effort rejection strategies which you and others mentioned which we'll include as part of Milestone 2.

Milestone 3 is a lot of work and we'll have to evaluate the need and criticality after we have initial improvements merged in.

@getsaurabh02
Copy link
Member

Adding Meta Issue and the child issues opened previously to this RFC in order to track deliverables:
#1042

@anasalkouz
Copy link
Member

@tushar-kharbanda72 @getsaurabh02 since you have META issue to track the progress of this initiative. Could you close the RFC? and what is the target release for this project.

@elfisher
Copy link

Is this still on track for 2.2?

@rramachand21
Copy link
Member

This (milestone 2) will come in 2.3 - we are merging in the changes for resource tracking framework in 2.2 (milestone 1)

@elfisher
Copy link

Thanks @rramachand21!

@kartg kartg added v2.3.0 'Issues and PRs related to version v2.3.0' and removed v2.2.0 labels Aug 2, 2022
@elfisher
Copy link

@rramachand21 is there an issue cut for milestone 2? I'd like to sync it up with the related documentation issue.

@JeffHuss
Copy link

@rramachand21 please provide a feature document for this so we can get it documented if it's still on track for 2.3. The docs issue is opensearch-project/documentation-website#795

Currently I'm blocked.

@elfisher elfisher added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 'Issues and PRs related to version v2.3.0' labels Sep 7, 2022
@anasalkouz
Copy link
Member

@tushar-kharbanda72 is this still on track for 2.4 release? code freeze on 11/3
Is there anything pending? otherwise, feel free to close it.

@anasalkouz
Copy link
Member

@rramachand21 is this on track for 2.4 release? code freeze today

@minalsha
Copy link
Contributor

@rramachand21 Is there anything pending on this? If not, can we close this issue since 2.4 is already in production and update with latest version that you are tracking?

@rramachand21 rramachand21 added v2.6.0 'Issues and PRs related to version v2.6.0' and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Jan 17, 2023
@kartg
Copy link
Member

kartg commented Feb 21, 2023

@rramachand21 is this on track for 2.6.0 release? The code freeze for this is today (Feb 21, 2023)

@anasalkouz
Copy link
Member

Hey @tushar-kharbanda72, is this feature still on track for 2.7? The code freeze date is April 17, 2023.

@rramachand21
Copy link
Member

@ketanv3 could you please update the latest on this?

@DarshitChanpura
Copy link
Member

Hi @tushar-kharbanda72 , This issue will be marked for next-release v2.8.0 on (Apr 17) as that is the code-freeze date for v2.7.0. Please let me know if otherwise.

@DarshitChanpura
Copy link
Member

Tagging it to next release: v2.8.0

@DarshitChanpura DarshitChanpura added v2.8.0 'Issues and PRs related to version v2.8.0' and removed v2.7.0 labels Apr 17, 2023
@andrross andrross added Roadmap:Stability/Availability/Resiliency Project-wide roadmap label and removed roadmap labels May 10, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Roadmap:Stability/Availability/Resiliency Project-wide roadmap label v2.8.0 'Issues and PRs related to version v2.8.0'
Projects
Status: New
Development

No branches or pull requests