[Doc] Server-side rejection of search requests based on resource #795

Closed
hdhalter opened this issue Jul 11, 2022 · 11 comments · Fixed by #1790
Labels
2 - In progress Issue/PR: The issue or PR is in progress. v2.4.0 'Issues and PRs related to version v2.4.0'

Comments

@hdhalter
Contributor

hdhalter commented Jul 11, 2022

RFC: opensearch-project/OpenSearch#1329
Tracking issue: opensearch-project/OpenSearch#1181
POCs: Prabhu Senthamarai, Suresh N S, Ketan Verma, Pritkumar Ladani

@hdhalter hdhalter added enhancement New feature or request untriaged v2.2.0 and removed enhancement New feature or request labels Jul 11, 2022
@hdhalter hdhalter added this to the v2.2 milestone Jul 11, 2022
@hdhalter
Contributor Author

@JeffH-AWS - Hi Jeff, do you mind taking this one? It involves measuring shard resource consumption when running search related tasks. Thanks.

@JeffHuss

JeffHuss commented Jul 12, 2022

Sure - is there a specific ask/scope defined somewhere? I'm not really sure what is being requested specifically after glancing at the other issues. Seems like it could be super broad.

@Naarcha-AWS
Collaborator

@JeffH-AWS: Scope and ask still TBD. There's active development going on for Task Consumer Integration, however, we'll have to wait till the PR is merged in to start documenting it.

@JeffHuss

I'm going to split this into two issues and work them separately. One item is search back pressure, the other is task consumer integration.

@hdhalter
Contributor Author

Sounds good! Feel free to close this as duplicate.

@JeffHuss JeffHuss changed the title [Doc] Improve resliency in memory management [Doc] Improve resiliency in memory management - back pressure in search path Jul 19, 2022
@JeffHuss JeffHuss added the 1 - Backlog Issue: The issue is unassigned or assigned but not started label Jul 20, 2022
@hdhalter hdhalter added v2.3.0 and removed v2.2.0 labels Jul 21, 2022
@hdhalter
Contributor Author

I'm switching the label from 2.2 to 2.3, as discussed in the project roadmap meeting.

@Naarcha-AWS Naarcha-AWS removed this from the v2.2 milestone Jul 21, 2022
@hdhalter hdhalter added this to the v 3.0 milestone Aug 1, 2022
@JeffHuss

Will start looking at this shortly as we're not in the run-up to the 2.3 release.

@JeffHuss

JeffHuss commented Aug 16, 2022

It looks like milestone 2 is supposed to be included in 2.3:

Milestone 2
Goals
Improved fairness in Search request rejections
Reducing chances of node getting overwhelmed due to Search request load
Ability to stabilise overloaded nodes by identifying and cancelling resource-guzzling queries.
Achieving resiliency with reduced dependency on Circuit breaker and Threadpool queue configurations as the accuracy of rejections due to these depends on user input.
2.1 Server Side rejection of in-coming search requests
Currently, search rejections are based solely on the number of tasks queued for the Search thread pool. That doesn’t provide fairness in rejections: many small queries can exhaust this limit without being resource intensive, while the node could still take in many more requests, and vice versa. Essentially, the count is not a reflection of actual work.

Hence, based on the metrics in point 1.1 above, we want to build a framework which can perform more informed rejections based on point-in-time resource utilisation. The new model will take the admission decision for search requests on the node. These admission decisions or rejection limits can have different levels:

Level 1: At this point the system has detected overload due to search requests and it’ll prioritise which requests to accept. Example: it’ll accept Fetch requests over Query requests, because by the Fetch phase some work has already been done, whereas a Query request is going to be more resource intensive and wastes the least work if rejected. Similar logic can be applied for Force search requests as well.
Level 2: At this point we’ll start rejecting all search requests beyond capacity to prevent any impact on the availability of the node.
This can be further evolved to support a shard-level priority model, where the user can set a priority on an index or on every request, so that the framework can consume them when taking admission/rejection decisions.
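
For illustration only, the two admission levels described above could be sketched roughly as follows. This is not OpenSearch code: the `Phase`, `NodeUsage`, and `admit` names, the threshold values, and the choice to treat the larger of CPU and heap utilisation as the "pressure" signal are all assumptions made for this sketch. The RFC text quoted here does not specify any of them.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    QUERY = "query"
    FETCH = "fetch"


@dataclass
class NodeUsage:
    """Point-in-time resource utilisation, as fractions in [0, 1]."""
    cpu: float
    heap: float


# Hypothetical thresholds; the RFC does not give concrete values.
LEVEL1_THRESHOLD = 0.80
LEVEL2_THRESHOLD = 0.95


def admit(phase: Phase, usage: NodeUsage) -> bool:
    """Return True if a search request in the given phase should be admitted."""
    # Assumption: the most constrained resource drives the decision.
    pressure = max(usage.cpu, usage.heap)
    if pressure >= LEVEL2_THRESHOLD:
        # Level 2: reject every search request to protect node availability.
        return False
    if pressure >= LEVEL1_THRESHOLD:
        # Level 1: prefer Fetch (work already invested) over Query.
        return phase is Phase.FETCH
    # No overload detected: admit everything.
    return True
```

Unlike a queue-length check, the decision here depends on measured utilisation, so a burst of cheap queries would not trigger rejections on its own.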

If the user has configured partial results to be true, then these rejections, combined with the Coordinator’s inability to retry the request on another shard on a different node, might result in the user getting a partial response.

The above will provide the required isolation of accounting and fairness in rejections, which is currently missing. This is still a reactive back-pressure mechanism, as it only focusses on current consumption and does not estimate the future work to be done for these search requests.

2.2 Server side Cancellation of in-flight search requests based on resource consumption
This is the 3rd level, which kicks in after we’re already rejecting all search requests coming to a node. Here, we take the decision to cancel ongoing requests if the resource usage for that shard/node has started breaching the assigned limits (point 2.1) and no recovery is seen for a certain time threshold. The BackPressure model should support identification of the queries which are most resource guzzling with minimal wasteful work. These can then be cancelled to recover a node under load so that it can continue doing useful work.
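
As a rough sketch of the cancellation step described above, one possible selection policy is shown below. Everything in it is an assumption for illustration: the `SearchTask` fields, the `pick_tasks_to_cancel` function, the grace period, and the ranking rule (largest heap footprint first, ties broken by the least CPU time already spent, as a stand-in for "minimal wasteful work"). The RFC text does not define the actual heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchTask:
    task_id: str
    heap_bytes: int      # memory currently held by the task
    cpu_seconds: float   # work already done, which is lost on cancellation


def pick_tasks_to_cancel(
    tasks: List[SearchTask],
    breach_seconds: float,
    grace_seconds: float,
    bytes_to_free: int,
) -> List[str]:
    """Return IDs of tasks to cancel, or an empty list if none are needed."""
    # Only act once the limit has been breached for longer than the grace
    # period, i.e. the node has not recovered on its own.
    if breach_seconds < grace_seconds or bytes_to_free <= 0:
        return []
    # Hypothetical ranking: biggest consumers first; among equals, cancel
    # the task that has done the least work so far.
    ranked = sorted(tasks, key=lambda t: (-t.heap_bytes, t.cpu_seconds))
    victims: List[str] = []
    freed = 0
    for task in ranked:
        if freed >= bytes_to_free:
            break
        victims.append(task.task_id)
        freed += task.heap_bytes
    return victims
```

The loop stops as soon as the estimated freed memory covers the target, so the remaining tasks keep running and the node continues doing useful work.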
[rramachand21](https://github.com/rramachand21) commented [26 days ago](https://github.com/opensearch-project/OpenSearch/issues/1329#issuecomment-1191618947)
This (milestone 2) will come in 2.3 - we are merging in the changes for resource tracking framework in 2.2 (milestone 1)

@JeffHuss

Still trying to get a hold of @rramachand21 for details about milestone 2 from the meta/epic issue.

@JeffHuss JeffHuss added xx-documentation Improvements or additions to documentation feedback needed Needs SME Waiting on input from subject matter expert labels Aug 18, 2022
@Naarcha-AWS Naarcha-AWS modified the milestones: v 3.0, v2.3 Aug 22, 2022
@JeffHuss

I still have not received any information from the devs and there hasn't been a response on the feature issue.

@hdhalter hdhalter removed the xx-documentation Improvements or additions to documentation label Aug 29, 2022
@Naarcha-AWS Naarcha-AWS added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 labels Sep 13, 2022
@Naarcha-AWS Naarcha-AWS modified the milestones: v2.3, v2.4 Sep 13, 2022
@Naarcha-AWS Naarcha-AWS removed Needs SME Waiting on input from subject matter expert feedback needed labels Sep 27, 2022
@Naarcha-AWS Naarcha-AWS assigned kolchfa-aws and unassigned JeffHuss Sep 27, 2022
@Naarcha-AWS Naarcha-AWS changed the title [Doc] Improve resiliency in memory management - back pressure in search path [Doc] Server-side cancellation of search requests based on resource Sep 27, 2022
@Naarcha-AWS Naarcha-AWS changed the title [Doc] Server-side cancellation of search requests based on resource [Doc] Server-side rejection of search requests based on resource Sep 27, 2022
@hdhalter hdhalter added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Nov 3, 2022

4 participants