
Push back on excessive requests for stats #51992

Open · Tracked by #77466
DaveCTurner opened this issue Feb 6, 2020 · 6 comments · Fixed by #83832
Labels
:Data Management/Stats · resiliency · Team:Data Management

Comments

@DaveCTurner
Contributor

Today, stats requests are processed by the management threadpool, which is also used for important internal management tasks such as syncing global checkpoints and retention leases, ILM, and SLM. The management threadpool has an unbounded queue. Some stats take a nontrivial amount of effort to compute, and it is certainly possible to request stats faster than the cluster can respond. We cannot control the behaviour of clients requesting stats, and I've seen more than a few situations where an errant monitoring system harms the cluster with its request rate (see links in #51915). Since we doggedly enqueue every request, it can take a very long time to recover from this situation, and while working through the queue the well-behaved internal management tasks do not run in a timely fashion. The quickest recovery path may be to restart any affected nodes.

I think we should push back against this kind of behaviour to protect the cluster from abusive monitoring clients. We could, for instance, separate the internal (and well-behaved) actions from the external (and possibly-abusive) ones onto different threadpools, and use a bounded queue for the threadpool handling the external actions. Some users of the management threadpool are not clearly one or the other, and we'll need to use some judgement to decide whether we need to protect them from abuse - e.g. license management and security cache management. I've yet to see such actions involved in struggling clusters, however, so perhaps either way would be ok.
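As a rough illustration of the bounded-queue idea (this is a plain-JDK sketch, not Elasticsearch's actual threadpool machinery; the class name, pool size, and queue capacity below are illustrative only):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedStatsExecutor {

    // A small fixed pool with a bounded queue: once the queue fills up,
    // further stats requests are rejected immediately instead of piling up
    // behind internal management work. All sizes here are illustrative.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            2, 2,                                   // fixed pool of 2 threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // bounded queue, unlike today's unbounded one
            new ThreadPoolExecutor.AbortPolicy());  // throw rather than queue when full

    /** Returns false if the request was rejected because the queue is full. */
    public boolean trySubmitStatsRequest(Runnable statsTask) {
        try {
            executor.execute(statsTask);
            return true;
        } catch (RejectedExecutionException e) {
            // The caller would translate this into a 429-style "too many requests"
            // response so that a misbehaving monitoring client backs off.
            return false;
        }
    }
}
```

The point is that rejection happens at submission time, so the well-behaved internal work on its own threadpool never ends up stuck behind a backlog of external stats requests.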

Relates #51915

@DaveCTurner added the resiliency, :Data Management/Stats, :Distributed Indexing/Distributed, and team-discuss labels on Feb 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Stats)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@DaveCTurner
Contributor Author

We discussed this today and agreed that it's a valid concern; we could perhaps move the "well-behaved" actions into the GENERIC threadpool and bound the queue on the MANAGEMENT threadpool to push back on anything that's left.

However, we don't see much urgency on this front, and there are more general questions about how we assign work to the various threadpools which subsume this one, so we decided to proceed with the local improvement to stats performance in #51991 and close this issue to indicate that we won't be working on it in the near future.

@DaveCTurner
Contributor Author

I'm reopening this as we've seen cases where stats requests can become overwhelming even after the fixes mentioned above. I think we should reconsider some ideas for bounding the resources used by stats requests, for instance limiting the number of in-flight stats requests being coordinated by each node. If a single node becomes unresponsive for a few minutes then we could see a couple of hundred requests build up, and in a decent-sized cluster each could consume many MBs of heap on the coordinating node.
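A minimal sketch of the in-flight limit idea, using a plain `java.util.concurrent.Semaphore` rather than anything Elasticsearch-specific (the limit of 50 and the class/method names are illustrative assumptions):

```java
import java.util.concurrent.Semaphore;

public class StatsRequestLimiter {

    // Cap on how many stats requests a single coordinating node will fan out
    // at once; the value of 50 is illustrative.
    private final Semaphore inFlight = new Semaphore(50);

    /**
     * Tries to reserve an in-flight slot. Returns false if this node is already
     * coordinating the maximum number of stats requests, in which case the caller
     * should fail the request fast (e.g. with a 429) rather than let another
     * multi-megabyte partial response build up on the heap.
     */
    public boolean tryAcquire() {
        return inFlight.tryAcquire();
    }

    /** Must be called exactly once when a coordinated stats request completes or fails. */
    public void release() {
        inFlight.release();
    }
}
```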

Relates #82337, which would fix one particular way to get into this mess.

Relates #77466, since this particularly affects high-shard-count clusters.

@joegallo
Contributor

Reopening for reconsideration in light of #85333 (comment).

@joegallo reopened this May 23, 2022
@pxsalehi removed the :Distributed Indexing/Distributed label Jul 28, 2022
@elasticsearchmachine added the Team:Data Management label Jul 28, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
