
Push back on excessive requests for stats #51992

Open · Tracked by #77466
DaveCTurner opened this issue Feb 6, 2020 · 6 comments · Fixed by #83832
Labels
:Data Management/Stats · resiliency · Team:Data Management

Comments

@DaveCTurner
Contributor

Today, stats requests are processed by the management threadpool, which is also used for important internal management tasks such as syncing global checkpoints and retention leases, ILM, and SLM. The management threadpool has an unbounded queue. Some stats take a nontrivial amount of effort to compute, and it is certainly possible to request stats faster than the cluster can respond. We cannot control the behaviour of clients requesting stats, and I've seen more than a few situations where an errant monitoring system harms the cluster with its request rate (see links in #51915). Since we doggedly enqueue every request, it can take a very long time to recover from this situation, and while working through the queue the well-behaved internal management tasks do not run in a timely fashion. The quickest recovery path may be to restart any affected nodes.

I think we should push back against this kind of behaviour to protect the cluster from abusive monitoring clients. We could, for instance, separate the internal (and well-behaved) actions from the external (and possibly-abusive) ones onto different threadpools, and use a bounded queue for the threadpool handling the external actions. Some users of the management threadpool are not clearly one or the other, and we'll need to use some judgement to decide whether we need to protect them from abuse - e.g. license management and security cache management. I've yet to see such actions involved in struggling clusters, however, so perhaps either way would be ok.
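As a rough illustration of the bounded-queue idea (this is a plain-JDK sketch, not Elasticsearch's actual threadpool machinery; the class name, pool size, and queue capacity below are illustrative only):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedStatsExecutor {

    // A small fixed pool with a bounded queue: once the queue fills up,
    // further stats requests are rejected immediately instead of piling up
    // behind internal management work. All sizes here are illustrative.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            2, 2,                                   // fixed pool of 2 threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // bounded queue, unlike today's unbounded one
            new ThreadPoolExecutor.AbortPolicy());  // throw rather than queue when full

    /** Returns false if the request was rejected because the queue is full. */
    public boolean trySubmitStatsRequest(Runnable statsTask) {
        try {
            executor.execute(statsTask);
            return true;
        } catch (RejectedExecutionException e) {
            // The caller would translate this into a 429-style "too many requests"
            // response so that a misbehaving monitoring client backs off.
            return false;
        }
    }
}
```

The point is that rejection happens at submission time, so the well-behaved internal work on its own threadpool never ends up stuck behind a backlog of external stats requests.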

Relates #51915

@DaveCTurner added the resiliency, :Data Management/Stats, :Distributed Indexing/Distributed, and team-discuss labels on Feb 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Stats)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@DaveCTurner
Contributor Author

We discussed this today and agreed that it's a valid concern; we could perhaps move the "well-behaved" actions into the GENERIC threadpool and bound the queue on the MANAGEMENT threadpool to push back on anything that's left.

However, we don't see much urgency on this front, and there are more general questions about how we assign work to the various threadpools which subsume this one, so we decided to proceed with the local improvement to stats performance in #51991 and close this issue to indicate that we won't be working on it in the near future.

@DaveCTurner
Contributor Author

I'm reopening this as we've seen cases where stats requests can become overwhelming even after the fixes mentioned above. I think we should reconsider some ideas for bounding the resources used by stats requests, for instance limiting the number of in-flight stats requests being coordinated by each node. If a single node becomes unresponsive for a few minutes then we could see a couple of hundred requests build up, and in a decent-sized cluster each could consume many MBs of heap on the coordinating node.
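A minimal sketch of the in-flight limit idea, using a plain `java.util.concurrent.Semaphore` rather than anything Elasticsearch-specific (the limit of 50 and the class/method names are illustrative assumptions):

```java
import java.util.concurrent.Semaphore;

public class StatsRequestLimiter {

    // Cap on how many stats requests a single coordinating node will fan out
    // at once; the value of 50 is illustrative.
    private final Semaphore inFlight = new Semaphore(50);

    /**
     * Tries to reserve an in-flight slot. Returns false if this node is already
     * coordinating the maximum number of stats requests, in which case the caller
     * should fail the request fast (e.g. with a 429) rather than let another
     * multi-megabyte partial response build up on the heap.
     */
    public boolean tryAcquire() {
        return inFlight.tryAcquire();
    }

    /** Must be called exactly once when a coordinated stats request completes or fails. */
    public void release() {
        inFlight.release();
    }
}
```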

Relates #82337, which would fix one particular way to get into this mess.

Relates #77466, since this particularly affects high-shard-count clusters.

@joegallo
Contributor

Reopening for reconsideration in light of #85333 (comment).

@joegallo reopened this May 23, 2022
@pxsalehi removed the :Distributed Indexing/Distributed label Jul 28, 2022
@elasticsearchmachine added the Team:Data Management label Jul 28, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
