
Support for timeout in stats API #52616

Closed
Bukhtawar opened this issue Feb 21, 2020 · 9 comments
Labels
:Data Management/Stats (Statistics tracking and retrieval APIs), feedback_needed

Comments

@Bukhtawar
Contributor

The GET _stats API broadcasts requests to all nodes in order to collect shard-level stats from across the cluster. If a single node is problematic (degraded hardware, a kernel unable to schedule tasks, CPU lock-ups, etc.), heap can build up on the node handling the REST request, because it cannot free the memory allocated for the responses of the remaining nodes while it waits for the problematic node to respond. If clients poll the API for periodic monitoring, this can increase GC pressure on those nodes.

Heap histogram dump from one of the nodes:

 num     #instances         #bytes  class name
----------------------------------------------
   1:     230117683    14813806384  [C
   2:     230114240     5522741760  java.lang.String
   3:      77239917     2471677344  java.util.HashMap$Node
   4:      11036035      889480064  [Ljava.util.HashMap$Node;
   5:      10886249      870899920  org.elasticsearch.action.admin.indices.stats.CommonStats
   6:      10944341      700437824  org.elasticsearch.cluster.routing.ShardRouting
   7:      21838600      698835200  java.util.Collections$UnmodifiableMap
   8:      10888927      609779912  java.util.LinkedHashMap
   9:      11072686      531488928  java.util.HashMap
  10:      10885985      522527280  org.elasticsearch.action.admin.indices.stats.ShardStats
  11:      10885985      435439400  org.elasticsearch.index.seqno.SeqNoStats
  12:         30990      392283384  [B
  13:      10885986      348351552  org.elasticsearch.index.seqno.RetentionLeases
  14:      10885985      348351520  org.elasticsearch.index.engine.CommitStats
  15:      11002724      264065376  java.util.Collections$SingletonList
  16:      11002472      264059328  org.elasticsearch.index.Index
  17:      10972905      263349720  org.elasticsearch.index.shard.ShardId
  18:      10944339      262664136  org.elasticsearch.cluster.routing.AllocationId
  19:       5474469      218978760  org.elasticsearch.index.shard.DocsStats
  20:      10885985      174175760  org.elasticsearch.index.seqno.RetentionLeaseStats
  21:       5469798      131275152  org.elasticsearch.index.store.StoreStats

This can easily be reproduced by placing a sleep in TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler#messageReceived and invoking the REST _stats API periodically.
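
For illustration, a minimal sketch of that reproduction step, assuming a local build where the handler can be patched; the exact method signature is whatever the targeted Elasticsearch version declares:

```java
// Hypothetical patch inside TransportBroadcastByNodeAction.BroadcastByNodeTransportRequestHandler,
// applied only to a local build to reproduce the heap build-up; not production code.
@Override
public void messageReceived(NodeRequest request, TransportChannel channel, Task task) throws Exception {
    // Simulate a node that stays connected but is extremely slow to return its shard-level stats.
    Thread.sleep(600_000L); // 10 minutes
    // ... then fall through to the original handling of the node-level request ...
}
```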

@matriv added the :Data Management/Stats (Statistics tracking and retrieval APIs) label Feb 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Stats)

@Bukhtawar
Contributor Author

I'll be happy to work on a PR. Please let me know if that is fine.

@Bukhtawar
Contributor Author

Bukhtawar commented Feb 24, 2020

Today we already get a partial result if the request fails on any node, e.g. a circuit-breaker exception on the transport node, so timing out on the slow node would mimic that behaviour by reporting a FailedNodeException for it:

{"_shards":{"total":4274,"successful":4241,"failed":33,"failures":[{"shard":27,"index":"test-2020.01.03-19","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [GVLMLR5-WiysqqK8SBg]","node_id":"GVLMLR5-WiysqqK8SBg","caused_by":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<transport_request>] would be [30593795348/28.4gb], which is larger than the limit of [29479606681/27.4gb], real usage: [30593766888/28.4gb], new bytes reserved: [28460/27.7kb]","bytes_wanted":30593795348,"bytes_limit":29479606681}}}

This would be in line with the _nodes/stats API, which already supports a timeout.
@DaveCTurner let me know your thoughts on this.

The heap build-up is proportional to the number of shards in the cluster. I think passing a TransportRequestOptions built from an optional timeout should help:

transportService.sendRequest(node, transportNodeBroadcastAction, nodeRequest, new TransportResponseHandler<NodeResponse>() {
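
A rough sketch of how that could be threaded through (not an actual patch: `request.timeout()` is a hypothetical new field, and the `TransportRequestOptions` builder/timeout API is assumed to match the targeted version):

```java
// Hypothetical: request.timeout() would be a new, optional timeout on the broadcast request.
final TransportRequestOptions options = request.timeout() == null
    ? TransportRequestOptions.EMPTY
    : TransportRequestOptions.builder().withTimeout(request.timeout()).build();

// Same call as above with the options threaded through; "handler" stands for the existing
// TransportResponseHandler<NodeResponse>. A node that exceeds the timeout surfaces as a
// ReceiveTimeoutTransportException, which the existing per-node failure handling can wrap
// in a FailedNodeException and report under "failures", giving back a partial result.
transportService.sendRequest(node, transportNodeBroadcastAction, nodeRequest, options, handler);
```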

@Bukhtawar
Contributor Author

@matriv any updates on this?

@DaveCTurner
Contributor

DaveCTurner commented Feb 28, 2020

The underlying problems you describe (degraded hardware, kernel scheduling issues, CPU lockups) sound like they will have more widespread effects than just the stats APIs, and I would like to understand more clearly why the existing mechanisms for dealing with these problems are not effective. In particular, if a node fails its health checks then it will be removed from the cluster which will unblock everything that's waiting for the broken node to respond.

In other words, how can the node be so broken that it cannot respond to stats calls, whilst still not being broken enough to fail its health checks? How can we strengthen the health checks to detect this?

@Bukhtawar
Contributor Author

Thanks @DaveCTurner

> In particular, if a node fails its health checks then it will be removed from the cluster which will unblock everything that's waiting for the broken node to respond.

In this particular incident we saw a kernel scheduling issue (INFO: task java:58012 blocked for more than 120 seconds) that caused at least the transport worker threads to get stuck, while the health-check pings continued to go through on the generic thread pool. I don't have a thread dump to confirm this precisely, but based on my observations _cat/indices and other APIs were also stuck on this node. The cluster recovered once the Elasticsearch process on the bad node was restarted.

With degraded hardware, the network pings for leader/follower checks go through while I/O can be very slow, causing the heap build-up on the healthy node. While some of the bad-hardware cases would be addressed through #45286, grey/degraded hardware issues would need other supporting metrics to detect and mitigate.

@DaveCTurner
Contributor

I'm struggling to align what you are saying with how Elasticsearch works today. For instance, all transport messages go through a transport worker so if the transport worker threads are stuck then health checks will fail. Furthermore in an otherwise-stable cluster the health checks run on transport worker threads and not the generic thread pool. I think we cannot reasonably decide on a course of action until we understand what was really happening in this cluster. It's a shame you don't have a thread dump, as that would have clarified things a lot.

You have already opened a PR to improve the health checks in a way that I think would help in the situation you describe (#52680). If you don't think that's sufficient then maybe the best way forward is to strengthen your proposed checks to cover this failure mode too. I think we should close this issue regarding the stats APIs and continue discussing the question of better health checks on your PR.

@Bukhtawar
Contributor Author

Thanks for the clarification, but don't we broadly have two categories of transport worker threads, one doing disk I/O (for stats) and the other only network I/O (pings)? Based on what I reported (/_cat/indices and the stats API being stuck), it would still be possible for the good node to face a heap build-up.

While I agree the bad node can be handled through the PR we already have open, adding a timeout might still be helpful in an unresponsive cluster where some nodes are busy with GC: we might still want partial results rather than being blocked for a long time, especially with client-side monitoring that has a granular metric SLA. Let me know what you think.

@hydrogen666

hydrogen666 commented Jul 7, 2020

We are facing a similar issue. We have a monitoring plugin that collects stats for every index periodically by calling the GET /_stats API. One data node responded to indices:monitor/stats[n] very slowly due to a disk issue while its network connection was OK, and this caused our elected master to run out of memory.

It's hard to say whether we will encounter similar issues not caused by disk problems in the future. I think we should have a timeout mechanism for the GET /_stats API. @DaveCTurner
