
Support for timeout in stats API #52616

Closed
Bukhtawar opened this issue Feb 21, 2020 · 9 comments
Labels
:Data Management/Stats (Statistics tracking and retrieval APIs), feedback_needed

Comments

@Bukhtawar
Contributor

The GET _stats API broadcasts requests to all nodes in order to collect shard-level stats from across the cluster. If a single node is problematic (degraded hardware, a kernel unable to schedule tasks, CPU lock-ups, etc.), heap can build up on the node handling the REST request, because it cannot free the memory allocated for the responses of the remaining nodes while it waits for the problematic node to respond. If clients poll the API for periodic monitoring, this can increase GC pressure on those nodes.

Heap histogram dump from one of the nodes:

 num     #instances         #bytes  class name
----------------------------------------------
   1:     230117683    14813806384  [C
   2:     230114240     5522741760  java.lang.String
   3:      77239917     2471677344  java.util.HashMap$Node
   4:      11036035      889480064  [Ljava.util.HashMap$Node;
   5:      10886249      870899920  org.elasticsearch.action.admin.indices.stats.CommonStats
   6:      10944341      700437824  org.elasticsearch.cluster.routing.ShardRouting
   7:      21838600      698835200  java.util.Collections$UnmodifiableMap
   8:      10888927      609779912  java.util.LinkedHashMap
   9:      11072686      531488928  java.util.HashMap
  10:      10885985      522527280  org.elasticsearch.action.admin.indices.stats.ShardStats
  11:      10885985      435439400  org.elasticsearch.index.seqno.SeqNoStats
  12:         30990      392283384  [B
  13:      10885986      348351552  org.elasticsearch.index.seqno.RetentionLeases
  14:      10885985      348351520  org.elasticsearch.index.engine.CommitStats
  15:      11002724      264065376  java.util.Collections$SingletonList
  16:      11002472      264059328  org.elasticsearch.index.Index
  17:      10972905      263349720  org.elasticsearch.index.shard.ShardId
  18:      10944339      262664136  org.elasticsearch.cluster.routing.AllocationId
  19:       5474469      218978760  org.elasticsearch.index.shard.DocsStats
  20:      10885985      174175760  org.elasticsearch.index.seqno.RetentionLeaseStats
  21:       5469798      131275152  org.elasticsearch.index.store.StoreStats

This can easily be reproduced by placing a sleep in TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler#messageReceived and invoking the REST _stats API periodically.
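
For illustration, a minimal sketch of that reproduction step, assuming a local build where the handler can be patched; the exact method signature is whatever the targeted Elasticsearch version declares:

```java
// Hypothetical patch inside TransportBroadcastByNodeAction.BroadcastByNodeTransportRequestHandler,
// applied only to a local build to reproduce the heap build-up; not production code.
@Override
public void messageReceived(NodeRequest request, TransportChannel channel, Task task) throws Exception {
    // Simulate a node that stays connected but is extremely slow to return its shard-level stats.
    Thread.sleep(600_000L); // 10 minutes
    // ... then fall through to the original handling of the node-level request ...
}
```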

@matriv added the :Data Management/Stats (Statistics tracking and retrieval APIs) label Feb 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Stats)

@Bukhtawar
Contributor Author

I'll be happy to work on a PR. Please let me know if that is fine.

@Bukhtawar
Contributor Author

Bukhtawar commented Feb 24, 2020

Today we already get a partial result if the request fails on any node, e.g. a circuit-breaker exception on the transport node, so timing out on the slow node would mimic that behaviour by reporting a FailedNodeException for it:

{"_shards":{"total":4274,"successful":4241,"failed":33,"failures":[{"shard":27,"index":"test-2020.01.03-19","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [GVLMLR5-WiysqqK8SBg]","node_id":"GVLMLR5-WiysqqK8SBg","caused_by":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<transport_request>] would be [30593795348/28.4gb], which is larger than the limit of [29479606681/27.4gb], real usage: [30593766888/28.4gb], new bytes reserved: [28460/27.7kb]","bytes_wanted":30593795348,"bytes_limit":29479606681}}}

This would be in line with the _nodes/stats API, which already supports a timeout.
@DaveCTurner let me know your thoughts on this.

The heap build-up is proportional to the number of shards in the cluster. I think passing a TransportRequestOptions built from an optional timeout should help:

transportService.sendRequest(node, transportNodeBroadcastAction, nodeRequest, new TransportResponseHandler<NodeResponse>() {
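
A rough sketch of how that could be threaded through (not an actual patch: `request.timeout()` is a hypothetical new field, and the `TransportRequestOptions` builder/timeout API is assumed to match the targeted version):

```java
// Hypothetical: request.timeout() would be a new, optional timeout on the broadcast request.
final TransportRequestOptions options = request.timeout() == null
    ? TransportRequestOptions.EMPTY
    : TransportRequestOptions.builder().withTimeout(request.timeout()).build();

// Same call as above with the options threaded through; "handler" stands for the existing
// TransportResponseHandler<NodeResponse>. A node that exceeds the timeout surfaces as a
// ReceiveTimeoutTransportException, which the existing per-node failure handling can wrap
// in a FailedNodeException and report under "failures", giving back a partial result.
transportService.sendRequest(node, transportNodeBroadcastAction, nodeRequest, options, handler);
```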

@Bukhtawar
Contributor Author

@matriv any updates on this?

@DaveCTurner
Contributor

DaveCTurner commented Feb 28, 2020

The underlying problems you describe (degraded hardware, kernel scheduling issues, CPU lockups) sound like they will have more widespread effects than just the stats APIs, and I would like to understand more clearly why the existing mechanisms for dealing with these problems are not effective. In particular, if a node fails its health checks then it will be removed from the cluster which will unblock everything that's waiting for the broken node to respond.

In other words, how can the node be so broken that it cannot respond to stats calls, whilst still not being broken enough to fail its health checks? How can we strengthen the health checks to detect this?

@Bukhtawar
Contributor Author

Thanks @DaveCTurner

> In particular, if a node fails its health checks then it will be removed from the cluster which will unblock everything that's waiting for the broken node to respond.

In this particular incident we saw a kernel scheduling issue (INFO: task java:58012 blocked for more than 120 seconds) that caused at least the transport worker threads to get stuck, while the health-check pings continued to go through on the generic thread pool. I don't have a thread dump to confirm this precisely, but based on my observations _cat/indices and other APIs were also stuck on this node. The cluster recovered once the Elasticsearch process on the bad node was restarted.

With degraded hardware, the network pings for leader/follower checks go through while I/O can be very slow, causing the heap build-up on the healthy node. While some of the bad-hardware cases would be addressed through #45286, grey/degraded hardware issues would need other supporting metrics to detect and mitigate.

@DaveCTurner
Contributor

I'm struggling to align what you are saying with how Elasticsearch works today. For instance, all transport messages go through a transport worker so if the transport worker threads are stuck then health checks will fail. Furthermore in an otherwise-stable cluster the health checks run on transport worker threads and not the generic thread pool. I think we cannot reasonably decide on a course of action until we understand what was really happening in this cluster. It's a shame you don't have a thread dump, as that would have clarified things a lot.

You have already opened a PR to improve the health checks in a way that I think would help in the situation you describe (#52680). If you don't think that's sufficient then maybe the best way forward is to strengthen your proposed checks to cover this failure mode too. I think we should close this issue regarding the stats APIs and continue discussing the question of better health checks on your PR.

@Bukhtawar
Contributor Author

Thanks for the clarification, but don't we broadly have two categories of transport worker threads, one doing disk I/O (for stats) and the other only network I/O (pings)? Based on what I reported (/_cat/indices and the stats API being stuck), it would still be possible for the good node to face a heap build-up.

While I agree the bad node can be handled through the PR we already have open, adding a timeout might still be helpful in an unresponsive cluster where some nodes are busy with GC: we might still want partial results rather than being blocked for a long time, especially with client-side monitoring that has a granular metric SLA. Let me know what you think.

@hydrogen666

hydrogen666 commented Jul 7, 2020

We are facing a similar issue. We have a monitoring plugin that collects stats for every index periodically by calling the GET /_stats API. One data node responded to indices:monitor/stats[n] very slowly due to a disk issue while its network connection was OK, and this caused our elected master to run out of memory.

It's hard to say whether we will encounter similar issues not caused by disk problems in the future. I think we should have a timeout mechanism for the GET /_stats API. @DaveCTurner
