Support for timeout in stats API #52616
Pinging @elastic/es-core-features (:Core/Features/Stats)
I'll be happy to work on a PR. Please let me know if that is fine.
Today we can get a partial result if the request on any node fails, e.g. a circuit breaker exception on the transport node, so timing out on the slow node would mimic a similar behaviour with a timeout. It would be in line with that existing partial-result behaviour. The heap build-up is proportional to the number of shards in the cluster. I guess passing a timeout through here (line 313 in b590b49) would help.
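For comparison, other cluster APIs already expose this kind of behaviour: the cluster health API accepts a timeout and reports timed_out in its response rather than blocking indefinitely. Below is a minimal sketch of that existing pattern using the low-level Java REST client; the host, port, class name and the 10s value are illustrative assumptions, not part of this proposal.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ClusterHealthTimeoutExample {
    public static void main(String[] args) throws Exception {
        // Low-level REST client pointed at a local node (host/port are assumptions).
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // _cluster/health accepts a timeout parameter and includes a
            // "timed_out" flag in the response body instead of waiting indefinitely.
            Request request = new Request("GET", "/_cluster/health");
            request.addParameter("timeout", "10s");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```

A similar parameter on _stats, returning whatever per-node results arrived before the deadline, is essentially what this issue asks for.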
@matriv any updates on this?
The underlying problems you describe (degraded hardware, kernel scheduling issues, CPU lockups) sound like they will have more widespread effects than just the stats APIs, and I would like to understand more clearly why the existing mechanisms for dealing with these problems are not effective. In particular, if a node fails its health checks then it will be removed from the cluster, which will unblock everything that's waiting for the broken node to respond. In other words, how can the node be so broken that it cannot respond to stats calls, whilst still not being broken enough to fail its health checks? How can we strengthen the health checks to detect this?
Thanks @DaveCTurner
In this particular incident we saw the kernel scheduling issue. With degraded hardware the network pings for leader/follower checks still go through, but I/O can be very slow, causing the heap build-up on the healthy node. While some of the bad-hardware cases would be addressed through #45286, the grey/degraded-hardware issues would need other supporting metrics to detect and mitigate.
I'm struggling to align what you are saying with how Elasticsearch works today. For instance, all transport messages go through a transport worker, so if the transport worker threads are stuck then health checks will fail. Furthermore, in an otherwise-stable cluster the health checks run on transport worker threads and not the generic thread pool.

I think we cannot reasonably decide on a course of action until we understand what was really happening in this cluster. It's a shame you don't have a thread dump, as that would have clarified things a lot.

You have already opened a PR to improve the health checks in a way that I think would help in the situation you describe (#52680). If you don't think that's sufficient then maybe the best way forward is to strengthen your proposed checks to cover this failure mode too. I think we should close this issue regarding the stats APIs and continue discussing the question of better health checks on your PR.
Thanks for the clarification, but don't we broadly have two categories of transport worker threads, one doing disk I/O (for stats) and the other only network I/O (pings)? Based on what I reported (/_cat/indices and the stats API being stuck), it would still be possible for the good node to face a heap build-up. While I agree the bad node can be handled through the PR we already have open, adding a timeout might still be helpful for an unresponsive cluster where some nodes are busy with GC: we might want partial results rather than being blocked for a long time, especially with client-side monitoring that has a granular metric SLA. Let me know what you think.
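As a stop-gap on the client side, a monitoring caller can at least bound how long it blocks by configuring timeouts on the low-level Java REST client. This is only a sketch of that mitigation, assuming localhost:9200 and arbitrary timeout values; it fails the request rather than returning partial results, and the coordinating node still holds the partially collected responses until its own call completes.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class BoundedStatsClient {
    // Sketch of a client-side mitigation: cap connect and socket timeouts so a
    // monitoring caller fails fast instead of hanging on a stuck _stats call.
    // Host, port and the timeout values below are assumptions for illustration.
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)    // ms allowed to establish the connection
                .setSocketTimeout(30_000))   // ms of response inactivity before failing
            .build();
    }
}
```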
We are facing a similar issue. We have a monitoring plugin that periodically collects stats for every index by calling the stats API. It's hard to say whether we will encounter similar issues not caused by disk problems in the future. I think we can have a timeout mechanism for this API.
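To make the request concrete, this is roughly what such usage could look like from the low-level Java REST client. The timeout parameter on _stats shown here is hypothetical: it does not exist today and is exactly the behaviour this issue proposes, where slow nodes would be skipped and partial results returned.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class HypotheticalStatsTimeout {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/_stats");
            // HYPOTHETICAL: _stats does not currently accept a timeout parameter,
            // so today this call would be rejected; it only illustrates the
            // proposed API shape.
            request.addParameter("timeout", "15s");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```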
The GET _stats API broadcasts requests to all nodes in order to collect shard-level stats from across the cluster. If a single node is problematic (degraded hardware, the kernel unable to schedule tasks in some scenarios, CPU lock-ups, etc.), heap can build up on the node handling the REST request, because it cannot free the memory allocated for the responses of the remaining nodes while it waits for the problematic node to respond. If clients are doing periodic monitoring, this can increase GC pressure on the nodes.

Histogram dump from one of the nodes:

This can be easily reproduced by placing a sleep in TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler#messageReceived and invoking the REST _stats API periodically.
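A driver for the client-side half of that reproduction might look like the following sketch: an asynchronous poller that fires a _stats request on a fixed schedule, so outstanding requests (and their partially collected responses) accumulate on the coordinating node while the artificially delayed node stays silent. The host, port, interval and run duration are assumptions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public class StatsPoller {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Fire a _stats request every 10 seconds without waiting for earlier calls
        // to finish, mimicking periodic monitoring clients. With one data node
        // artificially delayed, responses from the healthy nodes pile up on the
        // coordinating node until the slow node answers.
        scheduler.scheduleAtFixedRate(() ->
            client.performRequestAsync(new Request("GET", "/_stats"), new ResponseListener() {
                @Override
                public void onSuccess(Response response) {
                    System.out.println("stats: " + response.getStatusLine());
                }

                @Override
                public void onFailure(Exception exception) {
                    System.err.println("stats call failed: " + exception);
                }
            }), 0, 10, TimeUnit.SECONDS);

        Thread.sleep(TimeUnit.MINUTES.toMillis(10)); // observe heap/GC on the coordinating node
        scheduler.shutdownNow();
        client.close();
    }
}
```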