Client-side stats collection timeouts can result in overloaded master #60188

Closed
DaveCTurner opened this issue Jul 27, 2020 · 6 comments · Fixed by #67084
Comments

@DaveCTurner
Contributor

DaveCTurner commented Jul 27, 2020

Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:

() -> client.admin().cluster().prepareClusterStats().get(getCollectionTimeout());

This timeout is configurable for each collector and defaults to 10 seconds:

protected static Setting<TimeValue> collectionTimeoutSetting(final String settingName) {
String name = collectionSetting(settingName);
return timeSetting(name, TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);
}
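For illustration, here is a minimal sketch of how these pieces fit together in a collector -- the class name and setting name are hypothetical and the import paths vary by version, but the calls mirror the ones above:

import org.elasticsearch.action.admin.cluster.stats.ClusterStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: how a collector wires the timeout setting into the blocking
// client call. The class and setting name are hypothetical, not the actual
// ClusterStatsCollector, and TimeValue's package differs between versions.
public class ExampleStatsCollector {

    // Dynamic, node-scoped timeout defaulting to 10 seconds, as above.
    static final Setting<TimeValue> COLLECTION_TIMEOUT = Setting.timeSetting(
        "xpack.monitoring.collection.example.timeout",
        TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);

    private final Client client;

    ExampleStatsCollector(Client client) {
        this.client = client;
    }

    ClusterStatsResponse collect(TimeValue collectionTimeout) {
        // The timeout is purely client-side: if it fires, the request keeps
        // running and its partial per-node results stay on the coordinating
        // node's heap.
        return client.admin().cluster().prepareClusterStats().get(collectionTimeout);
    }
}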

Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and once all nodes have responded they send a summary of the results to the originating client. These responses can be rather large, perhaps 10s of MBs, and this data all lives on-heap on the coordinating node until every node has responded.

The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while, then every 10 seconds monitoring will retry, registering new handlers and adding 10s of MBs more heap usage on the coordinating node.
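To make the accumulation concrete, here is a self-contained plain-Java sketch of the pattern (not Elasticsearch code): the caller's timeout abandons each attempt, but every abandoned attempt keeps its partially-collected per-node responses reachable until the slow node finally answers:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-Java sketch of the failure mode: a client-side timeout abandons the
// future, but the task behind it keeps running and keeps its partially
// collected per-node responses on the heap until the slow node answers.
public class TimeoutRetryDemo {

    static final ExecutorService COORDINATOR = Executors.newCachedThreadPool();

    // Simulates the coordinator fanning out to every node and holding each
    // response on-heap until the last node has answered.
    static CompletableFuture<Long> collectStats(List<String> nodes, String slowNode) {
        return CompletableFuture.supplyAsync(() -> {
            Map<String, byte[]> perNodeResponses = new ConcurrentHashMap<>();
            for (String node : nodes) {
                if (node.equals(slowNode)) {
                    sleep(60_000); // this node is stuck for a minute, so the whole collection waits here
                }
                perNodeResponses.put(node, new byte[10 * 1024 * 1024]); // ~10MB per node
            }
            return perNodeResponses.values().stream().mapToLong(response -> response.length).sum();
        }, COORDINATOR);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-0", "node-1", "node-2");
        for (int attempt = 0; attempt < 6; attempt++) {
            CompletableFuture<Long> request = collectStats(nodes, "node-2");
            try {
                request.get(10, TimeUnit.SECONDS); // the client-side timeout
            } catch (Exception timedOut) {
                // The future is abandoned here, but the collection behind it is
                // still running and still holding its partial results.
                System.out.println("attempt " + attempt + " timed out; retrying");
            }
        }
        // After six attempts there are six partial result sets held on the heap.
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}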

I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).

@DaveCTurner added the >bug, :Data Management/Monitoring, and Team:Data Management labels and removed the Team:Data Management and :Data Management/Monitoring labels on Jul 27, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@Bukhtawar
Contributor

@DaveCTurner Thanks for looking into it. I think the cluster should be able to protect itself independently of the monitoring entity, in which case timeouts rescue the coordinating node. We could additionally cancel tasks on the slow node when the timeout fires, since we don't care about their responses beyond that point.

Thoughts?

@jakelandis
Contributor

Thanks @DaveCTurner for raising this. We discussed this today and agree that this is an issue, but it may be a bit more complex than just removing the client-side timeout.

Some context: there is an effort to replace these "internal" collectors with Metricbeat-based collectors, which run out of process and have slightly different characteristics but collect the same data. Any large efforts should be geared towards ensuring that the Metricbeat approach is as effective as possible. The "internal" collectors run on each node, and any collections that require fanning out to individual nodes are handled on the master node. I am not clear on why this decision was made as opposed to simply letting each node handle its own metrics, avoiding intra-cluster communication altogether. (I am sure there is a valid reason, but those decisions pre-date my time here.) To the best of my knowledge Metricbeat is designed to run as a sidecar process for each instance of ES and always (I need to double-check this) uses the ?local flag to avoid intra-cluster communication.

If it were not for Metricbeat, I think we should revisit why intra-cluster communication is needed at all for internal collection, and go from there. However, time might be better spent ensuring that the Metricbeat approach is optimal and can't result in memory/performance issues akin to the one mentioned here.

One reason I am cautious (hesitant?) about removing the client-side timeouts is the question of what happens on the master node if it sends a request and never hears back. I don't think we have any implicit transport-layer timeouts, do we? Would we end up blocking indefinitely?

Most of these actions are backed by TransportBroadcastByNodeAction, which does not appear to have any explicit timeouts. I am not terribly familiar with that section of the codebase, but it would seem that an explicit timeout there would be suitable for more than just monitoring. @DaveCTurner - does adding an explicit timeout (calculated per node servicing the request, i.e. server-side) configurable on TransportBroadcastByNodeAction seem like a viable way forward for removing the client-side timeouts?
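For what it's worth, a plain-Java sketch of that idea (not the real TransportBroadcastByNodeAction, just the shape of a coordinator-side per-node timeout): each node gets a deadline, and when it fires we record a failure for that node and complete with whatever has arrived rather than waiting indefinitely:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of a coordinator-side per-node timeout for a broadcast fan-out.
public class BroadcastWithNodeTimeout {

    static final class NodeResult {
        final String node;
        final String failure; // null on success
        NodeResult(String node, String failure) { this.node = node; this.failure = failure; }
    }

    static CompletableFuture<List<NodeResult>> broadcast(
            List<String> nodes,
            Function<String, CompletableFuture<String>> sendNodeRequest,
            ScheduledExecutorService scheduler,
            long perNodeTimeoutMillis) {

        Map<String, NodeResult> results = new ConcurrentHashMap<>();
        AtomicInteger outstanding = new AtomicInteger(nodes.size());
        CompletableFuture<List<NodeResult>> done = new CompletableFuture<>();

        for (String node : nodes) {
            // Per-node deadline: if this node has not answered in time, record a
            // failure for it instead of keeping the whole broadcast open.
            ScheduledFuture<?> deadline = scheduler.schedule(
                () -> finish(node, new NodeResult(node, "timed out"), nodes, results, outstanding, done),
                perNodeTimeoutMillis, TimeUnit.MILLISECONDS);

            sendNodeRequest.apply(node).whenComplete((response, error) -> {
                deadline.cancel(false);
                NodeResult result = error == null
                    ? new NodeResult(node, null)
                    : new NodeResult(node, error.toString());
                finish(node, result, nodes, results, outstanding, done);
            });
        }
        return done;
    }

    // First outcome per node (response or timeout) wins; the other is ignored.
    private static void finish(String node, NodeResult result, List<String> nodes,
                               Map<String, NodeResult> results, AtomicInteger outstanding,
                               CompletableFuture<List<NodeResult>> done) {
        if (results.putIfAbsent(node, result) == null && outstanding.decrementAndGet() == 0) {
            done.complete(nodes.stream().map(results::get).collect(Collectors.toList()));
        }
    }
}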

@DaveCTurner
Contributor Author

DaveCTurner commented Aug 9, 2020

Indeed, the question for Metricbeat (and other HTTP clients) is covered in #55550. Today Metricbeat also times out after 10 seconds and immediately retries, and at least some of its collections run only on the elected master (!). As I noted in #55550, my preference there would be to treat a closed HTTP connection as a cancellation, like we do for searches, since HTTP clients need an end-to-end timeout anyway and this would let us respect that.

There are a couple of differences between the internal monitoring client and an external HTTP client like Metricbeat that call for a different approach for the monitoring client. Firstly, there's no equivalent to "close the HTTP connection" to signal that the request is no longer needed. Secondly, transport requests always get an eventual response, even without an explicit per-request timeout (mod bugs ofc), because they react to the health of the underlying TCP connection (as determined by all the other requests on the same connection, or keepalives in the case of extremely low traffic). (That said, HTTP requests also always get a response in some form, because of the same TCP-level mechanisms and the fact that the coordinating node always responds as long as the network is healthy (mod bugs ofc), assuming that clients configure keepalives properly -- but that's a separate debate.)

I'm -0 on moving the timeout to TransportBroadcastByNodeAction -- on the one hand, it'd keep stats flowing even if one node was struggling to respond, and would mean that each data point was collected at roughly the same time across the whole cluster. On the other hand, if we timed out node-level requests then we'd report incomplete stats, so cluster-wide totals would behave strangely. Given that Metricbeat takes an all-or-nothing approach even with a single slow node, I think we should do the same. Also, transport-level timeouts only take effect on the requester, so the node-level requests would still pile up on that one struggling node and make things worse. I think support for proper cancellation would be preferable, and then on a timeout the monitoring collector could cancel the request and all its children just like we propose for HTTP clients.

NB -0 means I have only a very slight preference for not doing it. The linked issues would certainly have been less severe if TransportBroadcastByNodeAction had had a timeout, and it'd be a relatively small change.
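For illustration only, a hedged sketch of what "cancel the request and all its children" might look like from the collector's side. This assumes the stats tasks are made cancellable (they would need to be for this to have any effect) and uses the tasks-cancellation API as I recall it, so treat the method names as an assumption; the plumbing for obtaining the coordinator-side task id is not shown:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.node.tasks.cancel.CancelTasksRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.tasks.TaskId;

// Hedged sketch: cancel a still-running stats task after the collector's
// timeout fires, rather than abandoning it. Assumes the stats tasks are
// cancellable and that the tasks API has this shape.
final class StatsCollectionCanceller {

    static void cancelTimedOutCollection(Client client, TaskId statsTaskId) {
        CancelTasksRequest cancel = new CancelTasksRequest();
        cancel.setTaskId(statsTaskId); // cancelling the parent should also cancel its child tasks
        cancel.setReason("monitoring collection timed out");
        client.admin().cluster().cancelTasks(cancel, ActionListener.wrap(
            response -> { /* cancelled, or the task had already completed */ },
            e -> { /* cancellation itself failed; nothing further to do here */ }
        ));
    }
}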

@DaveCTurner
Contributor Author

Adrien just pointed out something that I had missed earlier: if we simply remove these timeouts then that'll halt all the other collectors too, not just the one that's blocked on a slow node, because we run the collections in series, block the collecting thread to wait for each response in turn, and don't schedule the next collection until the previous cycle has completed 😢
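In other words, the scheduling looks roughly like this (a plain-Java sketch, not the real MonitoringService): collectors run in series on a single thread, and the next cycle is only scheduled once the previous one has finished, so one blocked collector stalls them all:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of the serial collection loop described above.
public class SerialCollectionLoop {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final List<Supplier<Object>> collectors;

    SerialCollectionLoop(List<Supplier<Object>> collectors) {
        this.collectors = collectors;
    }

    void runCycle() {
        for (Supplier<Object> collector : collectors) {
            collector.get(); // blocks the collection thread until this collector returns
        }
        // Only after every collector has returned do we schedule the next cycle.
        scheduler.schedule(this::runCycle, 10, TimeUnit.SECONDS);
    }
}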

@haveTryTwo

We have met the same problem: after a data node was restarted and recovered, the master node was not able to collect indices/stats, cluster/stats, or index/recovery from that data node, and then it ran out of memory (OOM), even though network and disk I/O both looked healthy on the data node according to iostat. Once a new master was elected the problem disappeared, without anything being done to the data node. Is there some problem with the connection used for collecting stats? Has this scenario been reproduced?

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 6, 2021
With elastic#66993 there is now support for coordinator-side timeouts on a
`BroadcastRequest`, which includes requests for node stats and
recoveries. This commit adjusts Monitoring to use these coordinator-side
timeouts where applicable, which will prevent partial stats responses
from accumulating on the master while one or more nodes are not
responding quickly enough. It also enhances the message logged on a
timeout to include the IDs of the nodes which did not respond in time.

Closes elastic#60188.
DaveCTurner added two commits that referenced this issue on Jan 11, 2021, both with the same message as above.