Client-side stats collection timeouts can result in overloaded master #60188

Closed
DaveCTurner opened this issue Jul 27, 2020 · 6 comments · Fixed by #67084
Comments

@DaveCTurner
Contributor

DaveCTurner commented Jul 27, 2020

Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:

() -> client.admin().cluster().prepareClusterStats().get(getCollectionTimeout());

This timeout is configurable for each collector and defaults to 10 seconds:

protected static Setting<TimeValue> collectionTimeoutSetting(final String settingName) {
String name = collectionSetting(settingName);
return timeSetting(name, TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);
}
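For illustration, here is a minimal sketch of how these pieces fit together in a collector -- the class name and setting name are hypothetical and the import paths vary by version, but the calls mirror the ones above:

import org.elasticsearch.action.admin.cluster.stats.ClusterStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: how a collector wires the timeout setting into the blocking
// client call. The class and setting name are hypothetical, not the actual
// ClusterStatsCollector, and TimeValue's package differs between versions.
public class ExampleStatsCollector {

    // Dynamic, node-scoped timeout defaulting to 10 seconds, as above.
    static final Setting<TimeValue> COLLECTION_TIMEOUT = Setting.timeSetting(
        "xpack.monitoring.collection.example.timeout",
        TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);

    private final Client client;

    ExampleStatsCollector(Client client) {
        this.client = client;
    }

    ClusterStatsResponse collect(TimeValue collectionTimeout) {
        // The timeout is purely client-side: if it fires, the request keeps
        // running and its partial per-node results stay on the coordinating
        // node's heap.
        return client.admin().cluster().prepareClusterStats().get(collectionTimeout);
    }
}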

Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and once all nodes have responded they send a summary of the results to the originating client. These responses can be rather large, perhaps 10s of MBs, and this data all lives on-heap on the coordinating node until every node has responded.

The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while, then every 10 seconds monitoring will retry, registering new handlers and adding 10s of MBs more heap usage on the coordinating node.
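To make the accumulation concrete, here is a self-contained plain-Java sketch of the pattern (not Elasticsearch code): the caller's timeout abandons each attempt, but every abandoned attempt keeps its partially-collected per-node responses reachable until the slow node finally answers:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-Java sketch of the failure mode: a client-side timeout abandons the
// future, but the task behind it keeps running and keeps its partially
// collected per-node responses on the heap until the slow node answers.
public class TimeoutRetryDemo {

    static final ExecutorService COORDINATOR = Executors.newCachedThreadPool();

    // Simulates the coordinator fanning out to every node and holding each
    // response on-heap until the last node has answered.
    static CompletableFuture<Long> collectStats(List<String> nodes, String slowNode) {
        return CompletableFuture.supplyAsync(() -> {
            Map<String, byte[]> perNodeResponses = new ConcurrentHashMap<>();
            for (String node : nodes) {
                if (node.equals(slowNode)) {
                    sleep(60_000); // this node is stuck for a minute, so the whole collection waits here
                }
                perNodeResponses.put(node, new byte[10 * 1024 * 1024]); // ~10MB per node
            }
            return perNodeResponses.values().stream().mapToLong(response -> response.length).sum();
        }, COORDINATOR);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-0", "node-1", "node-2");
        for (int attempt = 0; attempt < 6; attempt++) {
            CompletableFuture<Long> request = collectStats(nodes, "node-2");
            try {
                request.get(10, TimeUnit.SECONDS); // the client-side timeout
            } catch (Exception timedOut) {
                // The future is abandoned here, but the collection behind it is
                // still running and still holding its partial results.
                System.out.println("attempt " + attempt + " timed out; retrying");
            }
        }
        // After six attempts there are six partial result sets held on the heap.
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}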

I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).

@DaveCTurner added the >bug, :Data Management/Monitoring, and Team:Data Management labels and removed the Team:Data Management and :Data Management/Monitoring labels on Jul 27, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@Bukhtawar
Contributor

@DaveCTurner Thanks for looking into it. I think the cluster should be able to protect itself independently of the monitoring entity, in which case timeouts rescue the coordinating node. We could additionally cancel tasks on the slow node when the timeout fires, since we don't care about their responses beyond that point.

Thoughts?

@jakelandis
Contributor

Thanks @DaveCTurner for raising this. We discussed this today and agree that this is an issue, but it may be a bit more complex than just removing the client-side timeout.

Some context: there is an effort to replace these "internal" collectors with Metricbeat-based collectors, which run out of process and have slightly different characteristics but collect the same data. Any large efforts should be geared towards ensuring that the Metricbeat approach is as effective as possible. The "internal" collectors run on each node, and any collections that require fanning out to individual nodes are handled on the master node. I am not clear on why this decision was made as opposed to simply letting each node handle its own metrics, avoiding intra-cluster communication altogether. (I am sure there is a valid reason, but those decisions pre-date my time here.) To the best of my knowledge Metricbeat is designed to run as a sidecar process for each instance of ES and always (I need to double-check this) uses the ?local flag to avoid intra-cluster communication.

If it were not for Metricbeat, I think we should revisit why intra-cluster communication is needed at all for internal collection, and go from there. However, time might be better spent ensuring that the Metricbeat approach is optimal and can't result in memory/performance issues akin to the one mentioned here.

One reason I am cautious (hesitant?) about removing the client-side timeouts is the question of what happens on the master node if it sends a request and never hears back. I don't think we have any implicit transport-layer timeouts, do we? Would we end up blocking indefinitely?

Most of these actions are backed by TransportBroadcastByNodeAction, which does not appear to have any explicit timeouts. I am not terribly familiar with that section of the codebase, but it would seem that an explicit timeout there would be suitable for more than just monitoring. @DaveCTurner - does adding an explicit timeout (calculated per node servicing the request, i.e. server-side) configurable on TransportBroadcastByNodeAction seem like a viable way forward for removing the client-side timeouts?
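For what it's worth, a plain-Java sketch of that idea (not the real TransportBroadcastByNodeAction, just the shape of a coordinator-side per-node timeout): each node gets a deadline, and when it fires we record a failure for that node and complete with whatever has arrived rather than waiting indefinitely:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of a coordinator-side per-node timeout for a broadcast fan-out.
public class BroadcastWithNodeTimeout {

    static final class NodeResult {
        final String node;
        final String failure; // null on success
        NodeResult(String node, String failure) { this.node = node; this.failure = failure; }
    }

    static CompletableFuture<List<NodeResult>> broadcast(
            List<String> nodes,
            Function<String, CompletableFuture<String>> sendNodeRequest,
            ScheduledExecutorService scheduler,
            long perNodeTimeoutMillis) {

        Map<String, NodeResult> results = new ConcurrentHashMap<>();
        AtomicInteger outstanding = new AtomicInteger(nodes.size());
        CompletableFuture<List<NodeResult>> done = new CompletableFuture<>();

        for (String node : nodes) {
            // Per-node deadline: if this node has not answered in time, record a
            // failure for it instead of keeping the whole broadcast open.
            ScheduledFuture<?> deadline = scheduler.schedule(
                () -> finish(node, new NodeResult(node, "timed out"), nodes, results, outstanding, done),
                perNodeTimeoutMillis, TimeUnit.MILLISECONDS);

            sendNodeRequest.apply(node).whenComplete((response, error) -> {
                deadline.cancel(false);
                NodeResult result = error == null
                    ? new NodeResult(node, null)
                    : new NodeResult(node, error.toString());
                finish(node, result, nodes, results, outstanding, done);
            });
        }
        return done;
    }

    // First outcome per node (response or timeout) wins; the other is ignored.
    private static void finish(String node, NodeResult result, List<String> nodes,
                               Map<String, NodeResult> results, AtomicInteger outstanding,
                               CompletableFuture<List<NodeResult>> done) {
        if (results.putIfAbsent(node, result) == null && outstanding.decrementAndGet() == 0) {
            done.complete(nodes.stream().map(results::get).collect(Collectors.toList()));
        }
    }
}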

@DaveCTurner
Contributor Author

DaveCTurner commented Aug 9, 2020

Indeed, the question for Metricbeat (and other HTTP clients) is covered in #55550. Today Metricbeat also times out after 10 seconds and immediately retries, and at least some of its collections run only on the elected master (!). As I noted in #55550, my preference there would be to treat a closed HTTP connection as a cancellation, like we do for searches, since HTTP clients need an end-to-end timeout anyway and this would let us respect that.

There are a couple of differences between the internal monitoring client and an external HTTP client like Metricbeat that call for a different approach for the monitoring client. Firstly, there's no equivalent to "close the HTTP connection" to signal that the request is no longer needed. Secondly, transport requests always get an eventual response, even without an explicit per-request timeout (mod bugs ofc), because they react to the health of the underlying TCP connection (as determined by all the other requests on the same connection, or keepalives in the case of extremely low traffic). (That said, HTTP requests also always get a response in some form, because of the same TCP-level mechanisms and the fact that the coordinating node always responds as long as the network is healthy (mod bugs ofc), assuming that clients configure keepalives properly -- but that's a separate debate.)

I'm -0 on moving the timeout to TransportBroadcastByNodeAction -- on the one hand, it'd keep stats flowing even if one node was struggling to respond, and would mean that each data point was collected at roughly the same time across the whole cluster. On the other hand, if we timed out node-level requests then we'd report incomplete stats, so cluster-wide totals would behave strangely. Given that Metricbeat takes an all-or-nothing approach even with a single slow node, I think we should do the same. Also, transport-level timeouts only take effect on the requester, so the node-level requests would still pile up on that one struggling node and make things worse. I think support for proper cancellation would be preferable, and then on a timeout the monitoring collector could cancel the request and all its children just like we propose for HTTP clients.

NB -0 means I have only a very slight preference for not doing it. The linked issues would certainly have been less severe if TransportBroadcastByNodeAction had had a timeout, and it'd be a relatively small change.
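For illustration only, a hedged sketch of what "cancel the request and all its children" might look like from the collector's side. This assumes the stats tasks are made cancellable (they would need to be for this to have any effect) and uses the tasks-cancellation API as I recall it, so treat the method names as an assumption; the plumbing for obtaining the coordinator-side task id is not shown:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.node.tasks.cancel.CancelTasksRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.tasks.TaskId;

// Hedged sketch: cancel a still-running stats task after the collector's
// timeout fires, rather than abandoning it. Assumes the stats tasks are
// cancellable and that the tasks API has this shape.
final class StatsCollectionCanceller {

    static void cancelTimedOutCollection(Client client, TaskId statsTaskId) {
        CancelTasksRequest cancel = new CancelTasksRequest();
        cancel.setTaskId(statsTaskId); // cancelling the parent should also cancel its child tasks
        cancel.setReason("monitoring collection timed out");
        client.admin().cluster().cancelTasks(cancel, ActionListener.wrap(
            response -> { /* cancelled, or the task had already completed */ },
            e -> { /* cancellation itself failed; nothing further to do here */ }
        ));
    }
}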

@DaveCTurner
Contributor Author

Adrien just pointed out something that I had missed earlier: if we simply remove these timeouts then that'll halt all the other collectors too, not just the one that's blocked on a slow node, because we run the collections in series, block the collecting thread to wait for each response in turn, and don't schedule the next collection until the previous cycle has completed 😢
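In other words, the scheduling looks roughly like this (a plain-Java sketch, not the real MonitoringService): collectors run in series on a single thread, and the next cycle is only scheduled once the previous one has finished, so one blocked collector stalls them all:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of the serial collection loop described above.
public class SerialCollectionLoop {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final List<Supplier<Object>> collectors;

    SerialCollectionLoop(List<Supplier<Object>> collectors) {
        this.collectors = collectors;
    }

    void runCycle() {
        for (Supplier<Object> collector : collectors) {
            collector.get(); // blocks the collection thread until this collector returns
        }
        // Only after every collector has returned do we schedule the next cycle.
        scheduler.schedule(this::runCycle, 10, TimeUnit.SECONDS);
    }
}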

@haveTryTwo

We have met the same problem: after a data node was restarted and recovered, the master node was not able to collect indices/stats, cluster/stats, or index/recovery from that data node, and then it ran out of memory (OOM), even though network and disk I/O both looked healthy on the data node according to iostat. Once a new master was elected the problem disappeared, without anything being done to the data node. Is there some problem with the connection used for collecting stats? Has this scenario been reproduced?

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 6, 2021
With elastic#66993 there is now support for coordinator-side timeouts on a
`BroadcastRequest`, which includes requests for node stats and
recoveries. This commit adjusts Monitoring to use these coordinator-side
timeouts where applicable, which will prevent partial stats responses
from accumulating on the master while one or more nodes are not
responding quickly enough. It also enhances the message logged on a
timeout to include the IDs of the nodes which did not respond in time.

Closes elastic#60188.
DaveCTurner added two commits that referenced this issue on Jan 11, 2021, both with the same message as above.