Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness of monitoring APIs #55550

Closed
imotov opened this issue Apr 21, 2020 · 6 comments
Closed

Improve robustness of monitoring APIs #55550

imotov opened this issue Apr 21, 2020 · 6 comments

Comments

@imotov
Copy link
Contributor

imotov commented Apr 21, 2020

We observed some cases (#50241 for example) where a data node responding slowly can cause accumulation of ResponseContexts for indices:monitor/recovery[n], indices:monitor/stats[n], cluster:monitor/stats[n] and cluster:monitor/xpack/ml/job/stats/get[n] which correspond to _xpack/usage and _nodes/stats calls.

We would like to improve robustness of stats and usage call in case of a slowly responding data nodes by

  1. introducing timeout on stats and usage APIs and/or
  2. making stats and usage APIs tasks cancellable and cancel them if the REST client disconnects
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@imotov
Copy link
Contributor Author

imotov commented Apr 22, 2020

Just an additional data point. It looks like the metricbeat's elaticsearch module has 10s timeout. So if we don't deliver stats in 10 seconds, it doesn't care anymore.

@DaveCTurner
Copy link
Contributor

Option 2 means that clients can safely implement their own end-to-end timeouts, which means we don't need option 1; conversely if we do option 1 then that still requires clients to implement their own timeouts to protect against requests getting lost on the way to Elasticsearch. I therefore prefer option 2 alone.

Relates #60188 which is pretty much the same issue but for the transport-like client that the internal monitoring system uses, which wouldn't be solved by cancelling tasks on a client disconnection since the internal client doesn't disconnect like that.

Also relates #51992.

@dhwanilpatel
Copy link

Hello Elastic team,

Is team working on this or expecting any community contribution?
If team is already working, Can you please provide any timeline for it? It will be very helpful. Thanks!

@DaveCTurner
Copy link
Contributor

Copying here the info that @Bukhtawar intends to work on cancellation of the various stats APIs, see #66992 (comment).

@Bukhtawar you might be interested in #67413 which implements the right sort of behaviour on a different API.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 8, 2021
Today `GET _cluster/stats` can be quite expensive, and is typically
retrieved periodically by monitoring systems (e.g. Metricbeat) that
implement a client-side timeout. When the client times out it closes the
HTTP connection in use. With this commit we react to the close of the
HTTP connection by cancelling the ongoing stats request, avoiding
unnecessary duplicated work.

Relates elastic#55550
DaveCTurner added a commit that referenced this issue Feb 10, 2021
Today `GET _cluster/stats` can be quite expensive, and is typically
retrieved periodically by monitoring systems (e.g. Metricbeat) that
implement a client-side timeout. When the client times out it closes the
HTTP connection in use. With this commit we react to the close of the
HTTP connection by cancelling the ongoing stats request, avoiding
unnecessary duplicated work.

Relates #55550
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 10, 2021
Today `GET _cluster/stats` can be quite expensive, and is typically
retrieved periodically by monitoring systems (e.g. Metricbeat) that
implement a client-side timeout. When the client times out it closes the
HTTP connection in use. With this commit we react to the close of the
HTTP connection by cancelling the ongoing stats request, avoiding
unnecessary duplicated work.

Relates elastic#55550
DaveCTurner added a commit that referenced this issue Feb 10, 2021
Today `GET _cluster/stats` can be quite expensive, and is typically
retrieved periodically by monitoring systems (e.g. Metricbeat) that
implement a client-side timeout. When the client times out it closes the
HTTP connection in use. With this commit we react to the close of the
HTTP connection by cancelling the ongoing stats request, avoiding
unnecessary duplicated work.

Relates #55550
Backport of #68676
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2021
DaveCTurner added a commit that referenced this issue Feb 25, 2021
DaveCTurner added a commit that referenced this issue Feb 25, 2021
@DaveCTurner
Copy link
Contributor

As of 7.13 the following APIs can be cancelled by a client-side timeout:

  • GET _cluster/state
  • GET _cluster/stats
  • GET _stats
  • GET _segments
  • GET _cat/segments
  • GET _recovery
  • GET _cat/recovery

I spoke with the ML folks and we concluded that cluster:monitor/xpack/ml/job/stats/get shouldn't cause an excess of junk on the coordinating node since the per-node responses are pretty small; if we saw them building up it was probably due to something else clogging up the MANAGEMENT threads.

I also took a quick look at GET _xpack/usage and didn't see anything that would cause problems: this is a TransportMasterNodeAction so it doesn't accumulate responses from multiple nodes, and also the response appears to be fairly small.

I believe that's all the APIs that need to be addressed for this issue so I am closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants