Improve robustness of monitoring APIs #55550
Pinging @elastic/es-core-features (:Core/Features/Monitoring)
Just an additional data point: it looks like Metricbeat's elasticsearch module has a 10s timeout, so if we don't deliver stats within 10 seconds it no longer cares about the response.
Option 2 means that clients can safely implement their own end-to-end timeouts, which means we don't need option 1. Conversely, if we do option 1 then clients still need to implement their own timeouts to protect against requests getting lost on the way to Elasticsearch. I therefore prefer option 2 alone. Relates #60188, which is pretty much the same issue but for the transport-like client that the internal monitoring system uses. That case wouldn't be solved by cancelling tasks on a client disconnection, since the internal client doesn't disconnect like that. Also relates #51992.
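To make the "end-to-end timeout" idea concrete, here is a minimal client-side sketch in Python. It is an illustration only, not Metricbeat's or any Elasticsearch client's actual code: `call_with_deadline` and the `make_request` argument are hypothetical names, and the two request coroutines in `demo` merely simulate a fast response and a request that never gets an answer.

```python
import asyncio


async def call_with_deadline(make_request, deadline_s):
    """Run one request under a single overall deadline.

    `make_request` is a hypothetical zero-argument coroutine function that
    performs the whole exchange (connect, send, read the response).
    Returns the response, or None if the deadline passed first.
    """
    try:
        return await asyncio.wait_for(make_request(), timeout=deadline_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the request coroutine at this point;
        # a real client would close its connection as part of that cancellation.
        return None


async def demo():
    async def fast_stats():
        await asyncio.sleep(0.01)  # simulated quick response
        return {"status": "green"}

    async def lost_request():
        await asyncio.sleep(3600)  # simulated request that is never answered

    return (
        await call_with_deadline(fast_stats, 0.5),
        await call_with_deadline(lost_request, 0.05),
    )
```

The deadline covers the whole exchange rather than a single socket read, so it also protects against a request that never reaches the server.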
Hello Elastic team, is the team working on this, or are you expecting a community contribution?
Copying here the info that @Bukhtawar intends to work on cancellation of the various stats APIs, see #66992 (comment). @Bukhtawar you might be interested in #67413 which implements the right sort of behaviour on a different API.
Today `GET _cluster/stats` can be quite expensive, and is typically retrieved periodically by monitoring systems (e.g. Metricbeat) that implement a client-side timeout. When the client times out it closes the HTTP connection in use. With this commit we react to the close of the HTTP connection by cancelling the ongoing stats request, avoiding unnecessary duplicated work. Relates elastic#55550
Today `GET _cluster/stats` can be quite expensive, and is typically retrieved periodically by monitoring systems (e.g. Metricbeat) that implement a client-side timeout. When the client times out it closes the HTTP connection in use. With this commit we react to the close of the HTTP connection by cancelling the ongoing stats request, avoiding unnecessary duplicated work. Relates #55550 Backport of #68676
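The behaviour described in the commit message, cancelling in-flight work when the client's connection closes, can be sketched as follows. This is a simplified Python simulation, not Elasticsearch's Java implementation: `handle_request`, `compute_stats`, and `connection_closed` are hypothetical names, and an `asyncio.Event` stands in for the HTTP channel's close notification.

```python
import asyncio


async def handle_request(compute_stats, connection_closed):
    """Run `compute_stats()`, cancelling it if the connection closes first.

    Both arguments are hypothetical stand-ins: `compute_stats` is a coroutine
    function doing the expensive work, and `connection_closed` is an
    asyncio.Event that is set when the client's HTTP connection goes away.
    """
    work = asyncio.create_task(compute_stats())
    closed = asyncio.create_task(connection_closed.wait())
    done, _ = await asyncio.wait({work, closed}, return_when=asyncio.FIRST_COMPLETED)

    # Whichever task lost the race is no longer needed.
    loser = closed if work in done else work
    loser.cancel()
    try:
        await loser
    except asyncio.CancelledError:
        pass

    # If the work finished, return its result; otherwise nobody is listening.
    return work.result() if work in done else None


async def demo():
    progress = []

    async def compute_stats():
        for shard in range(100):  # simulated per-shard work, about 1s in total
            await asyncio.sleep(0.01)
            progress.append(shard)
        return "stats"

    closed = asyncio.Event()
    # Simulate the client timing out and closing its connection after 50 ms.
    asyncio.get_running_loop().call_later(0.05, closed.set)

    result = await handle_request(compute_stats, closed)
    seen_at_cancel = len(progress)
    await asyncio.sleep(0.05)  # give abandoned work a chance to continue, if any
    return result, seen_at_cancel, len(progress)
```

In `demo`, the progress counter stops growing once the connection is reported closed, which is the "avoiding unnecessary duplicated work" effect: a monitoring client that retries periodically no longer leaves abandoned requests running behind it.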
As of 7.13 the following APIs can be cancelled by a client-side timeout:
I spoke with the ML folks and we concluded that I also took a quick look at I believe that's all the APIs that need to be addressed for this issue so I am closing this.
We observed some cases (#50241 for example) where a data node responding slowly can cause accumulation of ResponseContexts for `indices:monitor/recovery[n]`, `indices:monitor/stats[n]`, `cluster:monitor/stats[n]` and `cluster:monitor/xpack/ml/job/stats/get[n]`, which correspond to `_xpack/usage` and `_nodes/stats` calls. We would like to improve robustness of stats and usage calls in case of slowly responding data nodes by