Improve robustness of monitoring APIs #55550

imotov · 2020-04-21T17:26:06Z

We observed some cases (#50241 for example) where a data node responding slowly can cause accumulation of ResponseContexts for indices:monitor/recovery[n], indices:monitor/stats[n], cluster:monitor/stats[n] and cluster:monitor/xpack/ml/job/stats/get[n] which correspond to _xpack/usage and _nodes/stats calls.

We would like to improve the robustness of the stats and usage calls when a data node responds slowly, by:

1. introducing a timeout on the stats and usage APIs, and/or
2. making the stats and usage API tasks cancellable, and cancelling them if the REST client disconnects.


elasticmachine · 2020-04-21T17:26:07Z

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

imotov · 2020-04-22T19:20:28Z

Just an additional data point. It looks like Metricbeat's elasticsearch module has a 10s timeout. So if we don't deliver stats within 10 seconds, it doesn't care anymore.

DaveCTurner · 2020-07-28T15:35:22Z

Option 2 means that clients can safely implement their own end-to-end timeouts, which means we don't need option 1; conversely if we do option 1 then that still requires clients to implement their own timeouts to protect against requests getting lost on the way to Elasticsearch. I therefore prefer option 2 alone.

Relates #60188 which is pretty much the same issue but for the transport-like client that the internal monitoring system uses, which wouldn't be solved by cancelling tasks on a client disconnection since the internal client doesn't disconnect like that.

Also relates #51992.
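To make the client side of option 2 concrete, here is a minimal sketch of a monitoring client that gives up after a timeout and closes its connection. It is only an illustration: the function name `fetch_stats`, the 10-second default (borrowed from the Metricbeat data point above), and the example URL are assumptions, and this is not how Metricbeat itself is implemented. Note that the `timeout` argument of `urllib.request.urlopen` bounds each blocking socket operation rather than the whole exchange, so a strict end-to-end deadline would need extra bookkeeping.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def fetch_stats(url: str, timeout_s: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the server was too slow.

    Hypothetical helper: the name and the 10s default are assumptions.
    """
    try:
        # The timeout applies to each blocking socket operation (connect,
        # send, receive). Leaving the `with` block, or failing inside
        # urlopen, closes the connection, which is the signal a server
        # can react to by cancelling the work.
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except (TimeoutError, socket.timeout):
        # Timed out while waiting for or reading the response.
        return None
    except urllib.error.URLError as err:
        # Timeouts while connecting or sending are wrapped in URLError.
        if isinstance(err.reason, (TimeoutError, socket.timeout)):
            return None
        raise
```

A caller might poll with something like `fetch_stats("http://localhost:9200/_cluster/stats")` (address assumed) and simply skip the collection cycle when it gets `None`.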

dhwanilpatel · 2020-12-28T09:11:14Z

Hello Elastic team,

Is the team working on this, or is a community contribution expected?
If the team is already working on it, can you please provide a timeline? It would be very helpful. Thanks!

DaveCTurner · 2021-01-14T14:12:54Z

Copying here the info that @Bukhtawar intends to work on cancellation of the various stats APIs, see #66992 (comment).

@Bukhtawar you might be interested in #67413 which implements the right sort of behaviour on a different API.

Today `GET _cluster/stats` can be quite expensive, and is typically retrieved periodically by monitoring systems (e.g. Metricbeat) that implement a client-side timeout. When the client times out it closes the HTTP connection in use. With this commit we react to the close of the HTTP connection by cancelling the ongoing stats request, avoiding unnecessary duplicated work. Relates elastic#55550
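The behaviour described above, cancelling the in-flight fan-out when the client's connection closes, can be sketched roughly as follows. This is a toy `asyncio` model and not Elasticsearch's implementation: `node_stats`, `cluster_stats`, `serve_request`, and the `disconnected` event are all hypothetical names, and the per-node work is simulated with a sleep.

```python
import asyncio
from typing import Dict, List, Optional


async def node_stats(node: str, delay_s: float) -> Dict[str, str]:
    # Hypothetical stand-in for one per-node stats request.
    await asyncio.sleep(delay_s)
    return {"node": node, "status": "ok"}


async def cluster_stats(delays: Dict[str, float]) -> List[Dict[str, str]]:
    # Fan out to every node. Cancelling this coroutine also cancels
    # the per-node requests that gather() is still waiting on.
    return await asyncio.gather(*(node_stats(n, d) for n, d in delays.items()))


async def serve_request(
    delays: Dict[str, float], disconnected: asyncio.Event
) -> Optional[List[Dict[str, str]]]:
    work = asyncio.ensure_future(cluster_stats(delays))
    closed = asyncio.ensure_future(disconnected.wait())
    done, _ = await asyncio.wait(
        {work, closed}, return_when=asyncio.FIRST_COMPLETED
    )
    if work in done:
        closed.cancel()
        return work.result()
    # The client went away first: stop the remaining work instead of
    # computing a response that nobody will read.
    work.cancel()
    try:
        await work
    except asyncio.CancelledError:
        pass
    return None
```

In this model a request with one slow node returns `None` shortly after `disconnected` is set, rather than waiting for the slow node to answer.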


DaveCTurner · 2021-02-25T12:07:18Z

As of 7.13 the following APIs can be cancelled by a client-side timeout:

- `GET _cluster/state`
- `GET _cluster/stats`
- `GET _stats`
- `GET _segments`
- `GET _cat/segments`
- `GET _recovery`
- `GET _cat/recovery`

I spoke with the ML folks and we concluded that cluster:monitor/xpack/ml/job/stats/get shouldn't cause an excess of junk on the coordinating node since the per-node responses are pretty small; if we saw them building up it was probably due to something else clogging up the MANAGEMENT threads.

I also took a quick look at GET _xpack/usage and didn't see anything that would cause problems: this is a TransportMasterNodeAction so it doesn't accumulate responses from multiple nodes, and also the response appears to be fairly small.

I believe that's all the APIs that need to be addressed for this issue so I am closing this.

imotov added resiliency :Data Management/Monitoring labels Apr 21, 2020

imotov mentioned this issue Apr 23, 2020

[Metricbeat] Exponential backoff for http timeout in elasticsearch module elastic/beats#17948

Open

rjernst added the Team:Data Management label May 4, 2020

DaveCTurner mentioned this issue Aug 9, 2020

Client-side stats collection timeouts can result in overloaded master #60188

Closed

DaveCTurner mentioned this issue Jan 6, 2021

Cancel task (and descendants) if its originating transport request times out #66992

Open

DaveCTurner mentioned this issue Jan 14, 2021

Add Circuit breaker on Transport ResponseHandlers #66196

Closed

DaveCTurner mentioned this issue Feb 8, 2021

Make GET _cluster/stats cancellable #68676

Merged

DaveCTurner mentioned this issue Feb 10, 2021

Make GET _cluster/stats cancellable #68820

Merged

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2021

Make indices stats requests cancellable

bbfa047

Relates elastic#55550

DaveCTurner mentioned this issue Feb 18, 2021

Make indices stats requests cancellable #69174

Merged

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2021

Make recovery APIs cancellable

1ea1426

Relates elastic#55550

DaveCTurner mentioned this issue Feb 18, 2021

Make recovery APIs cancellable #69177

Merged

stevejgordon mentioned this issue Feb 22, 2021

7.12.0 Meta Ticket elastic/elasticsearch-net#5337

Closed

34 tasks

DaveCTurner added a commit that referenced this issue Feb 25, 2021

Make recovery APIs cancellable (#69177)

61e6734

Relates #55550

DaveCTurner added a commit that referenced this issue Feb 25, 2021

Make recovery APIs cancellable (#69177)

9262389

Relates #55550

DaveCTurner added a commit that referenced this issue Feb 25, 2021

Make indices stats requests cancellable (#69174)

d847647

Relates #55550

DaveCTurner added a commit that referenced this issue Feb 25, 2021

Make indices stats requests cancellable (#69174)

7d11fe6

Relates #55550

DaveCTurner closed this as completed Feb 25, 2021

DaveCTurner mentioned this issue Mar 1, 2021

BaseNodesRequest default timeout is null, but MasterNodeRequest masterNodeTimeout default is 30s; #50641

Closed

stevejgordon mentioned this issue Apr 21, 2021

7.13.0 Meta Ticket elastic/elasticsearch-net#5584

Closed

62 tasks

DaveCTurner mentioned this issue Jan 7, 2022

Stats actions should discard intermediate state on cancellation #82337

Closed

DaveCTurner mentioned this issue Jun 22, 2022

Make GetTrainedModelsStatsAction cancellable #87931

Closed

DaveCTurner mentioned this issue Feb 6, 2023

Add circuit breaker for response sizes #67478

Open

DaveCTurner mentioned this issue Sep 16, 2024

Implement remote cluster CCS telemetry #112478

Merged
