Improve pending indexing metrics and back pressure #59263

Tim-Brooks · 2020-07-08T23:57:26Z

Currently indexing back pressure is limited to the size of the write queue. This does not effectively reflect the amount of outstanding indexing work for a node. We would like to add new mechanisms which better reflect the amount of outstanding work.

Target 7.9

Indexing metrics and back pressure

In 7.9 we are adding metrics about the number of indexing request bytes outstanding at each point in the indexing process (coordinating, primary, and replication). These metrics will be exposed in the node stats API. Additionally, we will introduce a new setting indexing_pressure.memory.limit which allows a maximum number of bytes to be outstanding. This setting will be 10% of the heap by default. Once 10% of a node's heap is consumed by outstanding indexing bytes, we will start rejecting new coordinating and primary requests.

Since a failed replication operation can fail a replica, replication bytes will get a higher limit of 1.5x the configured value, and only replication bytes count against that higher limit. If replication bytes rise to high levels, the node will stop accepting new coordinating and primary operations until the replication workload has dropped.
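The accounting described above can be sketched roughly as follows. This is an illustration only, not the Elasticsearch implementation: the class and method names are hypothetical, and it assumes that coordinating and primary requests are checked against all outstanding bytes (including replication bytes), while replication requests are checked only against replication bytes.

```python
from dataclasses import dataclass


class RejectedError(Exception):
    """Raised when accepting a request would exceed a byte limit."""


@dataclass
class IndexingPressureSketch:
    # Hypothetical model of the limits described in this issue.
    heap_bytes: int
    limit_percent: int = 10            # indexing_pressure.memory.limit default: 10% of heap
    coordinating_and_primary: int = 0  # outstanding coordinating + primary bytes
    replica: int = 0                   # outstanding replication bytes

    @property
    def limit(self) -> int:
        return self.heap_bytes * self.limit_percent // 100

    @property
    def replica_limit(self) -> int:
        # Replication work is allowed 1.5x the base limit.
        return self.limit * 3 // 2

    def acquire_coordinating_or_primary(self, size: int) -> None:
        # Assumption: every outstanding byte counts here, so a large
        # replication backlog also blocks new coordinating/primary work.
        if self.coordinating_and_primary + self.replica + size > self.limit:
            raise RejectedError("coordinating/primary request rejected")
        self.coordinating_and_primary += size

    def acquire_replica(self, size: int) -> None:
        # Only replication bytes count against the 1.5x limit.
        if self.replica + size > self.replica_limit:
            raise RejectedError("replica request rejected")
        self.replica += size

    def release_coordinating_or_primary(self, size: int) -> None:
        self.coordinating_and_primary -= size

    def release_replica(self, size: int) -> None:
        self.replica -= size
```

With a 1000 byte heap in this model, the base limit is 100 bytes and the replication limit is 150 bytes. A replication backlog of 120 bytes is still accepted, but it blocks any new coordinating or primary request until it is released.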

7.9 Node stats API with human readable enabled

      "indexing_pressure": {
        "memory": {
          "current": {
            "combined_coordinating_and_primary": "0b",
            "combined_coordinating_and_primary_in_bytes": 0,
            "coordinating": "0b",
            "coordinating_in_bytes": 0,
            "primary": "0b",
            "primary_in_bytes": 0,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "0b",
            "all_in_bytes": 0
          },
          "total": {
            "combined_coordinating_and_primary": "8.1kb",
            "combined_coordinating_and_primary_in_bytes": 8325,
            "coordinating": "8.1kb",
            "coordinating_in_bytes": 8325,
            "primary": "10.4kb",
            "primary_in_bytes": 10725,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "8.1kb",
            "all_in_bytes": 8325,
            "coordinating_rejections": 0,
            "primary_rejections": 0,
            "replica_rejections": 0
          }
        }
      }
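
As a rough sketch of how these fields might be consumed, the snippet below summarizes the indexing_pressure section from an already-parsed node stats response. The function name is hypothetical, and the outer "nodes" wrapper keyed by node id is an assumption about the response shape that is not shown in the fragment above.

```python
from typing import Any, Dict

REJECTION_KEYS = (
    "coordinating_rejections",
    "primary_rejections",
    "replica_rejections",
)


def summarize_indexing_pressure(node_stats: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Return outstanding bytes and total rejections per node."""
    summary: Dict[str, Dict[str, int]] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        memory = node.get("indexing_pressure", {}).get("memory")
        if memory is None:
            continue  # stats section missing for this node
        current = memory.get("current", {})
        total = memory.get("total", {})
        summary[node_id] = {
            "outstanding_bytes": current.get("all_in_bytes", 0),
            "outstanding_replica_bytes": current.get("replica_in_bytes", 0),
            "rejections": sum(total.get(key, 0) for key in REJECTION_KEYS),
        }
    return summary


# Trimmed sample using values from the output above; "node-1" is a made-up id.
sample = {
    "nodes": {
        "node-1": {
            "indexing_pressure": {
                "memory": {
                    "current": {"all_in_bytes": 0, "replica_in_bytes": 0},
                    "total": {
                        "all_in_bytes": 8325,
                        "coordinating_rejections": 0,
                        "primary_rejections": 0,
                        "replica_rejections": 0,
                    },
                }
            }
        }
    }
}
result = summarize_indexing_pressure(sample)
```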

Replication Retries

In order to mitigate the potential of transient disruptions failing a replica, we will enable replication retries at the primary level. When an operation fails because of a connection error, circuit breaking, rejection, etc., the primary will retry until the new timeout setting (indices.replication.retry_timeout) is exhausted.
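
The retry behaviour can be sketched as a loop bounded by a deadline. This is a minimal illustration, not the actual Elasticsearch code: the function and exception names are hypothetical, and the exponential backoff between attempts is an assumption that the issue does not specify.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientReplicationError(Exception):
    """Stand-in for connection errors, circuit breaking, rejections, etc."""


def replicate_with_retries(
    send: Callable[[], T],
    retry_timeout_s: float,
    initial_delay_s: float = 0.05,
    max_delay_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `send` on transient errors until `retry_timeout_s` has elapsed."""
    deadline = clock() + retry_timeout_s
    delay = initial_delay_s
    while True:
        try:
            return send()
        except TransientReplicationError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Timeout exhausted: surface the failure to the caller,
                # which would then fail the replica.
                raise
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)  # assumed backoff policy
```

Any other exception type propagates immediately, so only failures classified as transient are retried.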

Enable replication retries (Retry failed replication due to transient errors #55633)
Add documentation.

Target 7.10

Evaluate mechanisms for back pressure related to the CPU cost of indexing

elasticmachine · 2020-07-08T23:57:27Z

Pinging @elastic/es-distributed (:Distributed/CRUD)

This commit increases the default write queue size to 10000. This is to allow a greater number of pending indexing requests. This work is safe as we have added additional memory limits. Relates to #59263.

AnthonyFoiani-at · 2021-06-23T19:23:11Z

Hi! I'm curious if the related CPU-based enhancement ever landed in 7.10?

Target 7.10: Evaluate mechanisms for back pressure related to the CPU cost of indexing

howardhuanghua · 2021-09-01T09:30:11Z

Hi @tbrooks8, could I ask why indexing_pressure.memory.limit is a static setting? Could users change it dynamically for different scenarios?

Tim-Brooks added the >enhancement and :Distributed Indexing/CRUD labels Jul 8, 2020

elasticmachine added the Team:Distributed (Obsolete) label Jul 8, 2020

Tim-Brooks added the Meta label Jul 8, 2020

Tim-Brooks changed the title ~~Improve Indexing Back pressure~~ Improve pending indexing metrics and back pressure Jul 8, 2020

ywelsch added release highlight v7.9.0 and removed v7.9.0 labels Jul 14, 2020

Tim-Brooks mentioned this issue Jul 14, 2020

Increase default write queue size #59464

Merged

This was referenced Jul 14, 2020

Increase default write queue size #59559

Merged

Implement rejections in WriteMemoryLimits #58885

Merged

henningandersen mentioned this issue Aug 6, 2020

Push back on bulk operations #51035

Closed

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Aug 13, 2020

Add documentation about replica retries.

39e9189

This is related to elastic#59263.

Tim-Brooks mentioned this issue Aug 13, 2020

Add documentation about replica retries. #61125

Closed

DustinChaloupka mentioned this issue Aug 18, 2020

Upgrade to Elasticsearch 7.9.0 vvanholl/elasticsearch-prometheus-exporter#285

Merged

mayya-sharipova mentioned this issue Oct 26, 2020

Request to support quota ratelimit #64102

Closed

DaveCTurner mentioned this issue Jul 30, 2021

Should the write queue size be a byte size rather than a number of items? #51336

Closed

turnUpTheChill mentioned this issue Nov 12, 2021

adding code to ingest indexing pressure stats elastic/beats#28479

Merged
