Improve pending indexing metrics and back pressure #59263

Open
11 of 13 tasks
Tim-Brooks opened this issue Jul 8, 2020 · 3 comments
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement Meta release highlight Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@Tim-Brooks
Contributor

Tim-Brooks commented Jul 8, 2020

Currently, indexing back pressure is limited to the size of the write queue, which does not effectively reflect the amount of outstanding indexing work on a node. We would like to add new mechanisms that better reflect the amount of outstanding work.

Target 7.9

Indexing metrics and back pressure

In 7.9 we are adding metrics about the number of indexing request bytes outstanding at each stage of the indexing process (coordinating, primary, and replication). These metrics will be exposed in the node stats API. We will also introduce a new setting, indexing_pressure.memory.limit, which caps the number of bytes that may be outstanding. This setting will default to 10% of the heap. Once 10% of a node's heap is consumed by outstanding indexing bytes, the node will start rejecting new coordinating and primary requests.

Since a failed replication operation can fail a replica, we will assign replication bytes a higher limit of 1.5x the base limit. Only replication bytes can trigger this higher limit. So if replication bytes increase to high levels, the node will stop accepting new coordinating and primary operations until the replication workload has dropped.
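
The limits described above can be sketched roughly as follows. This is an illustration only, not the Elasticsearch implementation: the class and method names are hypothetical, and it assumes that outstanding replica bytes are counted when checking new coordinating and primary requests, which is one reading of the behaviour described above.

```python
from dataclasses import dataclass


class RejectedError(Exception):
    """Raised when a request would push outstanding bytes past its limit."""


@dataclass
class IndexingPressureSketch:
    # Hypothetical model of the accounting; names are not from Elasticsearch.
    heap_bytes: int
    limit_percent: int = 10            # default from the issue: 10% of the heap
    coordinating_and_primary: int = 0  # outstanding coordinating + primary bytes
    replica: int = 0                   # outstanding replication bytes

    @property
    def limit(self) -> int:
        return self.heap_bytes * self.limit_percent // 100

    @property
    def replica_limit(self) -> int:
        # Replication gets 1.5x the base limit.
        return self.limit * 3 // 2

    def mark_coordinating_or_primary(self, size: int) -> None:
        # Assumption: replica bytes count here, so heavy replication load
        # also blocks new coordinating and primary work.
        if self.coordinating_and_primary + self.replica + size > self.limit:
            raise RejectedError("coordinating/primary limit exceeded")
        self.coordinating_and_primary += size

    def mark_replica(self, size: int) -> None:
        # Only replica bytes are compared against the higher limit.
        if self.replica + size > self.replica_limit:
            raise RejectedError("replica limit exceeded")
        self.replica += size

    def release_coordinating_or_primary(self, size: int) -> None:
        self.coordinating_and_primary -= size

    def release_replica(self, size: int) -> None:
        self.replica -= size
```

With a 1000-byte heap in this sketch, the base limit is 100 bytes and the replica limit is 150 bytes. A node holding 120 replica bytes would still accept up to 30 more replica bytes, but would reject any new coordinating or primary request until replica bytes are released.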

7.9 Node stats API with human readable enabled

      "indexing_pressure": {
        "memory": {
          "current": {
            "combined_coordinating_and_primary": "0b",
            "combined_coordinating_and_primary_in_bytes": 0,
            "coordinating": "0b",
            "coordinating_in_bytes": 0,
            "primary": "0b",
            "primary_in_bytes": 0,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "0b",
            "all_in_bytes": 0
          },
          "total": {
            "combined_coordinating_and_primary": "8.1kb",
            "combined_coordinating_and_primary_in_bytes": 8325,
            "coordinating": "8.1kb",
            "coordinating_in_bytes": 8325,
            "primary": "10.4kb",
            "primary_in_bytes": 10725,
            "replica": "0b",
            "replica_in_bytes": 0,
            "all": "8.1kb",
            "all_in_bytes": 8325,
            "coordinating_rejections": 0,
            "primary_rejections": 0,
            "replica_rejections": 0
          }
        }
      }
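
As a rough sketch of how these stats might be consumed, the helpers below read the `indexing_pressure` object shown above. The function names are hypothetical, and `limit_in_bytes` is an assumed input that the caller supplies (for example, 10% of the node's heap), since the output above does not include the limit itself.

```python
def memory_pressure_ratio(indexing_pressure: dict, limit_in_bytes: int) -> float:
    """Fraction of the (caller-supplied) limit currently in use."""
    current = indexing_pressure["memory"]["current"]
    return current["all_in_bytes"] / limit_in_bytes


def total_rejections(indexing_pressure: dict) -> int:
    """Sum of the three rejection counters under memory.total."""
    total = indexing_pressure["memory"]["total"]
    keys = ("coordinating_rejections", "primary_rejections", "replica_rejections")
    return sum(total[key] for key in keys)


# A trimmed copy of the example output above, keeping only the fields used.
sample = {
    "memory": {
        "current": {"all_in_bytes": 0},
        "total": {
            "coordinating_rejections": 0,
            "primary_rejections": 0,
            "replica_rejections": 0,
        },
    }
}

# Assumed limit of 100 MiB, purely for illustration.
ratio = memory_pressure_ratio(sample, limit_in_bytes=100 * 1024 * 1024)
rejections = total_rejections(sample)
```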

Replication Retries

To mitigate the risk of transient disruptions failing a replica, we will enable replication retries at the primary level. When an operation fails because of a connection error, circuit breaking, rejection, etc., the primary will retry until the new timeout setting (indices.replication.retry_timeout) is exhausted.
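
The retry behaviour described above can be sketched as a retry-until-deadline loop. This is a minimal illustration under stated assumptions, not the actual implementation: `retry_until_timeout` and `TransientError` are hypothetical names, and a fixed delay between attempts is assumed because the issue does not specify a backoff policy.

```python
import time


class TransientError(Exception):
    """Stand-in for a connection error, circuit breaker trip, or rejection."""


def retry_until_timeout(operation, timeout_s, delay_s=0.05,
                        clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or timeout_s has elapsed.

    clock and sleep are injectable so the loop can be exercised without
    real waiting. Non-transient exceptions propagate immediately.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()
        except TransientError:
            # Give up once the retry window is exhausted; the caller then
            # sees the last transient failure.
            if clock() >= deadline:
                raise
            sleep(delay_s)
```

In this sketch, `timeout_s` plays the role of indices.replication.retry_timeout: only once it is exhausted does the failure surface to the caller.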

Target 7.10

  • Evaluate mechanisms for back pressure related to the CPU cost of indexing
@Tim-Brooks Tim-Brooks added >enhancement :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. labels Jul 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jul 8, 2020
@Tim-Brooks Tim-Brooks added the Meta label Jul 8, 2020
@Tim-Brooks Tim-Brooks changed the title Improve Indexing Back pressure Improve pending indexing metrics and back pressure Jul 8, 2020
Tim-Brooks added a commit that referenced this issue Jul 14, 2020
This commit increases the default write queue size to 10000. This is to
allow a greater number of pending indexing requests. This work is safe
as we have added additional memory limits. Relates to #59263.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Jul 14, 2020
This commit increases the default write queue size to 10000. This is to
allow a greater number of pending indexing requests. This work is safe
as we have added additional memory limits. Relates to elastic#59263.
Tim-Brooks added a commit that referenced this issue Jul 15, 2020
This commit increases the default write queue size to 10000. This is to
allow a greater number of pending indexing requests. This work is safe
as we have added additional memory limits. Relates to #59263.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Aug 13, 2020
@AnthonyFoiani-at

Hi! I'm curious if the related CPU-based enhancement ever landed in 7.10?

Target 7.10: Evaluate mechanisms for back pressure related to the CPU cost of indexing

@howardhuanghua
Contributor

howardhuanghua commented Sep 1, 2021

Hi @tbrooks8, could you explain why indexing_pressure.memory.limit is a static setting? Could users change it dynamically for different scenarios?


5 participants