Update segment replication backpressure #3839

kolchfa-aws · 2023-04-24T17:17:08Z

Fixes #3695

Checklist

[ x] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <[email protected]>

natebower

@kolchfa-aws Just a few changes. Thanks!

natebower · 2023-04-24T17:25:33Z

_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md


-Segment replication back-pressure is a per-shard level rejection mechanism that dynamically rejects indexing requests when the number of replica shards in your cluster are falling behind the number of primary shards. With Segment replication back-pressure, indexing requests are rejected when more than half of the replication group is stale, which is defined by the `MAX_ALLOWED_STALE_SHARDS` field. A replica is considered stale if it is behind by more than the defined `MAX_INDEXING_CHECKPOINTS` field, and its current replication lag is over the defined `MAX_REPLICATION_TIME_SETTING` field.
+Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50 percent by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field.


Suggested change

Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50 percent by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field.

Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50% by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field.

natebower · 2023-04-24T17:26:29Z

_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md

-MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication back-pressure mechanism is triggered. Default is `4` checkpoints.
-MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication back-pressure mechanism is triggered. Default is `.5`, which is 50% of a replication group.
+SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication backpressure mechanism. Default is `false`.
+MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.


Suggested change

MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.

MAX_REPLICATION_TIME_SETTING | Time unit | The maximum amount of time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.

natebower · 2023-04-24T17:27:04Z

_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md

-MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication back-pressure mechanism is triggered. Default is `.5`, which is 50% of a replication group.
+SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication backpressure mechanism. Default is `false`.
+MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.
+MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication backpressure mechanism is initiated. Default is `4` checkpoints.


when copying from the primary shard?

Signed-off-by: Fanit Kolchina <[email protected]>

hdhalter

Looks good!

Signed-off-by: Fanit Kolchina <[email protected]>

natebower

@kolchfa-aws Approved with noted changes. Thanks!

natebower · 2023-05-02T12:59:09Z

_tuning-your-cluster/availability-and-recovery/segment-replication/index.md

+1. Enabling segment replication for an existing index requires [reindexing](https://github.com/opensearch-project/OpenSearch/issues/3685).
+1. Rolling upgrades are not currently supported. Full cluster restarts are required when upgrading indexes using segment replication. See [Issue 3881](https://github.com/opensearch-project/OpenSearch/issues/3881).
+1. [Cross-cluster replication](https://github.com/opensearch-project/OpenSearch/issues/4090) does not currently use segment replication to copy between clusters.
+1. Increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245).


This should ideally be a complete sentence.

natebower · 2023-05-02T13:01:59Z

_tuning-your-cluster/availability-and-recovery/segment-replication/index.md

    </tr>
+</table>
+
+As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to a higher throughput at a lower resource cost than document replication in all cluster configurations, not accounting for replication lag. 


The end of the first sentence reads as though the replicas are taking the action. Is that what we mean, or do we mean "the replicas are not required in order to index the larger dataset"? In the second sentence, we could probably just say "leads to higher throughput at lower resource cost".

Yes, in the first sentence we do mean the replicas do not have to take action.

natebower · 2023-05-02T13:06:54Z

_tuning-your-cluster/availability-and-recovery/segment-replication/index.md


-Your results may vary based on the cluster topology, hardware used, shard count, and merge settings. 
-{: .note }
+As the number of replicas grows, the time it takes for primary shards to keep replicas up to date (known as the _replication lag_) increases. This is because with segment replication the segment files are copied directly from primary shards to replicas. 


"As the number of replicas increases, the amount of time required for primary shards to keep replicas up to date (known as the replication lag) also increases. This is because segment replication copies the segment files directly from primary shards to replicas."

natebower · 2023-05-02T13:07:45Z

_tuning-your-cluster/availability-and-recovery/segment-replication/index.md

-1. Integration with remote-backed storage as the source of replication is [currently unsupported](https://github.com/opensearch-project/OpenSearch/issues/4448). 
-1. Read-after-write guarantees: The `wait_until` refresh policy is not compatible with segment replication. If you use the `wait_until` refresh policy while ingesting documents, you'll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/6046).  
-1. System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes.
+The benchmarking results show a non-zero error rate as the number of replicas increases. The error rate indicates that the [segment replication backpressure]({{site.urs}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/backpressure/) mechanism is initiated at the times when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides.


I would remove "at the times"

Signed-off-by: Fanit Kolchina <[email protected]>

* Update segment replication backpressure Signed-off-by: Fanit Kolchina <[email protected]> * Implemented editorial comments Signed-off-by: Fanit Kolchina <[email protected]> * Added benchmarks Signed-off-by: Fanit Kolchina <[email protected]> * Implemented editorial comments Signed-off-by: Fanit Kolchina <[email protected]> --------- Signed-off-by: Fanit Kolchina <[email protected]>

This reverts commit 3eb7a5b.

* Update segment replication backpressure Signed-off-by: Fanit Kolchina <[email protected]> * Implemented editorial comments Signed-off-by: Fanit Kolchina <[email protected]> * Added benchmarks Signed-off-by: Fanit Kolchina <[email protected]> * Implemented editorial comments Signed-off-by: Fanit Kolchina <[email protected]> --------- Signed-off-by: Fanit Kolchina <[email protected]>

Update segment replication backpressure

a2e64bf

Signed-off-by: Fanit Kolchina <[email protected]>

kolchfa-aws self-assigned this Apr 24, 2023

natebower reviewed Apr 24, 2023

View reviewed changes

Implemented editorial comments

de203f5

Signed-off-by: Fanit Kolchina <[email protected]>

kolchfa-aws added the v2.7.0 label Apr 25, 2023

Naarcha-AWS approved these changes Apr 25, 2023

View reviewed changes

hdhalter approved these changes Apr 25, 2023

View reviewed changes

kolchfa-aws added the 6 - Done but waiting to merge PR: The work is done and ready to merge label Apr 25, 2023

Added benchmarks

3656c0d

Signed-off-by: Fanit Kolchina <[email protected]>

kolchfa-aws requested review from cwillum, vagimeli, ananzh, seanneumann and AMoo-Miki as code owners April 27, 2023 23:02

natebower approved these changes May 2, 2023

View reviewed changes

kolchfa-aws added 3 commits May 2, 2023 09:42

Implemented editorial comments

d480cca

Signed-off-by: Fanit Kolchina <[email protected]>

Merge branch 'main' into segrep-issue-update

364dc1f

Merge branch 'main' into segrep-issue-update

ddc5f24

kolchfa-aws merged commit 1ff9a3a into main May 2, 2023

vagimeli added a commit that referenced this pull request May 4, 2023

Revert "Update segment replication backpressure (#3839)"

8d7bd52

This reverts commit 3eb7a5b.

Naarcha-AWS deleted the segrep-issue-update branch March 28, 2024 23:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update segment replication backpressure #3839

Update segment replication backpressure #3839

kolchfa-aws commented Apr 24, 2023

natebower left a comment

natebower Apr 24, 2023

natebower Apr 24, 2023

natebower Apr 24, 2023

hdhalter left a comment

natebower left a comment

natebower May 2, 2023

natebower May 2, 2023

kolchfa-aws May 2, 2023

natebower May 2, 2023

natebower May 2, 2023


		Segment replication back-pressure is a per-shard level rejection mechanism that dynamically rejects indexing requests when the number of replica shards in your cluster are falling behind the number of primary shards. With Segment replication back-pressure, indexing requests are rejected when more than half of the replication group is stale, which is defined by the `MAX_ALLOWED_STALE_SHARDS` field. A replica is considered stale if it is behind by more than the defined `MAX_INDEXING_CHECKPOINTS` field, and its current replication lag is over the defined `MAX_REPLICATION_TIME_SETTING` field.
		Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50 percent by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field.

	MAX_REPLICATION_TIME_SETTING \| Time unit \| The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.
	MAX_REPLICATION_TIME_SETTING \| Time unit \| The maximum amount of time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.

Update segment replication backpressure #3839

Update segment replication backpressure #3839

Conversation

kolchfa-aws commented Apr 24, 2023

Checklist

natebower left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hdhalter left a comment

Choose a reason for hiding this comment

natebower left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment