From a2e64bf50748f65dddfe47a03280c85f5e207540 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Mon, 24 Apr 2023 13:15:59 -0400 Subject: [PATCH 1/4] Update segment replication backpressure Signed-off-by: Fanit Kolchina --- .../segment-replication/backpressure.md | 21 +++++++++---------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md b/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md index 1ac4496e07..3efad0d56f 100644 --- a/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md +++ b/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md @@ -7,26 +7,26 @@ has_children: false grand_parent: Availability and Recovery --- -## Segment replication back-pressure +## Segment replication backpressure -Segment replication back-pressure is a per-shard level rejection mechanism that dynamically rejects indexing requests when the number of replica shards in your cluster are falling behind the number of primary shards. With Segment replication back-pressure, indexing requests are rejected when more than half of the replication group is stale, which is defined by the `MAX_ALLOWED_STALE_SHARDS` field. A replica is considered stale if it is behind by more than the defined `MAX_INDEXING_CHECKPOINTS` field, and its current replication lag is over the defined `MAX_REPLICATION_TIME_SETTING` field. +Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50 percent by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field. -Replica shards are also monitored to determine whether the shards are stuck or are lagging for an extended period of time. When replica shards are stuck or lagging for more than double the amount of time defined by the `MAX_REPLICATION_TIME_SETTING` field, the shards are removed and then replaced with new replica shards. +Replica shards are also monitored to determine whether the shards are stuck or lagging for an extended period of time. When replica shards are stuck or lagging for more than double the amount of time defined by the `MAX_REPLICATION_TIME_SETTING` field, the shards are removed and replaced with new replica shards. ## Request fields -Segment replication back-pressure is enabled by default. The following are dynamic cluster settings, and can be enabled or disabled using the [cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-api/cluster-settings/) API endpoint. +Segment replication backpressure is disabled by default. To enable it, set `SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED` to `true`. You can update the following dynamic cluster settings using the [cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-api/cluster-settings/) API endpoint. Field | Data type | Description :--- | :--- | :--- -SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication back-pressure mechanism. Default is `true`. 
-MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication back-pressure mechanism is triggered. Default is `5 minutes`. -MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication back-pressure mechanism is triggered. Default is `4` checkpoints. -MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication back-pressure mechanism is triggered. Default is `.5`, which is 50% of a replication group. +SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication backpressure mechanism. Default is `false`. +MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`. +MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication backpressure mechanism is initiated. Default is `4` checkpoints. +MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication backpressure mechanism is initiated. Default is `.5`, which is 50% of a replication group. ## Path and HTTP methods -You can use the segment replication API endpoint to retrieve segment replication back-pressure metrics. +You can use the segment replication API endpoint to retrieve segment replication backpressure metrics as follows: ```bash GET _cat/segment_replication @@ -40,5 +40,4 @@ shardId target_node target_host checkpoints_behind bytes_behind cur [index-1][0] runTask-1 127.0.0.1 0 0b 0s 7ms 0 ``` -- `checkpoints_behind` and `current_lag` directly correlate with `MAX_INDEXING_CHECKPOINTS` and `MAX_REPLICATION_TIME_SETTING`. -- `checkpoints_behind` and `current_lag` metrics are taken into consideration when triggering segment replication back-pressure. +The `checkpoints_behind` and `current_lag` metrics are taken into consideration when initiating segment replication backpressure. They are checked against `MAX_INDEXING_CHECKPOINTS` and `MAX_REPLICATION_TIME_SETTING`, respectively. 
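+
+As a usage sketch, these thresholds can be updated dynamically through the cluster settings API. The `segrep.pressure.*` setting keys below are an assumption based on the field names in the preceding table, so verify the exact keys against your OpenSearch version:
+
+```bash
+PUT _cluster/settings
+{
+  "persistent": {
+    "segrep.pressure.enabled": true,
+    "segrep.pressure.time.limit": "5m",
+    "segrep.pressure.checkpoint.limit": 4,
+    "segrep.pressure.replica.stale.limit": 0.5
+  }
+}
+```
+
+The values shown mirror the documented defaults: a 5-minute replication time limit, a 4-checkpoint limit, and a 50% stale shard limit.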
From de203f581353b632837b09c5184c8f65bec2229e Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Mon, 24 Apr 2023 13:31:51 -0400 Subject: [PATCH 2/4] Implemented editorial comments Signed-off-by: Fanit Kolchina --- .../segment-replication/backpressure.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md b/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md index 3efad0d56f..1b17b081c5 100644 --- a/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md +++ b/_tuning-your-cluster/availability-and-recovery/segment-replication/backpressure.md @@ -9,7 +9,7 @@ grand_parent: Availability and Recovery ## Segment replication backpressure -Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50 percent by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field. +Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50% by default). A replica is considered stale if it is behind the primary shard by the number of checkpoints that exceeds the `MAX_INDEXING_CHECKPOINTS` setting and its current replication lag is greater than the defined `MAX_REPLICATION_TIME_SETTING` field. Replica shards are also monitored to determine whether the shards are stuck or lagging for an extended period of time. When replica shards are stuck or lagging for more than double the amount of time defined by the `MAX_REPLICATION_TIME_SETTING` field, the shards are removed and replaced with new replica shards. @@ -20,7 +20,7 @@ Segment replication backpressure is disabled by default. To enable it, set `SEGM Field | Data type | Description :--- | :--- | :--- SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication backpressure mechanism. Default is `false`. -MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`. +MAX_REPLICATION_TIME_SETTING | Time unit | The maximum amount of time that a replica shard can take to copy from the primary shard. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`. MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication backpressure mechanism is initiated. Default is `4` checkpoints. 
MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication backpressure mechanism is initiated. Default is `.5`, which is 50% of a replication group. From 3656c0d1840f69f7ca5e37d944ce8dbd6b090902 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Thu, 27 Apr 2023 19:02:10 -0400 Subject: [PATCH 3/4] Added benchmarks Signed-off-by: Fanit Kolchina --- _sass/custom/custom.scss | 8 + .../segment-replication/index.md | 359 ++++++++++++------ 2 files changed, 246 insertions(+), 121 deletions(-) diff --git a/_sass/custom/custom.scss b/_sass/custom/custom.scss index 4a0f89b1ba..d368a3c257 100755 --- a/_sass/custom/custom.scss +++ b/_sass/custom/custom.scss @@ -119,6 +119,14 @@ code { box-shadow: 0 1px 2px rgba(0, 0, 0, 0.12), 0 3px 10px rgba(0, 0, 0, 0.08); } +.td-custom { + @extend td; + + &:first-of-type { + border-left: $border $border-color; + } +} + img { @extend .panel; } diff --git a/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md b/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md index 2229366d81..eb1275bf07 100644 --- a/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md +++ b/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md @@ -4,6 +4,7 @@ title: Segment replication nav_order: 70 has_children: true parent: Availability and Recovery +datatable: true redirect_from: - /opensearch/segment-replication/ - /opensearch/segment-replication/index/ @@ -64,185 +65,301 @@ curl -X PUT "$host/_cluster/settings?pretty" -H 'Content-Type: application/json' ``` {% include copy-curl.html %} -## Comparing replication benchmarks +## Considerations + +When using segment replication, consider the following: + +1. Enabling segment replication for an existing index requires [reindexing](https://github.com/opensearch-project/OpenSearch/issues/3685). +1. Rolling upgrades are not currently supported. Full cluster restarts are required when upgrading indexes using segment replication. See [Issue 3881](https://github.com/opensearch-project/OpenSearch/issues/3881). +1. [Cross-cluster replication](https://github.com/opensearch-project/OpenSearch/issues/4090) does not currently use segment replication to copy between clusters. +1. Increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245). +1. Integration with remote-backed storage as the source of replication is [currently unsupported](https://github.com/opensearch-project/OpenSearch/issues/4448). +1. Read-after-write guarantees: The `wait_until` refresh policy is not compatible with segment replication. If you use the `wait_until` refresh policy while ingesting documents, you'll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/6046). +1. System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes. 
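+
+To illustrate the read-after-write consideration above, the following hypothetical request indexes a document using the `wait_until` refresh policy (`my-index` is a placeholder name):
+
+```bash
+PUT /my-index/_doc/1?refresh=wait_until
+{
+  "title": "Segment replication example"
+}
+```
+
+With segment replication enabled, replicas acknowledge after writing to their local translogs, so the indexed document is not guaranteed to be searchable on every replica when this request returns.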
+
+## Benchmarks

During initial benchmarks, segment replication users reported 40% higher throughput than when using document replication with the same cluster setup.

-The following benchmarks were collected with [OpenSearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) using the [`nyc_taxi`](https://github.com/topics/nyc-taxi-dataset) dataset.
-
-The test run was performed on a 10-node (m5.xlarge) cluster with 10 shards and 5 replicas. Each shard was about 3.2GBs in size.
-
-The benchmarking results are listed in the following table.
-
-Metric | Percentile | Document Replication | Segment Replication | Percent difference
-:--- | :--- | :--- | :--- | :---
-Test execution time (minutes) | | 118.00 | 98.00 | 27%
-Index Throughput (number of requests per second) | p0 | 71917.20 | 105740.70 | 47.03%
-Index Throughput (number of requests per second) | p50 | 77392.90 | 110988.70 | 43.41%
-Index Throughput (number of requests per second) | p100 | 93032.90 | 123131.50 | 32.35%
-Query Throughput (number of requests per second) | p0 | 1.748 | 1.744 | -0.23%
-Query Throughput (number of requests per second) | p50 | 1.754 | 1.753 | 0%
-Query Throughput (number of requests per second) | p100 | 1.769 | 1.768 | -0.06%
-CPU (%) | p50 | 37.19 | 25.579 | -31.22%
-CPU (%) | p90 | 94.00 | 88.00 | -6.38%
-CPU (%) | p99 | 100 | 100 | 0%
-CPU (%) | p100 | 100.00 | 100.00 | 0%
-Memory (%) | p50 | 30 | 24.241 | -19.20%
-Memory (%) | p90 | 61.00 | 55.00 | -9.84%
-Memory (%) | p99 | 72 | 62 | -13.89%
-Memory (%) | p100 | 80.00 | 67.00 | -16.25%
-Index Latency (ms) | p50 | 803 | 647.90 | -19.32%
-Index Latency (ms) | p90 | 1215.50 | 908.60 | -25.25%
-Index Latency (ms) | p99 | 9738.70 | 1565 | -83.93%
-Index Latency (ms) | p100 | 21559.60 | 2747.70 | -87.26%
-Query Latency (ms) | p50 | 36.209 | 37.799 | 4.39%
-Query Latency (ms) | p90 | 42.971 | 60.823 | 41.54%
-Query Latency (ms) | p99 | 50.604 | 70.072 | 38.47%
-Query Latency (ms) | p100 | 52.883 | 73.911 | 39.76%
-
+The following benchmarks were collected with [OpenSearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) using the [`stackoverflow`](https://www.kaggle.com/datasets/stackoverflow/stackoverflow) and [`nyc_taxi`](https://github.com/topics/nyc-taxi-dataset) datasets.
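+
+As a rough reproduction sketch, a comparable test run can be launched with OpenSearch Benchmark as follows. The workload name, host, and flags are assumptions that may vary by OpenSearch Benchmark version and environment:
+
+```bash
+opensearch-benchmark execute-test \
+  --workload=nyc_taxis \
+  --target-hosts=https://localhost:9200 \
+  --pipeline=benchmark-only \
+  --client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false
+```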
+
+The benchmarks demonstrate the effect of the following configurations on segment replication:
+
+- [The workload size](#increasing-the-workload-size)
+
+- [The number of primary shards](#increasing-the-number-of-primary-shards)
+
+- [The number of replicas](#increasing-the-number-of-replicas)
+
+Your results may vary based on the cluster topology, hardware used, shard count, and merge settings.
+{: .note }
+
+### Increasing the workload size
+
+The following table lists benchmarking results for the `nyc_taxi` dataset with the following configuration:
+
+- 10 m5.xlarge data nodes
+- 40 primary shards, 1 replica each (80 shards total)
+- 4 primary shards and 4 replica shards per node
+
+{::nomarkdown}
+<table>
+    <tr><th colspan="2"></th><th colspan="3">40 GB primary shard, 80 GB total</th><th colspan="3">240 GB primary shard, 480 GB total</th></tr>
+    <tr><th colspan="2"></th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th></tr>
+    <tr><td colspan="2">Store size</td><td>85.2781</td><td>91.2268</td><td>N/A</td><td>515.726</td><td>558.039</td><td>N/A</td></tr>
+    <tr><td rowspan="3">Index throughput (number of requests per second)</td><td>Minimum</td><td>148,134</td><td>185,092</td><td>24.95%</td><td>100,140</td><td>168,335</td><td>68.10%</td></tr>
+    <tr><td>Median</td><td>160,110</td><td>189,799</td><td>18.54%</td><td>106,642</td><td>170,573</td><td>59.95%</td></tr>
+    <tr><td>Maximum</td><td>175,196</td><td>190,757</td><td>8.88%</td><td>108,583</td><td>172,507</td><td>58.87%</td></tr>
+    <tr><td colspan="2">Error rate</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td></tr>
+</table>
+{:/}
+
+As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to a higher throughput at a lower resource cost than document replication in all cluster configurations, not accounting for replication lag.
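+
+For reference, the following hypothetical request creates an index that mirrors the 40-shard configuration used above. The index name is a placeholder, and the `replication.type` setting assumes the per-index segment replication option described earlier on this page:
+
+```bash
+PUT /my-nyc-taxi-index
+{
+  "settings": {
+    "index": {
+      "number_of_shards": 40,
+      "number_of_replicas": 1,
+      "replication.type": "SEGMENT"
+    }
+  }
+}
+```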
+
+### Increasing the number of primary shards
+
+The following table lists benchmarking results for the `nyc_taxi` dataset for 40 and 100 primary shards.
+
+{::nomarkdown}
+<table>
+    <tr><th colspan="2"></th><th colspan="3">40 primary shards, 1 replica</th><th colspan="3">100 primary shards, 1 replica</th></tr>
+    <tr><th colspan="2"></th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th></tr>
+    <tr><td rowspan="3">Index throughput (number of requests per second)</td><td>Minimum</td><td>148,134</td><td>185,092</td><td>24.95%</td><td>151,404</td><td>167,391</td><td>9.55%</td></tr>
+    <tr><td>Median</td><td>160,110</td><td>189,799</td><td>18.54%</td><td>154,796</td><td>172,995</td><td>10.52%</td></tr>
+    <tr><td>Maximum</td><td>175,196</td><td>190,757</td><td>8.88%</td><td>166,173</td><td>174,655</td><td>4.86%</td></tr>
+    <tr><td colspan="2">Error rate</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td></tr>
+</table>
+{:/}
+
+As the number of primary shards increases, the benefits of segment replication over document replication decrease. While segment replication is still beneficial with a larger number of primary shards, the difference in performance becomes less pronounced because there are more primary shards per node that must copy segment files across the cluster.
+
+### Increasing the number of replicas
+
+The following table lists benchmarking results for the `stackoverflow` dataset for 1 and 9 replicas.
+
+{::nomarkdown}
+<table>
+    <tr><th colspan="2"></th><th colspan="3">10 primary shards, 1 replica</th><th colspan="3">10 primary shards, 9 replicas</th></tr>
+    <tr><th colspan="2"></th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th><th>Document Replication</th><th>Segment Replication</th><th>Percent difference</th></tr>
+    <tr><td rowspan="2">Index throughput (number of requests per second)</td><td>Median</td><td>72,598.10</td><td>90,776.10</td><td>25.04%</td><td>16,537.00</td><td>14,429.80</td><td>−12.74%</td></tr>
+    <tr><td>Maximum</td><td>86,130.80</td><td>96,471.00</td><td>12.01%</td><td>21,472.40</td><td>38,235.00</td><td>78.07%</td></tr>
+    <tr><td rowspan="4">CPU usage (%)</td><td>p50</td><td>17</td><td>18.857</td><td>10.92%</td><td>69.857</td><td>8.833</td><td>−87.36%</td></tr>
+    <tr><td>p90</td><td>76</td><td>82.133</td><td>8.07%</td><td>99</td><td>86.4</td><td>−12.73%</td></tr>
+    <tr><td>p99</td><td>100</td><td>100</td><td>0%</td><td>100</td><td>100</td><td>0%</td></tr>
+    <tr><td>p100</td><td>100</td><td>100</td><td>0%</td><td>100</td><td>100</td><td>0%</td></tr>
+    <tr><td rowspan="4">Memory usage (%)</td><td>p50</td><td>35</td><td>23</td><td>−34.29%</td><td>42</td><td>40</td><td>−4.76%</td></tr>
+    <tr><td>p90</td><td>59</td><td>57</td><td>−3.39%</td><td>59</td><td>63</td><td>6.78%</td></tr>
+    <tr><td>p99</td><td>69</td><td>61</td><td>−11.59%</td><td>66</td><td>70</td><td>6.06%</td></tr>
+    <tr><td>p100</td><td>72</td><td>62</td><td>−13.89%</td><td>69</td><td>72</td><td>4.35%</td></tr>
+    <tr><td colspan="2">Error rate</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>0.00%</td><td>2.30%</td><td>2.30%</td></tr>
+</table>
+{:/} -Your results may vary based on the cluster topology, hardware used, shard count, and merge settings. -{: .note } +As the number of replicas grows, the time it takes for primary shards to keep replicas up to date (known as the _replication lag_) increases. This is because with segment replication the segment files are copied directly from primary shards to replicas. -## Other considerations - -When using segment replication, consider the following: - -1. Enabling segment replication for an existing index requires [reindexing](https://github.com/opensearch-project/OpenSearch/issues/3685). -1. Rolling upgrades are not currently supported. Full cluster restarts are required when upgrading indexes using segment replication. See [Issue 3881](https://github.com/opensearch-project/OpenSearch/issues/3881). -1. [Cross-cluster replication](https://github.com/opensearch-project/OpenSearch/issues/4090) does not currently use segment replication to copy between clusters. -1. Increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245). -1. Integration with remote-backed storage as the source of replication is [currently unsupported](https://github.com/opensearch-project/OpenSearch/issues/4448). -1. Read-after-write guarantees: The `wait_until` refresh policy is not compatible with segment replication. If you use the `wait_until` refresh policy while ingesting documents, you'll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/6046). -1. System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes. +The benchmarking results show a non-zero error rate as the number of replicas increases. The error rate indicates that the [segment replication backpressure]({{site.urs}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/backpressure/) mechanism is initiated at the times when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides. ## Next steps From d480ccaba13e5118fb9a38f8106b3aaeb678a6c1 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Tue, 2 May 2023 09:42:10 -0400 Subject: [PATCH 4/4] Implemented editorial comments Signed-off-by: Fanit Kolchina --- .../segment-replication/index.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md b/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md index eb1275bf07..8d83a046ae 100644 --- a/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md +++ b/_tuning-your-cluster/availability-and-recovery/segment-replication/index.md @@ -72,8 +72,8 @@ When using segment replication, consider the following: 1. Enabling segment replication for an existing index requires [reindexing](https://github.com/opensearch-project/OpenSearch/issues/3685). 1. Rolling upgrades are not currently supported. 
Full cluster restarts are required when upgrading indexes using segment replication. See [Issue 3881](https://github.com/opensearch-project/OpenSearch/issues/3881). 1. [Cross-cluster replication](https://github.com/opensearch-project/OpenSearch/issues/4090) does not currently use segment replication to copy between clusters. -1. Increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245). -1. Integration with remote-backed storage as the source of replication is [currently unsupported](https://github.com/opensearch-project/OpenSearch/issues/4448). +1. Segment replication leads to increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245). +1. Integration with remote-backed storage as the source of replication is [currently not supported](https://github.com/opensearch-project/OpenSearch/issues/4448). 1. Read-after-write guarantees: The `wait_until` refresh policy is not compatible with segment replication. If you use the `wait_until` refresh policy while ingesting documents, you'll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/6046). 1. System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes. @@ -168,7 +168,7 @@ The following table lists benchmarking results for the `nyc_taxi` dataset with t -As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to a higher throughput at a lower resource cost than document replication in all cluster configurations, not accounting for replication lag. +As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to higher throughput at lower resource cost than document replication in all cluster configurations, not accounting for replication lag. ### Increasing the number of primary shards @@ -357,9 +357,9 @@ The following table lists benchmarking results for the `stackoverflow` dataset f {:/} -As the number of replicas grows, the time it takes for primary shards to keep replicas up to date (known as the _replication lag_) increases. This is because with segment replication the segment files are copied directly from primary shards to replicas. +As the number of replicas increases, the amount of time required for primary shards to keep replicas up to date (known as the _replication lag_) also increases. This is because segment replication copies the segment files directly from primary shards to replicas. -The benchmarking results show a non-zero error rate as the number of replicas increases. 
The error rate indicates that the [segment replication backpressure]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/backpressure/) mechanism is initiated at the times when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides.
+The benchmarking results show a non-zero error rate as the number of replicas increases. The error rate indicates that the [segment replication backpressure]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/backpressure/) mechanism is initiated when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides.
 
 ## Next steps