Add cluster awareness and decommission docs (#2438)

* Add cluster awareness and decommission docs Signed-off-by: Naarcha-AWS <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Bukhtawar Khan <[email protected]>
* Edit technical feedback Signed-off-by: Naarcha-AWS <[email protected]>
* Add new cluster awareness examples Signed-off-by: Naarcha-AWS <[email protected]>
* Add technical feedback Signed-off-by: Naarcha-AWS <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Alice Williams <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Alice Williams <[email protected]>
* Add Caroline's feedback Signed-off-by: Naarcha-AWS <[email protected]>
* Add one more tweak Signed-off-by: Naarcha-AWS <[email protected]>
* Update _ml-commons-plugin/cluster-settings.md Co-authored-by: Heather Halter <[email protected]>
* Update _ml-commons-plugin/cluster-settings.md Co-authored-by: Heather Halter <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _ml-commons-plugin/cluster-settings.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _ml-commons-plugin/cluster-settings.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-decommission.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-awareness.md Co-authored-by: Nate Bower <[email protected]>
* Update _api-reference/cluster-decommission.md Co-authored-by: Nate Bower <[email protected]>
* Add editoiral feedback Signed-off-by: Naarcha-AWS <[email protected]>
* Fix typos Signed-off-by: Naarcha-AWS <[email protected]>
* Final editorial note Signed-off-by: Naarcha-AWS <[email protected]>

Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>
Co-authored-by: Alice Williams <[email protected]>
Co-authored-by: Heather Halter <[email protected]>
Co-authored-by: Nate Bower <[email protected]>
opensearch-project · Jan 25, 2023 · 891be1f
1 parent c5cf556
commit 891be1f
Showing 6 changed files with 245 additions and 9 deletions.
diff --git a/_api-reference/cluster-awareness.md b/_api-reference/cluster-awareness.md
@@ -0,0 +1,121 @@
+---
+layout: default
+title: Cluster routing and awareness
+nav_order: 16
+---
+
+# Cluster routing and awareness
+
+You can use weights per awareness attribute to control the distribution of search or HTTP traffic across zones. This is commonly used for zonal deployments, heterogeneous instances, and routing traffic away from zones during zonal failure.
+
+## HTTP and path methods
+
+```
+PUT /_cluster/routing/awareness/<attribute>/weights
+GET /_cluster/routing/awareness/<attribute>/weights?local
+GET /_cluster/routing/awareness/<attribute>/weights
+```
+
+## Path parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+attribute | String | The name of the awareness attribute, usually `zone`. The attribute name must match the values listed in the request body when assigning weights to zones.
+
+## Request body parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with 3 zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic. 
+_version | String | Implements optimistic concurrency control (OCC) through versioning. The parameter uses simple versioning, such as `1`, and increments upward based on each subsequent modification. This allows any servers from which a request originates to validate whether or not a zone has been modified. 
+
+
+In the following example request body, for every 100 requests, `zone_1` and `zone_2` receive 50 requests each, whereas `zone_3` is prevented from receiving requests:
+
+```
+{ 
+      "weights":
+      {
+        "zone_1": "5", 
+        "zone_2": "5", 
+        "zone_3": "0"
+      },
+      "_version" : 1
+}
+```
+
+## Example: Weighted round robin search
+
+The following example request creates a round robin shard allocation for search traffic by weighting `zone_1` and `zone_2` equally and giving `zone_3` a weight of `0`:
+
+### Request
+
+PUT /_cluster/routing/awareness/zone/weights
+{ 
+      "weights":
+      {
+        "zone_1": "1", 
+        "zone_2": "1", 
+        "zone_3": "0"
+      },
+      "_version" : 1
+}
+
+### Response
+
+```
+{
+     "acknowledged": true
+}
+```
+
+
+## Example: Getting weights for all zones
+
+The following example request gets weights for all zones.
+
+### Request
+
+```
+GET /_cluster/routing/awareness/zone/weights
+```
+
+### Response
+
+OpenSearch responds with the weight of each zone:
+
+```json
+{
+      "weights":
+      {
+
+        "zone_1": "1.0", 
+        "zone_2": "1.0", 
+        "zone_3": "0.0"
+      },
+      "_version":1
+}
+```
+
+## Example: Deleting weights
+
+You can remove your weight ratio for each zone using the `DELETE` method.
+
+### Request
+
+```
+DELETE /_cluster/routing/awareness/zone/weights
+```
+
+### Response
+
+```json
+{
+   "_version":1
+}
+```
+
+## Next steps
+
+- For more information about zone decommissioning, see [Cluster decommission]({{site.url}}{{site.baseurl}}/api-reference/cluster-decommission/).
+- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
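
The weight arithmetic and request body described in `cluster-awareness.md` can be sketched in Python as follows. This is a minimal illustration, not part of the commit: the helper names `expected_share` and `build_weights_body` are hypothetical, the body only mirrors the `weights` and `_version` fields from the examples, and the computed shares are expected values rather than a guarantee about any individual batch of requests.

```python
import json


def expected_share(weights, total_requests=100):
    """Split total_requests across zones in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one zone needs a positive weight")
    return {zone: total_requests * w / total_weight for zone, w in weights.items()}


def build_weights_body(weights, version):
    """Build a JSON body for PUT /_cluster/routing/awareness/<attribute>/weights.

    Weights are serialized as strings to match the example request bodies;
    whether numeric values are also accepted is not stated in the docs.
    """
    return json.dumps({
        "weights": {zone: str(w) for zone, w in weights.items()},
        "_version": version,
    })


if __name__ == "__main__":
    # A 2:3:5 ratio across three zones, as in the parameter description.
    print(expected_share({"zone_1": 2, "zone_2": 3, "zone_3": 5}))
    print(build_weights_body({"zone_1": 1, "zone_2": 1, "zone_3": 0}, version=1))
```

A zone with weight `0` gets an expected share of zero, which matches the documented behavior that such a zone receives no search traffic.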
diff --git a/_api-reference/cluster-decommission.md b/_api-reference/cluster-decommission.md
@@ -0,0 +1,80 @@
+---
+layout: default
+title: Cluster decommission 
+nav_order: 20
+---
+
+# Cluster decommission
+
+The cluster decommission operation adds support for decommissioning based on awareness. It greatly benefits multi-zone deployments, where awareness attributes, such as `zones`, can aid in applying new upgrades to a cluster in a controlled fashion. This is especially useful during outages, in which case, you can decommission the unhealthy zone to prevent replication requests from stalling and prevent your request backlog from becoming too large.
+
+For more information about allocation awareness, see [Shard allocation awareness]({{site.url}}{{site.baseurl}}/opensearch/cluster/#shard-allocation-awareness).
+
+
+## HTTP and path methods
+
+```
+PUT  /_cluster/decommission/awareness/{awareness_attribute_name}/{awareness_attribute_value}
+GET  /_cluster/decommission/awareness/{awareness_attribute_name}/_status
+DELETE /_cluster/decommission/awareness
+```
+
+## URL parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+awareness_attribute_name | String | The name of the awareness attribute, usually `zone`.
+awareness_attribute_value | String | The value of the awareness attribute. For example, if you have shards allocated in two different zones, you can give each zone a value of `zone-a` or `zone-b`. The cluster decommission operation decommissions the zone listed in the method.
+
+
+## Example: Decommissioning and recommissioning a zone
+
+You can use the following example requests to decommission and recommission a zone:
+
+### Request
+
+The following example request decommissions `zone-a`:
+
+```
+PUT /_cluster/decommission/awareness/zone/zone-a
+```
+
+If you want to recommission a decommissioned zone, you can use the `DELETE` method:
+
+```
+DELETE /_cluster/decommission/awareness
+```
+
+### Response
+
+
+```json
+{
+      "acknowledged": true
+}
+```
+
+## Example: Getting zone decommission status
+
+The following example request returns the decommission status of all zones.
+
+### Request
+
+```
+GET /_cluster/decommission/awareness/zone/_status
+```
+
+
+### Response
+
+```json
+{
+     "zone-1": "INIT | DRAINING | IN_PROGRESS | SUCCESSFUL | FAILED"
+}
+```
+
+
+## Next steps
+
+- For more information about zone awareness and weight, see [Cluster awareness]({{site.url}}{{site.baseurl}}/api-reference/cluster-awareness/).
+- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
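
The decommission endpoints in `cluster-decommission.md` can be assembled and their status responses interpreted with a short Python sketch. The function names are hypothetical and no HTTP client is involved. The status values come from the example response; treating `SUCCESSFUL` and `FAILED` as the only terminal states is an assumption, not something the docs state.

```python
from urllib.parse import quote

# Status values as listed in the example _status response.
STATUSES = {"INIT", "DRAINING", "IN_PROGRESS", "SUCCESSFUL", "FAILED"}
# Assumption: only these two statuses mean the operation has stopped progressing.
TERMINAL = {"SUCCESSFUL", "FAILED"}


def decommission_path(attribute, value):
    """Path for PUT: decommission one value (zone) of an awareness attribute."""
    return (
        "/_cluster/decommission/awareness/"
        f"{quote(attribute, safe='')}/{quote(value, safe='')}"
    )


def status_path(attribute):
    """Path for GET: decommission status for an awareness attribute."""
    return f"/_cluster/decommission/awareness/{quote(attribute, safe='')}/_status"


def is_finished(status_response):
    """Return True when every zone in a _status response is in a terminal state."""
    for zone, status in status_response.items():
        if status not in STATUSES:
            raise ValueError(f"unknown status {status!r} for {zone!r}")
    return bool(status_response) and all(
        status in TERMINAL for status in status_response.values()
    )


if __name__ == "__main__":
    print(decommission_path("zone", "zone-a"))
    print(status_path("zone"))
    print(is_finished({"zone-a": "DRAINING"}))
```

Percent-encoding the path segments keeps an attribute value that contains a slash or space from changing the shape of the URL.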
diff --git a/_api-reference/cluster-health.md b/_api-reference/cluster-health.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Cluster health
-nav_order: 16
+nav_order: 17
 ---
 
 # Cluster health
@@ -47,6 +47,7 @@ wait_for_events | Enum | Wait until all currently queued events with the given p
 wait_for_no_relocating_shards | Boolean | Whether to wait until there are no relocating shards in the cluster. Default is false.
 wait_for_no_initializing_shards | Boolean | Whether to wait until there are no initializing shards in the cluster. Default is false.
 wait_for_status | Enum | Wait until the cluster health reaches the specified status or better. Supported values are `green`, `yellow`, and `red`.
+weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with three zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic. 
 
 #### Sample request
 

diff --git a/_api-reference/cluster-settings.md b/_api-reference/cluster-settings.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Cluster settings
-nav_order: 17
+nav_order: 18
 ---
 
 # Cluster settings

diff --git a/_api-reference/count.md b/_api-reference/count.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Count
-nav_order: 20
+nav_order: 21
 ---
 
 # Count

diff --git a/_ml-commons-plugin/cluster-settings.md b/_ml-commons-plugin/cluster-settings.md
@@ -12,7 +12,7 @@ To enhance and customize your OpenSearch cluster for machine learning (ML), you
 
 ## Run tasks and models on ML nodes only
 
-If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. Don't set as `false` on a production cluster. 
+If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. We recommend that you do not set this value to `false` on production clusters. 
 
 ### Setting
 
@@ -27,7 +27,7 @@ plugins.ml_commons.only_run_on_ml_node: true
 
 ## Dispatch tasks to ML node 
 
-`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers all ML nodes' runtime information, such as JVM heap memory usage and running tasks, then dispatches tasks to the ML node with the least load.
+`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers runtime information from all ML nodes, like JVM heap memory usage and running tasks, and then dispatches the tasks to the ML node with the lowest load.
 
 
 ### Setting
@@ -43,7 +43,9 @@ plugins.ml_commons.task_dispatch_policy: round_robin
 - Value range: `round_robin` or `least_load`
 
 
-## Set sync up job intervals 
+## Set sync job intervals 
+
+When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.
 
 When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular sync up job to sync up newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.
 
@@ -60,7 +62,7 @@ plugins.ml_commons.sync_up_job_interval_in_seconds: 10
 
 ## Predict monitoring requests
 
-Controls how many predict requests are monitored on one node. If set to `0`, OpenSearch clears all monitoring predict requests in the node's cache, and does not monitor predict requests from that point forward.
+Controls how many upload model tasks can run in parallel on one node. If set to `0`, you cannot upload models to any node.
 
 ### Setting
 
@@ -92,7 +94,7 @@ plugins.ml_commons.max_upload_model_tasks_per_node: 10
 
 ## Load model tasks per node
 
-Controls how many load model tasks can run in parallel on one node. If set to `0`, you cannot load models to any node.
+Controls how many load model tasks can run in parallel on one node. If set to 0, you cannot load models to any node.
 
 ### Setting
 
@@ -107,7 +109,7 @@ plugins.ml_commons.max_load_model_tasks_per_node: 10
 
 ## Add trusted URL
 
-The default value allows uploading a model file from any `http`, `https`, `ftp`, or local file. You can change this value to restrict trusted model URL.
+The default value allows you to upload a model file from any http/https/ftp/local file. You can change this value to restrict trusted model URLs.
 
 
 ### Setting
@@ -120,3 +122,35 @@ plugins.ml_commons.trusted_url_regex: ^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=
 
 - Default value: `^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=~_\|!:,.;]*[-a-zA-Z0-9+&@#/%=~_\|]`
 - Value range: Java regular expression (regex) string
+
+## Assign task timeout
+
+Assigns how long in seconds an ML task will live. After the timeout, the task will fail.
+
+### Setting
+
+```
+plugins.ml_commons.ml_task_timeout_in_seconds: 600
+```
+
+### Values
+
+- Default value: 600
+- Value range: [1, 86400]
+
+## Set native memory threshold 
+
+Sets a circuit breaker that checks all system memory usage before running an ML task. If the native memory exceeds the threshold, OpenSearch throws an exception and stops running any ML task. 
+
+Values are based on the percentage of memory available. When set to `0`, no ML tasks will run. When set to `100`, the circuit breaker closes and no threshold exists.
+
+### Setting
+
+```
+plugins.ml_commons.native_memory_threshold: 90
+```
+
+### Values
+
+- Default value: 90
+- Value range: [0, 100]
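
The value ranges for the two new ML Commons settings can be checked before they are sent to a cluster. The following Python sketch is illustrative only: `DOCUMENTED_RANGES` and `build_settings_body` are hypothetical names, and wrapping the values in a `persistent` block for `PUT _cluster/settings` assumes that these settings can be updated dynamically, which this page does not state.

```python
# Ranges copied from the "Values" lists for the two settings added in this commit.
DOCUMENTED_RANGES = {
    "plugins.ml_commons.ml_task_timeout_in_seconds": (1, 86400),
    "plugins.ml_commons.native_memory_threshold": (0, 100),
}


def build_settings_body(settings):
    """Validate values against the documented ranges and wrap them for
    a cluster settings update (assumed to be dynamically updatable)."""
    for name, value in settings.items():
        if name not in DOCUMENTED_RANGES:
            raise KeyError(f"no documented range for {name}")
        low, high = DOCUMENTED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return {"persistent": dict(settings)}


if __name__ == "__main__":
    # The documented default values pass validation.
    print(build_settings_body({
        "plugins.ml_commons.ml_task_timeout_in_seconds": 600,
        "plugins.ml_commons.native_memory_threshold": 90,
    }))
```

Validating on the client side surfaces an out-of-range value, such as a timeout of `0`, before any request is made.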