add wlm feature overview #8632

Merged
@@ -0,0 +1,203 @@
---
layout: default
title: Workload management
nav_order: 70
has_children: true
parent: Availability and recovery
---

Introduced 2.18
{: .label .label-purple }

# Workload management

Workload management allows you to group search traffic and isolate system resources so that specific requests cannot overuse them. It offers the following benefits:

- Tenant-level admission control and reactive query management. When resource usage exceeds configured limits, it automatically identifies and cancels demanding queries, ensuring fair resource distribution.

- Tenant-level isolation within the cluster for search workloads, operating at a node level.

## Installing workload management

To install workload management, use the following command:

```bash
./bin/opensearch-plugin install workload-management
```
{% include copy-curl.html %}

## Permissions

Only users with administrator-level permissions can use workload management.

## Query groups

A _query group_ is a logical group of tasks with defined resource limits. System administrators can dynamically manage query groups using the Workload management APIs. These query groups can be used to make search requests with resource limits.

### Operating modes

The following operating modes determine the operating level for the query group:

- **Disabled mode**: Workload management is disabled.

- **Enabled mode**: Workload management is enabled and cancels and rejects queries once the query group's configured thresholds are reached.

- **Monitor_only mode** (Default): Workload management monitors tasks but does not cancel or reject any queries.

### Example request

The following example adds a query group named `analytics`:

```json
PUT _wlm/query_group
{
  "name": "analytics",
  "resiliency_mode": "enforced",
  "resource_limits": {
    "cpu": 0.4,
    "memory": 0.2
  }
}
```
{% include copy-curl.html %}

When creating a query group, make sure that the sum of the resource limits for a single resource, such as `cpu` or `memory`, across all query groups does not exceed `1`.
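The limit rule above can be checked before any request is sent. The following Python sketch is only an illustration: the helper names and the second query group (`reporting`) are hypothetical and are not part of OpenSearch. It assumes that the limit applies to each resource summed over every query group:

```python
from collections import defaultdict


# Hypothetical helpers, not part of OpenSearch. They mirror the rule that the
# combined limit for one resource should not exceed 1.
def total_resource_limits(query_groups):
    """Sum the configured limit for each resource across all query groups."""
    totals = defaultdict(float)
    for group in query_groups:
        for resource, limit in group["resource_limits"].items():
            totals[resource] += limit
    return dict(totals)


def over_allocated_resources(query_groups, max_total=1.0):
    """Return the sorted names of resources whose combined limit exceeds max_total."""
    totals = total_resource_limits(query_groups)
    # A small tolerance avoids false positives caused by floating-point rounding.
    return sorted(name for name, total in totals.items() if total > max_total + 1e-9)


planned_groups = [
    {"name": "analytics", "resource_limits": {"cpu": 0.4, "memory": 0.2}},
    # Assumed second group, used only to show an over-allocation.
    {"name": "reporting", "resource_limits": {"cpu": 0.7, "memory": 0.3}},
]

print(over_allocated_resources(planned_groups))  # ['cpu']
```

In this sketch, the combined `cpu` limit is 1.1, so `cpu` is reported, while the combined `memory` limit of 0.5 is within bounds.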

### Example response

OpenSearch responds with the set resource limits and the `_id` for the query group:

```json
{
  "_id": "preXpc67RbKKeCyka72_Gw",
  "name": "analytics",
  "resiliency_mode": "enforced",
  "resource_limits": {
    "cpu": 0.4,
    "memory": 0.2
  },
  "updated_at": 1726270184642
}
```

## Using `queryGroupId`

You can associate a query request with a `queryGroupId` to manage and allocate resources within the limits defined by the query group. Requests that carry this ID are routed and tracked under the query group, which keeps them within its resource quotas and task limits.

The following example query uses the `queryGroupId` to ensure that the query stays under that query group's resource limits:

```json
GET testindex/_search
Host: localhost:9200
Content-Type: application/json
queryGroupId: preXpc67RbKKeCyka72_Gw
{
  "query": {
    "match": {
      "field_name": "value"
    }
  }
}
```
{% include copy-curl.html %}

## Workload management settings

The following settings can be used to customize workload management using the `_cluster/settings` API:

| **Setting name** | **Description** |
| :--- | :--- |
| `wlm.query_group.duress_streak` | Determines the node duress threshold. Once the threshold is reached, the node is marked as `in duress`. |
| `wlm.query_group.enforcement_interval` | Defines the monitoring interval. |
| `wlm.query_group.mode` | Defines the [operating mode](#operating-modes). |
| `wlm.query_group.node.memory_rejection_threshold` | Defines the query group level `memory` threshold. When the threshold is reached, the request is rejected. |
| `wlm.query_group.node.cpu_rejection_threshold` | Defines the query group level `cpu` threshold. When the threshold is reached, the request is rejected. |
| `wlm.query_group.node.memory_cancellation_threshold` | Controls whether the node is considered in duress when the `memory` threshold is reached and sets the effective request cancellation threshold based on `memory` usage. |
| `wlm.query_group.node.cpu_cancellation_threshold` | Controls whether the node is considered in duress when the `cpu` threshold is reached and sets the effective request cancellation threshold based on `cpu` usage. |

When setting rejection and cancellation thresholds, remember that the rejection threshold for a resource should always be less than the cancellation threshold.
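As a rough illustration of this ordering rule, the following Python sketch validates a pair of thresholds for each resource and builds a request body for the `_cluster/settings` API. The helper name and the threshold values (`0.8` and `0.9`) are assumptions chosen for the example, not recommended values. The setting names come from the preceding table:

```python
import json

RESOURCES = ("cpu", "memory")


# Hypothetical helper, not part of OpenSearch.
def build_threshold_settings(thresholds):
    """Validate thresholds and return a body for PUT _cluster/settings.

    `thresholds` maps each resource name to a (rejection, cancellation) pair.
    """
    settings = {}
    for resource in RESOURCES:
        rejection, cancellation = thresholds[resource]
        # Enforce the rule stated above: rejection must be below cancellation.
        if not rejection < cancellation:
            raise ValueError(
                f"{resource}: rejection threshold {rejection} must be less "
                f"than cancellation threshold {cancellation}"
            )
        settings[f"wlm.query_group.node.{resource}_rejection_threshold"] = rejection
        settings[f"wlm.query_group.node.{resource}_cancellation_threshold"] = cancellation
    return {"persistent": settings}


# Illustrative values only.
body = build_threshold_settings({"cpu": (0.8, 0.9), "memory": (0.8, 0.9)})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `PUT _cluster/settings` request. Swapping the two values for either resource raises a `ValueError` instead of producing a body.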

## Workload management stats API

The Workload management stats API returns workload management metrics for a query group, using the following method:

```json
GET _wlm/stats
```
{% include copy-curl.html %}

### Example response

```json
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "XXXXXXYYYYYYYY",
  "A3L9EfBIQf2anrrUhh_goA": {
    "query_groups": {
      "16YGxFlPRdqIO7K4EACJlw": {
        "total_completions": 33570,
        "total_rejections": 0,
        "total_cancellations": 0,
        "cpu": {
          "current_usage": 0.03319935314357281,
          "cancellations": 0,
          "rejections": 0
        },
        "memory": {
          "current_usage": 0.002306486276211217,
          "cancellations": 0,
          "rejections": 0
        }
      },
      "DEFAULT_QUERY_GROUP": {
        "total_completions": 42572,
        "total_rejections": 0,
        "total_cancellations": 0,
        "cpu": {
          "current_usage": 0,
          "cancellations": 0,
          "rejections": 0
        },
        "memory": {
          "current_usage": 0,
          "cancellations": 0,
          "rejections": 0
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

### Response body fields

| Field name | Description |
| :--- | :--- |
| `total_completions` | The total number of request completions in this `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `total_rejections` | The total number of request rejections in this `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `total_cancellations` | The total number of cancellations in this `query_group` at the given node. This includes all shard-level and coordinator-level requests. |
| `cpu` | The `cpu` resource type stats for the `query_group`. |
| `memory` | The `memory` resource type stats for the `query_group`. |

### Resource type stats

| Field name | Description |
| :--- | :--- |
| `current_usage` | The resource usage for the `query_group` at the given node based on the last run of the monitoring thread. This value is updated based on the `wlm.query_group.enforcement_interval`. |
| `cancellations` | The cancellation count as a result of the cancellation threshold being reached. |
| `rejections` | The rejection count as a result of the rejection threshold being reached. |
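To show how these fields fit together, the following Python sketch totals the completion, rejection, and cancellation counters for each query group across all nodes in a stats response. The helper name is hypothetical, and the sketch assumes that node entries are the top-level objects containing a `query_groups` key, as in the preceding example response. The sample data is a trimmed copy of that response:

```python
# Hypothetical helper, not part of OpenSearch.
def summarize_wlm_stats(stats):
    """Total the per-query-group counters across every node in a stats response."""
    summary = {}
    for node_stats in stats.values():
        # Skip top-level entries such as "_nodes" and "cluster_name".
        if not isinstance(node_stats, dict) or "query_groups" not in node_stats:
            continue
        for group_id, group in node_stats["query_groups"].items():
            totals = summary.setdefault(
                group_id, {"completions": 0, "rejections": 0, "cancellations": 0}
            )
            totals["completions"] += group["total_completions"]
            totals["rejections"] += group["total_rejections"]
            totals["cancellations"] += group["total_cancellations"]
    return summary


# Trimmed copy of the example response (resource type stats omitted).
sample_response = {
    "_nodes": {"total": 1, "successful": 1, "failed": 0},
    "cluster_name": "XXXXXXYYYYYYYY",
    "A3L9EfBIQf2anrrUhh_goA": {
        "query_groups": {
            "16YGxFlPRdqIO7K4EACJlw": {
                "total_completions": 33570,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
            "DEFAULT_QUERY_GROUP": {
                "total_completions": 42572,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
        }
    },
}

for group_id, totals in summarize_wlm_stats(sample_response).items():
    print(group_id, totals)
```

Because the counters are reported per node, summing them in this way gives a cluster-wide view for each query group.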
