From 9b354eb15e519b33ea702f8e7cc67bde1aa4fb6e Mon Sep 17 00:00:00 2001 From: Heather Halter Date: Tue, 24 Oct 2023 17:54:13 -0700 Subject: [PATCH 1/4] added searchbp metrics Signed-off-by: Heather Halter --- _monitoring-your-cluster/pa/reference.md | 184 ++++++++++++++++++++--- 1 file changed, 165 insertions(+), 19 deletions(-) diff --git a/_monitoring-your-cluster/pa/reference.md b/_monitoring-your-cluster/pa/reference.md index 4d9e85328b..98a0f17118 100644 --- a/_monitoring-your-cluster/pa/reference.md +++ b/_monitoring-your-cluster/pa/reference.md @@ -821,27 +821,173 @@ The following metrics are relevant to the cluster as a whole and do not require +## Relevant dimensions: `NodeID`, `searchbp_mode` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Metric Description
searchbp_shard_stats_cancellationCount + The number of tasks marked for cancellation on the shard task. +
searchbp_shard_stats_limitReachedCount + The number of times when the cancellable task total exceeded the set cancellation threshold on the shard task. +
searchbp_shard_stats_resource_heap_usage_cancellationCount + The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the shard task. +
searchbp_shard_stats_resource_heap_usage_currentMax + The maximum heap usage for tasks currently running on the shard task. +
searchbp_shard_stats_resource_heap_usage_rollingAvg + The rolling average heap usage for the _n_ most recent tasks on the shard task. The default value for _n_ is 100. +
searchbp_shard_stats_resource_cpu_usage_cancellationCount + The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the shard task. +
searchbp_shard_stats_resource_cpu_usage_currentMax + The maximum CPU time for all tasks currently running on the node on the shard task. +
searchbp_shard_stats_resource_cpu_usage_currentAvg + The average CPU time for all tasks currently running on the node on the shard task. +
searchbp_shard_stats_resource_elaspedtime_usage_cancellationCount + The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the shard task. +
searchbp_shard_stats_resource_elaspedtime_usage_currentMax + The maximum elapsed time for all tasks currently running on the node on the shard task. +
searchbp_shard_stats_resource_elaspedtime_usage_currentAvg + The average elapsed time for all tasks currently running on the node on the shard task. +
searchbp_task_stats_cancellationCount + The number of tasks marked for cancellation on the search task level. +
searchbp_task_stats_limitReachedCount + The number of times when the cancellable task total exceeded the set cancellation threshold on the search task level. +
searchbp_task_stats_resource_heap_usage_cancellationCount + The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the search task level. +
searchbp_task_stats_resource_heap_usage_currentMax + The maximum heap usage for tasks currently running on the search task level. +
searchbp_task_stats_resource_heap_usage_rollingAvg + The rolling average heap usage for the _n_ most recent tasks on the search task level. The default value for _n_ is 10. +
searchbp_task_stats_resource_cpu_usage_cancellationCount + The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the search task level. +
searchbp_task_stats_resource_cpu_usage_currentMax + The maximum CPU time for all tasks currently running on the node on the search task level. +
searchbp_task_stats_resource_cpu_usage_currentAvg + The average CPU time for all tasks currently running on the node on the search task level. +
searchbp_task_stats_resource_elaspedtime_usage_cancellationCount + The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the search task level. +
searchbp_task_stats_resource_elaspedtime_usage_currentMax + The maximum elapsed time for all tasks currently running on the node on the search task level. +
searchbp_task_stats_resource_elaspedtime_usage_currentAvg + The average elapsed time for all tasks currently running on the node on the search task level. +
+ ## Dimensions reference | Dimension | Return values | |----------------------|-------------------------------------------------| -| ShardID | The ID of the shard, for example, `1`. | -| IndexName | The name of the index, for example, `my-index`. | -| Operation | The type of operation, for example, `shardbulk`. | -| ShardRole | The shard role, for example, `primary` or `replica`. | -| Exception | OpenSearch exceptions, for example, `org.opensearch.index_not_found_exception`. | -| Indices | The list of indexes in the request URL. | -| HTTPRespCode | The response code from OpenSearch, for example, `200`. | -| MemType | The memory type, for example, `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, or `Heap`. | -| DiskName | The name of the disk, for example, `sda1`. | -| DestAddr | The destination address, for example, `010015AC`. | -| Direction | The direction, for example, `in` or `out`. | -| ThreadPoolType | The OpenSearch thread pools, for example, `index`, `search`, or `snapshot`. | -| CBType | The circuit breaker type, for example, `accounting`, `fielddata`, `in_flight_requests`, `parent`, or `request`. | -| ClusterManagerTaskInsertOrder| The order in which the task was inserted, for example, `3691`. | -| ClusterManagerTaskPriority | The priority of the task, for example, `URGENT`. OpenSearch executes higher-priority tasks before lower-priority ones, regardless of `insert_order`. | -| ClusterManagerTaskType | The task type, for example, `shard-started`, `create-index`, `delete-index`, `refresh-mapping`, `put-mapping`, `CleanupSnapshotRestoreState`, or `Update snapshot state`. | -| ClusterManagerTaskMetadata | The metadata for the task (if any). | -| CacheType | The cache type, for example, `Field_Data_Cache`, `Shard_Request_Cache`, or `Node_Query_Cache`. | - +| `ShardID` | The ID of the shard, for example, `1`. | +| `IndexName` | The name of the index, for example, `my-index`. 
| +| `Operation` | The type of operation, for example, `shardbulk`. | +| `ShardRole` | The shard role, for example, `primary` or `replica`. | +| `Exception` | OpenSearch exceptions, for example, `org.opensearch.index_not_found_exception`. | +| `Indices` | The list of indexes in the request URL. | +| `HTTPRespCode` | The response code from OpenSearch, for example, `200`. | +| `MemType` | The memory type, for example, `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, or `Heap`. | +| `DiskName` | The name of the disk, for example, `sda1`. | +| `DestAddr` | The destination address, for example, `010015AC`. | +| `Direction` | The direction, for example, `in` or `out`. | +| `ThreadPoolType` | The OpenSearch thread pools, for example, `index`, `search`, or `snapshot`. | +| `CBType` | The circuit breaker type, for example, `accounting`, `fielddata`, `in_flight_requests`, `parent`, or `request`. | +| `ClusterManagerTaskInsertOrder`| The order in which the task was inserted, for example, `3691`. | +| `ClusterManagerTaskPriority` | The priority of the task, for example, `URGENT`. OpenSearch executes higher-priority tasks before lower-priority ones, regardless of `insert_order`. | +| `ClusterManagerTaskType` | The task type, for example, `shard-started`, `create-index`, `delete-index`, `refresh-mapping`, `put-mapping`, `CleanupSnapshotRestoreState`, or `Update snapshot state`. | +| `ClusterManagerTaskMetadata` | The metadata for the task (if any). | +| `CacheType` | The cache type, for example, `Field_Data_Cache`, `Shard_Request_Cache`, or `Node_Query_Cache`. | +| `NodeID` | The ID of the node. | +| `Searchbp_mode` | The search backpressure mode, for example, `monitor_only` (default), `enforced`, or `disabled`. 
| From 9e55fc164ccfe8c663f6e9723d443e0d71656aa6 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Mon, 5 Feb 2024 15:34:18 -0600 Subject: [PATCH 2/4] Update reference.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _monitoring-your-cluster/pa/reference.md | 46 ++++++++++++------------ 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/_monitoring-your-cluster/pa/reference.md b/_monitoring-your-cluster/pa/reference.md index 45f50e37d0..89201bb293 100644 --- a/_monitoring-your-cluster/pa/reference.md +++ b/_monitoring-your-cluster/pa/reference.md @@ -754,133 +754,133 @@ The following metrics are relevant to the cluster as a whole and do not require - searchbp_shard_stats_cancellationCount + SearchBP_Shard_Stats_CancellationCount - The number of tasks marked for cancellation on the shard task. + The number of tasks marked for cancellation at the shard task. - searchbp_shard_stats_limitReachedCount + SearchBP_Shard_Stats_LimitReachedCount The number of times when the cancellable task total exceeded the set cancellation threshold on the shard task. - searchbp_shard_stats_resource_heap_usage_cancellationCount + SearchBP_Shard_Stats_Resource_Heap_Usage_CancellationCount The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the shard task. - searchbp_shard_stats_resource_heap_usage_currentMax + SearchBP_Shard_Stats_Resource_Heap_Usage_CurrentMax The maximum heap usage for tasks currently running on the shard task. - searchbp_shard_stats_resource_heap_usage_rollingAvg + SearchBP_Shard_Stats_Resource_Heap_Usage_RollingAvg The rolling average heap usage for the _n_ most recent tasks on the shard task. The default value for _n_ is 100. 
- searchbp_shard_stats_resource_cpu_usage_cancellationCount + SearchBP_Shard_Stats_Resource_CPU_Usage_CancellationCount The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the shard task. - searchbp_shard_stats_resource_cpu_usage_currentMax + SearchBP_Shard_Stats_Resource_CPU_Usage_CurrentMax The maximum CPU time for all tasks currently running on the node on the shard task. - searchbp_shard_stats_resource_cpu_usage_currentAvg + SearchBP_Shard_Stats_Resource_CPU_Usage_CurrentAvg The average CPU time for all tasks currently running on the node on the shard task. - searchbp_shard_stats_resource_elaspedtime_usage_cancellationCount + SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CancellationCount The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the shard task. - searchbp_shard_stats_resource_elaspedtime_usage_currentMax + SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CurrentMax The maximum elapsed time for all tasks currently running on the node on the shard task. - searchbp_shard_stats_resource_elaspedtime_usage_currentAvg + SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CurrentAvg The average elapsed time for all tasks currently running on the node on the shard task. - searchbp_task_stats_cancellationCount + Searchbp_Task_Stats_CancellationCount The number of tasks marked for cancellation on the search task level. - searchbp_task_stats_limitReachedCount + SearchBP_Task_Stats_LimitReachedCount The number of times when the cancellable task total exceeded the set cancellation threshold on the search task level. - searchbp_task_stats_resource_heap_usage_cancellationCount + SearchBP_Task_Stats_Resource_Heap_Usage_CancellationCount The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the search task level. 
- searchbp_task_stats_resource_heap_usage_currentMax + SearchBP_Task_Stats_Resource_Heap_Usage_CurrentMax The maximum heap usage for tasks currently running on the search task level. - searchbp_task_stats_resource_heap_usage_rollingAvg + SearchBP_Task_Stats_Resource_Heap_Usage_RollingAvg The rolling average heap usage for the _n_ most recent tasks on the search task level. The default value for _n_ is 10. - searchbp_task_stats_resource_cpu_usage_cancellationCount + SearchBP_Task_Stats_Resource_CPU_Usage_CancellationCount The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the search task level. - searchbp_task_stats_resource_cpu_usage_currentMax + SearchBP_Task_Stats_Resource_CPU_Usage_CurrentMax The maximum CPU time for all tasks currently running on the node on the search task level. - searchbp_task_stats_resource_cpu_usage_currentAvg + SearchBP_Task_Stats_Resource_CPU_Usage_CurrentAvg The average CPU time for all tasks currently running on the node on the search task level. - searchbp_task_stats_resource_elaspedtime_usage_cancellationCount + SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CancellationCount The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the search task level. - searchbp_task_stats_resource_elaspedtime_usage_currentMax + SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CurrentMax The maximum elapsed time for all tasks currently running on the node on the search task level. - searchbp_task_stats_resource_elaspedtime_usage_currentAvg + SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CurrentAvg The average elapsed time for all tasks currently running on the node on the search task level. 
From a78e7b4791f892d4d8de1b4387505a7df8e2ed78 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Mon, 5 Feb 2024 15:50:04 -0600 Subject: [PATCH 3/4] Update reference.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _monitoring-your-cluster/pa/reference.md | 42 ++++++++++++------------ 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/_monitoring-your-cluster/pa/reference.md b/_monitoring-your-cluster/pa/reference.md index 89201bb293..6d6eccf5f8 100644 --- a/_monitoring-your-cluster/pa/reference.md +++ b/_monitoring-your-cluster/pa/reference.md @@ -756,91 +756,91 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Shard_Stats_CancellationCount - The number of tasks marked for cancellation at the shard task. + The number of tasks marked for cancellation at the shard task level. SearchBP_Shard_Stats_LimitReachedCount - The number of times when the cancellable task total exceeded the set cancellation threshold on the shard task. + The number of times when the cancellable task total exceeded the set cancellation threshold at the shard task level. SearchBP_Shard_Stats_Resource_Heap_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the shard task. + The number of tasks marked for cancellation because of excessive heap usage since the node last restarted at the shard task level. SearchBP_Shard_Stats_Resource_Heap_Usage_CurrentMax - The maximum heap usage for tasks currently running on the shard task. + The maximum heap usage for tasks currently running at the shard task level. SearchBP_Shard_Stats_Resource_Heap_Usage_RollingAvg - The rolling average heap usage for the _n_ most recent tasks on the shard task. The default value for _n_ is 100. + The rolling average heap usage for the _n_ most recent tasks at the shard task level. 
The default value for _n_ is 100. SearchBP_Shard_Stats_Resource_CPU_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the shard task. + The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted at the shard task level. SearchBP_Shard_Stats_Resource_CPU_Usage_CurrentMax - The maximum CPU time for all tasks currently running on the node on the shard task. + The maximum CPU time for all tasks currently running on the node at the shard task level. SearchBP_Shard_Stats_Resource_CPU_Usage_CurrentAvg - The average CPU time for all tasks currently running on the node on the shard task. + The average CPU time for all tasks currently running on the node at the shard task level. SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the shard task. + The number of tasks marked for cancellation because of excessive time elapsed since the node last restarted at the shard task level. SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CurrentMax - The maximum elapsed time for all tasks currently running on the node on the shard task. + The maximum time elapsed for all tasks currently running on the node at the shard task level. SearchBP_Shard_Stats_Resource_ElaspedTime_Usage_CurrentAvg - The average elapsed time for all tasks currently running on the node on the shard task. + The average time elapsed for all tasks currently running on the node at the shard task level. Searchbp_Task_Stats_CancellationCount - The number of tasks marked for cancellation on the search task level. + The number of tasks marked for cancellation at the search task level. SearchBP_Task_Stats_LimitReachedCount - The number of times when the cancellable task total exceeded the set cancellation threshold on the search task level. 
+ The number of times when the cancellable task total exceeded the set cancellation threshold at the search task level. SearchBP_Task_Stats_Resource_Heap_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the search task level. + The number of tasks marked for cancellation because of excessive heap usage since the node last restarted at the search task level. SearchBP_Task_Stats_Resource_Heap_Usage_CurrentMax - The maximum heap usage for tasks currently running on the search task level. + The maximum heap usage for tasks currently running at the search task level. @@ -852,37 +852,37 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Task_Stats_Resource_CPU_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the search task level. + The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted at the search task level. SearchBP_Task_Stats_Resource_CPU_Usage_CurrentMax - The maximum CPU time for all tasks currently running on the node on the search task level. + The maximum CPU time for all tasks currently running on the node at the search task level. SearchBP_Task_Stats_Resource_CPU_Usage_CurrentAvg - The average CPU time for all tasks currently running on the node on the search task level. + The average CPU time for all tasks currently running on the node at the search task level. SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CancellationCount - The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the search task level. + The number of tasks marked for cancellation because of excessive time elapsed since the node last restarted at the search task level. 
SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CurrentMax - The maximum elapsed time for all tasks currently running on the node on the search task level. + The maximum time elapsed for all tasks currently running on the node at the search task level. SearchBP_Task_Stats_Resource_ElaspedTime_Usage_CurrentAvg - The average elapsed time for all tasks currently running on the node on the search task level. + The average time elapsed for all tasks currently running on the node at the search task level. From d4ac065b793baf8ffdb95bc12f3dea4ec737d6af Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Tue, 6 Feb 2024 15:58:16 -0600 Subject: [PATCH 4/4] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _monitoring-your-cluster/pa/reference.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_monitoring-your-cluster/pa/reference.md b/_monitoring-your-cluster/pa/reference.md index 6d6eccf5f8..8b076b1ba5 100644 --- a/_monitoring-your-cluster/pa/reference.md +++ b/_monitoring-your-cluster/pa/reference.md @@ -762,7 +762,7 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Shard_Stats_LimitReachedCount - The number of times when the cancellable task total exceeded the set cancellation threshold at the shard task level. + The number of times that the cancellable task total exceeded the set cancellation threshold at the shard task level. @@ -780,7 +780,7 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Shard_Stats_Resource_Heap_Usage_RollingAvg - The rolling average heap usage for the _n_ most recent tasks at the shard task level. The default value for _n_ is 100. + The rolling average heap usage for the _n_ most recent tasks at the shard task level. The default value for _n_ is `100`. 
@@ -828,7 +828,7 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Task_Stats_LimitReachedCount - The number of times when the cancellable task total exceeded the set cancellation threshold at the search task level. + The number of times that the cancellable task total exceeded the set cancellation threshold at the search task level. @@ -846,7 +846,7 @@ The following metrics are relevant to the cluster as a whole and do not require SearchBP_Task_Stats_Resource_Heap_Usage_RollingAvg - The rolling average heap usage for the _n_ most recent tasks on the search task level. The default value for _n_ is 10. + The rolling average heap usage for the _n_ most recent tasks at the search task level. The default value for _n_ is `10`. @@ -899,7 +899,7 @@ The following metrics are relevant to the cluster as a whole and do not require | `ShardRole` | The shard role, for example, `primary` or `replica`. | | `Exception` | OpenSearch exceptions, for example, `org.opensearch.index_not_found_exception`. | | `Indices` | The list of indexes in the request URL. | -| `HTTPRespCode` | The response code from OpenSearch, for example, `200`. | +| `HTTPRespCode` | The OpenSearch response code, for example, `200`. | | `MemType` | The memory type, for example, `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, or `Heap`. | | `DiskName` | The name of the disk, for example, `sda1`. | | `DestAddr` | The destination address, for example, `010015AC`. |