Introduce max headroom for disk watermark stages (#88639)
Introduce max headroom settings for the low, high, and flood disk watermark stages, similar to the existing max headroom setting for the flood stage of the frozen tier. Introduce new max headrooms in HealthMetadata and in ReactiveStorageDeciderService. Add multiple tests in DiskThresholdDeciderUnitTests, DiskThresholdDeciderTests and DiskThresholdMonitorTests. Also add addition, subtraction, and min operations to ByteSizeValue.
kingherc authored Sep 19, 2022
1 parent fa654b9 commit 34471b1
Showing 22 changed files with 2,067 additions and 392 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/88639.yaml
@@ -0,0 +1,6 @@
pr: 88639
summary: Introduce max headroom for disk watermark stages
area: Infra/Settings
type: enhancement
issues:
- 81406
46 changes: 28 additions & 18 deletions docs/reference/how-to/fix-common-cluster-issues.asciidoc
@@ -51,8 +51,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
+"cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
"cluster.routing.allocation.disk.watermark.high": "95%",
-"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+"cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
+"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
}
}
@@ -82,8 +87,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
+"cluster.routing.allocation.disk.watermark.low.max_headroom": null,
"cluster.routing.allocation.disk.watermark.high": null,
-"cluster.routing.allocation.disk.watermark.flood_stage": null
+"cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+"cluster.routing.allocation.disk.watermark.flood_stage": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
}
}
----
@@ -674,8 +684,8 @@ for tips on diagnosing and preventing them.
[[task-queue-backlog]]
=== Task queue backlog

-A backlogged task queue can prevent tasks from completing and
-put the cluster into an unhealthy state.
+A backlogged task queue can prevent tasks from completing and
+put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.

@@ -685,11 +695,11 @@ and long running tasks can all contribute to a backlogged task queue.

**Check the thread pool status**

-A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.
+A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.

-You can use the <<cat-thread-pool,cat thread pool API>> to
+You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed.
+how many tasks are queued, how many have been rejected, and how many have completed.

[source,console]
----
@@ -698,9 +708,9 @@ GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,comple

**Inspect the hot threads on each node**

-If a particular thread pool queue is backed up,
-you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
-to determine if the thread has sufficient
+If a particular thread pool queue is backed up,
+you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
+to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.

[source,console]
@@ -710,9 +720,9 @@ GET /_nodes/hot_threads

**Look for long running tasks**

-Long-running tasks can also cause a backlog.
-You can use the <<tasks,task management>> API to get information about the tasks that are running.
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
+Long-running tasks can also cause a backlog.
+You can use the <<tasks,task management>> API to get information about the tasks that are running.
+Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.

[source,console]
----
@@ -723,16 +733,16 @@ GET /_tasks?filter_path=nodes.*.tasks
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

-**Increase available resources**
+**Increase available resources**

-If tasks are progressing slowly and the queue is backing up,
-you might need to take steps to <<reduce-cpu-usage>>.
+If tasks are progressing slowly and the queue is backing up,
+you might need to take steps to <<reduce-cpu-usage>>.

In some cases, increasing the thread pool size might help.
For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.

**Cancel stuck tasks**

-If you find the active task's hot thread isn't progressing and there's a backlog,
-consider canceling the task.
+If you find the active task's hot thread isn't progressing and there's a backlog,
+consider canceling the task.
12 changes: 9 additions & 3 deletions docs/reference/index-modules/blocks.asciidoc
@@ -35,9 +35,15 @@ the index itself - can increase the index size over time. When
not permitted. However, deleting the index itself releases the read-only index
block and makes resources available almost immediately.
+
-IMPORTANT: {es} adds and removes the read-only index block automatically when
-the disk utilization falls below the high watermark, controlled by
-<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>.
+IMPORTANT: {es} adds the read-only index block automatically when the disk
+utilization exceeds the flood stage watermark, controlled by the
+<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage>>
+and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.flood_stage.max_headroom>>
+settings, and removes the block automatically when the disk utilization falls
+under the high watermark, controlled by the
+<<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.high>>
+and <<cluster-routing-flood-stage,cluster.routing.allocation.disk.watermark.high.max_headroom>>
+settings.
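
Reviewer note: the add/remove behavior described in the new IMPORTANT text is a hysteresis. The block is added once usage passes the flood stage and is lifted only after usage drops back under the high watermark. The Java sketch below illustrates that idea only. The class `FloodStageBlockSketch`, its method names, and the choice to express both thresholds as free bytes are hypothetical and are not taken from the Elasticsearch code base; the sketch also tracks a single node, whereas the documentation describes blocks applied per index.

```java
// Hypothetical illustration of the add/remove hysteresis; not Elasticsearch code.
final class FloodStageBlockSketch {
    private final long floodStageFreeBytes; // block is added when free space drops below this
    private final long highFreeBytes;       // block is removed when free space rises above this
    private boolean blocked;

    FloodStageBlockSketch(long floodStageFreeBytes, long highFreeBytes) {
        // The flood stage is the stricter stage, so it tolerates less free space.
        if (floodStageFreeBytes > highFreeBytes) {
            throw new IllegalArgumentException("flood stage must require no more free space than high");
        }
        this.floodStageFreeBytes = floodStageFreeBytes;
        this.highFreeBytes = highFreeBytes;
    }

    /** Feeds one disk usage observation and returns whether the block is in place afterwards. */
    boolean onFreeBytes(long freeBytes) {
        if (freeBytes < floodStageFreeBytes) {
            blocked = true;   // usage exceeded the flood stage
        } else if (freeBytes > highFreeBytes) {
            blocked = false;  // usage fell back under the high watermark
        }
        // Between the two thresholds the previous state is kept.
        return blocked;
    }
}
```

Between the two thresholds the state does not change, which is why the block can stay in place even after usage has dropped below the flood stage.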

`index.blocks.read`::

22 changes: 19 additions & 3 deletions docs/reference/modules/cluster/disk_allocator.asciidoc
@@ -75,13 +75,23 @@ Defaults to `true`. Set to `false` to disable the disk allocation decider. Upon
Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g., `0.85`. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
// end::cluster-routing-watermark-low-tag[]

+`cluster.routing.allocation.disk.watermark.low.max_headroom`::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the low watermark (in case of a percentage/ratio value).
+Defaults to 200GB when `cluster.routing.allocation.disk.watermark.low` is not explicitly set.
+This caps the amount of free space required.

[[cluster-routing-watermark-high]]
// tag::cluster-routing-watermark-high-tag[]
`cluster.routing.allocation.disk.watermark.high` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g., `0.9`. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
// end::cluster-routing-watermark-high-tag[]

+`cluster.routing.allocation.disk.watermark.high.max_headroom`::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the high watermark (in case of a percentage/ratio value).
+Defaults to 150GB when `cluster.routing.allocation.disk.watermark.high` is not explicitly set.
+This caps the amount of free space required.

`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
(<<static-cluster-setting,Static>>)
In earlier releases, the default behaviour was to disregard disk watermarks for a single
@@ -97,8 +107,14 @@ is now `true`. The setting will be removed in a future release.
(<<dynamic-cluster-setting,Dynamic>>)
Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g., `0.95`, or an absolute byte value.

+`cluster.routing.allocation.disk.watermark.flood_stage.max_headroom`::
+(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
+Defaults to 100GB when
+`cluster.routing.allocation.disk.watermark.flood_stage` is not explicitly set.
+This caps the amount of free space required.
+
NOTE: You cannot mix the usage of percentage/ratio values and byte values within
-the watermark settings. Either all values are set to percentage/ratio values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.
+the watermark settings. Either all values are set to percentage/ratio values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold. A similar check is done for the max headroom values.

An example of resetting the read-only index block on the `my-index-000001` index:

@@ -122,8 +138,8 @@ Controls the flood stage watermark for dedicated frozen nodes, which defaults to

`cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom` {ess-icon}::
(<<dynamic-cluster-setting,Dynamic>>)
-Controls the max headroom for the flood stage watermark for dedicated frozen
-nodes. Defaults to 20GB when
+Controls the max headroom for the flood stage watermark (in case of a
+percentage/ratio value) for dedicated frozen nodes. Defaults to 20GB when
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` is not explicitly
set. This caps the amount of free space required on dedicated frozen nodes.
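
Reviewer note: the phrase "caps the amount of free space required" can be read as taking the smaller of two values: the free space implied by the percentage watermark and the max headroom. The Java sketch below works through that reading with the documented low watermark defaults (`85%`, 200GB). The class `WatermarkHeadroomSketch`, the method `requiredFreeBytes`, and the convention that a negative headroom means "no cap" are assumptions made for illustration, not the Elasticsearch implementation.

```java
// Hypothetical illustration of how a max headroom caps a percentage watermark.
final class WatermarkHeadroomSketch {
    static final long GB = 1024L * 1024L * 1024L;

    /**
     * Returns the free space, in bytes, that a disk must keep to stay under a
     * percentage watermark. A negative maxHeadroomBytes means "no cap" (an
     * assumption of this sketch).
     */
    static long requiredFreeBytes(long totalBytes, int watermarkPercent, long maxHeadroomBytes) {
        // Free space implied by the percentage alone, e.g. 15% of the disk for an 85% watermark.
        long fromPercent = totalBytes * (100 - watermarkPercent) / 100;
        if (maxHeadroomBytes < 0) {
            return fromPercent;
        }
        // The headroom never raises the requirement; it only caps it.
        return Math.min(fromPercent, maxHeadroomBytes);
    }
}
```

Under these assumptions, a 1000 GB disk with an 85% watermark needs 150 GB free, so a 200 GB headroom has no effect. A 10000 GB disk would need 1500 GB free from the percentage alone, and the headroom brings that down to 200 GB.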

Original file line number Diff line number Diff line change
@@ -46,8 +46,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
+"cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
"cluster.routing.allocation.disk.watermark.high": "95%",
-"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+"cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
+"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
+"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
}
}
@@ -77,8 +82,13 @@ PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
+"cluster.routing.allocation.disk.watermark.low.max_headroom": null,
"cluster.routing.allocation.disk.watermark.high": null,
-"cluster.routing.allocation.disk.watermark.flood_stage": null
+"cluster.routing.allocation.disk.watermark.high.max_headroom": null,
+"cluster.routing.allocation.disk.watermark.flood_stage": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
+"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
}
}
----
----
Original file line number Diff line number Diff line change
@@ -16,8 +16,8 @@ the operation and returns an error.
The most common causes of high CPU usage and their solutions.

<<high-jvm-memory-pressure,High JVM memory pressure>>::
-High JVM memory usage can degrade cluster performance and trigger circuit
-breaker errors.
+High JVM memory usage can degrade cluster performance and trigger circuit
+breaker errors.

<<red-yellow-cluster-status,Red or yellow cluster status>>::
A red or yellow cluster status indicates one or more shards are missing or
@@ -29,8 +29,8 @@ When {es} rejects a request, it stops the operation and returns an error with a
`429` response code.

<<task-queue-backlog,Task queue backlog>>::
-A backlogged task queue can prevent tasks from completing and put the cluster
-into an unhealthy state.
+A backlogged task queue can prevent tasks from completing and put the cluster
+into an unhealthy state.

<<diagnose-unassigned-shards,Diagnose unassigned shards>>::
There are multiple reasons why shards might get unassigned, ranging from
@@ -47,4 +47,4 @@ include::common-issues/high-jvm-memory-pressure.asciidoc[]
include::common-issues/red-yellow-cluster-status.asciidoc[]
include::common-issues/rejected-requests.asciidoc[]
include::common-issues/task-queue-backlog.asciidoc[]
-include::common-issues/diagnose-unassigned-shards.asciidoc[]
+include::common-issues/diagnose-unassigned-shards.asciidoc[]
Original file line number Diff line number Diff line change
@@ -34,8 +34,11 @@
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING;
+import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING;
import static org.elasticsearch.cluster.routing.allocation.DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING;
import static org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider.CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING;
@@ -92,18 +95,18 @@ public void testRerouteOccursOnDiskPassingHighWatermark() throws Exception {
clusterInfoService.setDiskUsageFunctionAndRefresh((discoveryNode, fsInfoPath) -> setDiskUsage(fsInfoPath, 100, between(10, 100)));

final boolean watermarkBytes = randomBoolean(); // we have to consistently use bytes or percentage for the disk watermark settings
-        assertAcked(
-            client().admin()
-                .cluster()
-                .prepareUpdateSettings()
-                .setPersistentSettings(
-                    Settings.builder()
-                        .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "0b" : "100%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms")
-                )
-        );
+        Settings.Builder settings = Settings.builder()
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "0b" : "100%")
+            .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "0ms");
+        if (watermarkBytes == false && randomBoolean()) {
+            String headroom = randomIntBetween(10, 100) + "b";
+            settings = settings.put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), headroom)
+                .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), headroom)
+                .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), headroom);
+        }
+        assertAcked(client().admin().cluster().prepareUpdateSettings().setPersistentSettings(settings));
// Create an index with 10 shards so we can check allocation for it
assertAcked(prepareCreate("test").setSettings(Settings.builder().put("number_of_shards", 10).put("number_of_replicas", 0)));
ensureGreen("test");
@@ -172,18 +175,17 @@ public void testAutomaticReleaseOfIndexBlock() throws Exception {
clusterInfoService.setDiskUsageFunctionAndRefresh((discoveryNode, fsInfoPath) -> setDiskUsage(fsInfoPath, 100, between(15, 100)));

final boolean watermarkBytes = randomBoolean(); // we have to consistently use bytes or percentage for the disk watermark settings
-        assertAcked(
-            client().admin()
-                .cluster()
-                .prepareUpdateSettings()
-                .setPersistentSettings(
-                    Settings.builder()
-                        .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "5b" : "95%")
-                        .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "150ms")
-                )
-        );
+        Settings.Builder builder = Settings.builder()
+            .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING.getKey(), watermarkBytes ? "10b" : "90%")
+            .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING.getKey(), watermarkBytes ? "5b" : "95%")
+            .put(CLUSTER_ROUTING_ALLOCATION_REROUTE_INTERVAL_SETTING.getKey(), "150ms");
+        if (watermarkBytes == false) {
+            builder = builder.put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_MAX_HEADROOM_SETTING.getKey(), "10b")
+                .put(CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_MAX_HEADROOM_SETTING.getKey(), "10b")
+                .put(CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_MAX_HEADROOM_SETTING.getKey(), "5b");
+        }
+        assertAcked(client().admin().cluster().prepareUpdateSettings().setPersistentSettings(builder));

// Create an index with 6 shards so we can check allocation for it
prepareCreate("test").setSettings(Settings.builder().put("number_of_shards", 6).put("number_of_replicas", 0)).get();