Split common cluster issues page into separate pages (elastic#88495)

(cherry picked from commit 26cc873)
lockewritesdocs · Jul 19, 2022 · 3cfd291
1 parent 626c3d6
Showing 8 changed files with 747 additions and 723 deletions.
diff --git a/docs/reference/troubleshooting/common-issues/circuit-breaker-errors.asciidoc b/docs/reference/troubleshooting/common-issues/circuit-breaker-errors.asciidoc
@@ -0,0 +1,95 @@
+[[circuit-breaker-errors]]
+=== Circuit breaker errors
+
+{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
+of JVM heap memory. If {es} estimates an operation would exceed a circuit
+breaker's limit, it stops the operation and returns an error.
+
+By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
+95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
+memory pressure if usage consistently exceeds 85%.
+
+[discrete]
+[[diagnose-circuit-breaker-errors]]
+==== Diagnose circuit breaker errors
+
+**Error messages**
+
+If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
+status code.
+
+[source,js]
+----
+{
+  "error": {
+    "type": "circuit_breaking_exception",
+    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
+    "bytes_wanted": 123848638,
+    "bytes_limit": 123273216,
+    "durability": "TRANSIENT"
+  },
+  "status": 429
+}
+----
+// NOTCONSOLE
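As a rough illustration, a client could pull the key numbers out of such an error before deciding whether to retry. The sketch below assumes the response body is valid JSON with the fields shown above; the function name `summarize_breaker_error` is hypothetical and not part of any {es} client.

```python
import json


def summarize_breaker_error(body: str) -> dict:
    """Extract the key numbers from a circuit breaker error response body."""
    payload = json.loads(body)
    error = payload.get("error", {})
    if error.get("type") != "circuit_breaking_exception":
        raise ValueError("response is not a circuit breaker error")
    wanted = error["bytes_wanted"]
    limit = error["bytes_limit"]
    return {
        "status": payload.get("status"),
        # How far the estimated request size exceeds the breaker limit.
        "bytes_over_limit": wanted - limit,
        # Field value copied from the response; TRANSIENT is the value shown above.
        "transient": error.get("durability") == "TRANSIENT",
    }
```

For the example error above, `bytes_over_limit` would be 123848638 - 123273216 = 575422 bytes.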
+
+{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
+is helpful when automated processes, such as allocation, trigger a circuit
+breaker.
+
+[source,txt]
+----
+Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
+----
+
+**Check JVM memory usage**
+
+If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
+the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
+page, click **Nodes**. The **JVM Heap** column lists the current memory usage
+for each node.
+
+You can also use the <<cat-nodes,cat nodes API>> to get the current
+`heap.percent` for each node.
+
+[source,console]
+----
+GET _cat/nodes?v=true&h=name,node*,heap*
+----
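The 85% guideline mentioned earlier could be checked automatically against this output. The following is a minimal sketch that assumes the text response starts with a header row (as it does with `v=true`) and that no column value contains whitespace. The function name and the sample column layout in the usage note are illustrative assumptions.

```python
def nodes_over_heap_threshold(cat_output: str, threshold: int = 85) -> list[str]:
    """Return the names of nodes whose heap.percent exceeds the threshold."""
    rows = [line.split() for line in cat_output.splitlines() if line.strip()]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    # Locate columns by name rather than by position, so extra columns are harmless.
    name_col = header.index("name")
    heap_col = header.index("heap.percent")
    return [row[name_col] for row in data if int(row[heap_col]) > threshold]
```

Given a hypothetical response with `node-1` at 88% and `node-2` at 42%, the function would return only `node-1`.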
+
+To get the JVM memory usage for each circuit breaker, use the
+<<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET _nodes/stats/breaker
+----
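To see which breakers are closest to tripping, the response can be reduced to a usage ratio per breaker. This sketch assumes the response has already been parsed into a dictionary and that each breaker entry carries `limit_size_in_bytes` and `estimated_size_in_bytes`. The function name is hypothetical.

```python
def breaker_usage(stats: dict) -> list[tuple[str, str, float]]:
    """List (node name, breaker name, estimated/limit ratio), highest ratio first."""
    usage = []
    for node in stats.get("nodes", {}).values():
        node_name = node.get("name", "unknown")
        for breaker_name, info in node.get("breakers", {}).items():
            limit = info.get("limit_size_in_bytes", 0)
            if limit > 0:  # Skip breakers without a positive limit.
                ratio = info.get("estimated_size_in_bytes", 0) / limit
                usage.append((node_name, breaker_name, ratio))
    return sorted(usage, key=lambda item: item[2], reverse=True)
```

Breakers with a ratio near 1.0 are the likeliest source of upcoming errors.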
+
+[discrete]
+[[prevent-circuit-breaker-errors]]
+==== Prevent circuit breaker errors
+
+**Reduce JVM memory pressure**
+
+High JVM memory pressure often causes circuit breaker errors. See
+<<high-jvm-memory-pressure>>.
+
+**Avoid using fielddata on `text` fields**
+
+For high-cardinality `text` fields, fielddata can use a large amount of JVM
+memory. To avoid this, {es} disables fielddata on `text` fields by default. If
+you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
+circuit breaker>>, consider disabling it and using a `keyword` field instead.
+See <<fielddata>>.
+
+**Clear the fielddata cache**
+
+If you've triggered the fielddata circuit breaker and can't disable fielddata,
+use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
+This may disrupt any in-flight searches that use fielddata.
+
+[source,console]
+----
+POST _cache/clear?fielddata=true
+----
+// TEST[s/^/PUT my-index\n/]
diff --git a/docs/reference/troubleshooting/common-issues/disk-usage-exceeded.asciidoc b/docs/reference/troubleshooting/common-issues/disk-usage-exceeded.asciidoc
@@ -0,0 +1,84 @@
+[[disk-usage-exceeded]]
+=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block
+
+This error indicates a data node is critically low on disk space and has reached
+the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
+a full disk, when a node reaches this watermark, {es} blocks writes to any index
+with a shard on the node. If the block affects related system indices, {kib} and
+other {stack} features may become unavailable.
+
+{es} will automatically remove the write block when the affected node's disk
+usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
+achieve this, {es} automatically moves some of the affected node's shards to
+other nodes in the same data tier.
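The relationship between the watermarks can be sketched as a simple classification. This is an illustration only: it assumes percentage-based watermarks and uses the default values of 85%, 90%, and 95% as parameter defaults, whereas a real cluster may use different percentages or absolute byte values. The function name is hypothetical.

```python
def disk_watermark_state(
    used_percent: float,
    low: float = 85.0,
    high: float = 90.0,
    flood_stage: float = 95.0,
) -> str:
    """Classify a node's disk usage against the three disk watermarks."""
    if used_percent >= flood_stage:
        return "flood_stage"  # Writes are blocked on indices with a shard on the node.
    if used_percent >= high:
        return "high"  # Shards are moved off the node.
    if used_percent >= low:
        return "low"  # No new shards are allocated to the node.
    return "ok"
```

A node that reached the flood stage keeps its write block until its usage drops back below the high watermark, so a node in the `high` state may still be blocked.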
+
+To verify that shards are moving off the affected node, use the <<cat-shards,cat
+shards API>>.
+
+[source,console]
+----
+GET _cat/shards?v=true
+----
+
+If shards remain on the node, use the <<cluster-allocation-explain,cluster
+allocation explanation API>> to get an explanation for their allocation status.
+
+[source,console]
+----
+GET _cluster/allocation/explain
+{
+  "index": "my-index",
+  "shard": 0,
+  "primary": false,
+  "current_node": "my-node"
+}
+----
+// TEST[s/^/PUT my-index\n/]
+// TEST[s/"primary": false,/"primary": false/]
+// TEST[s/"current_node": "my-node"//]
+
+To immediately restore write operations, you can temporarily increase the disk
+watermarks and remove the write block.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.high": "95%",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+  }
+}
+
+PUT */_settings?expand_wildcards=all
+{
+  "index.blocks.read_only_allow_delete": null
+}
+----
+// TEST[s/^/PUT my-index\n/]
+
+As a long-term solution, we recommend you add nodes to the affected data tiers
+or upgrade existing nodes to increase disk space. To free up additional disk
+space, you can delete unneeded indices using the <<indices-delete-index,delete
+index API>>.
+
+[source,console]
+----
+DELETE my-index
+----
+// TEST[s/^/PUT my-index\n/]
+
+When a long-term solution is in place, reset or reconfigure the disk watermarks.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.high": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null
+  }
+}
+----
diff --git a/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc b/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
@@ -0,0 +1,100 @@
+[[high-cpu-usage]]
+=== High CPU usage
+
+{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
+concurrent operations. High CPU usage typically means one or more thread pools
+are running low on available threads.
+
+If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
+related to the thread pool. For example, if the `search` thread pool is
+depleted, {es} will reject search requests until more threads are available.
+
+[discrete]
+[[diagnose-high-cpu-usage]]
+==== Diagnose high CPU usage
+
+**Check CPU usage**
+
+include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
+
+**Check hot threads**
+
+If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
+threads API>> to check for resource-intensive threads running on the node.
+
+[source,console]
+----
+GET _nodes/my-node,my-other-node/hot_threads
+----
+// TEST[s/\/my-node,my-other-node//]
+
+This API returns a breakdown of any hot threads in plain text.
+
+[discrete]
+[[reduce-cpu-usage]]
+==== Reduce CPU usage
+
+The following tips outline the most common causes of high CPU usage and their
+solutions.
+
+**Scale your cluster**
+
+Heavy indexing and search loads can deplete smaller thread pools. To better
+handle heavy workloads, add more nodes to your cluster or upgrade your existing
+nodes to increase capacity.
+
+**Spread out bulk requests**
+
+While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
+or <<search-multi-search,multi-search>> requests still require CPU resources. If
+possible, submit smaller requests and allow more time between them.
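One way to submit smaller requests is to split the documents into fixed-size batches on the client side. The sketch below shows only the batching step. Sending each batch as a bulk request and pausing between batches is left to the caller, and the helper name `chunked` is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items per pass; an empty batch ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

A caller might loop over `chunked(documents, 500)`, send one bulk request per batch, and sleep briefly between iterations. The batch size of 500 is an arbitrary placeholder and would need tuning for the workload.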
+
+**Cancel long-running searches**
+
+Long-running searches can block threads in the `search` thread pool. To check
+for these searches, use the <<tasks,task management API>>.
+
+[source,console]
+----
+GET _tasks?actions=*search&detailed
+----
+
+The response's `description` contains the search request and its queries.
+`running_time_in_nanos` shows how long the search has been running.
+
+[source,console-result]
+----
+{
+  "nodes" : {
+    "oTUltX4IQMOUUVeiohTt8A" : {
+      "name" : "my-node",
+      "transport_address" : "127.0.0.1:9300",
+      "host" : "127.0.0.1",
+      "ip" : "127.0.0.1:9300",
+      "tasks" : {
+        "oTUltX4IQMOUUVeiohTt8A:464" : {
+          "node" : "oTUltX4IQMOUUVeiohTt8A",
+          "id" : 464,
+          "type" : "transport",
+          "action" : "indices:data/read/search",
+          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
+          "start_time_in_millis" : 4081771730000,
+          "running_time_in_nanos" : 13991383,
+          "cancellable" : true
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip: no way to get tasks]
+
+To cancel a search and free up resources, use the API's `_cancel` endpoint.
+
+[source,console]
+----
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
+----
+
+For additional tips on how to track and avoid resource-intensive searches, see
+<<avoid-expensive-searches,Avoid expensive searches>>.
diff --git a/docs/reference/troubleshooting/common-issues/high-jvm-memory-pressure.asciidoc b/docs/reference/troubleshooting/common-issues/high-jvm-memory-pressure.asciidoc
@@ -0,0 +1,95 @@
+[[high-jvm-memory-pressure]]
+=== High JVM memory pressure
+
+High JVM memory usage can degrade cluster performance and trigger
+<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
+taking steps to reduce memory pressure if a node's JVM memory usage consistently
+exceeds 85%.
+
+[discrete]
+[[diagnose-high-jvm-memory-pressure]]
+==== Diagnose high JVM memory pressure
+
+**Check JVM memory pressure**
+
+include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]
+
+**Check garbage collection logs**
+
+As memory usage increases, garbage collection becomes more frequent and takes
+longer. You can track the frequency and length of garbage collection events in
+<<logging,`elasticsearch.log`>>. For example, the following event states that
+{es} spent more than 50% (21 seconds) of the last 40 seconds performing garbage
+collection.
+
+[source,log]
+----
+[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
+----
+
+[discrete]
+[[reduce-jvm-memory-pressure]]
+==== Reduce JVM memory pressure
+
+**Reduce your shard count**
+
+Every shard uses memory. In most cases, a small set of large shards uses fewer
+resources than many small shards. For tips on reducing your shard count, see
+<<size-your-shards>>.
+
+[[avoid-expensive-searches]]
+**Avoid expensive searches**
+
+Expensive searches can use large amounts of memory. To better track expensive
+searches on your cluster, enable <<index-modules-slowlog,slow logs>>.
+
+Expensive searches may have a large <<paginate-search-results,`size` argument>>,
+use aggregations with a large number of buckets, or include
+<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
+searches, consider the following setting changes:
+
+* Lower the `size` limit using the
+<<index-max-result-window,`index.max_result_window`>> index setting.
+
+* Decrease the maximum number of allowed aggregation buckets using the
+<<search-settings-max-buckets,`search.max_buckets`>> cluster setting.
+
+* Disable expensive queries using the
+<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
+setting.
+
+[source,console]
+----
+PUT _settings
+{
+  "index.max_result_window": 5000
+}
+
+PUT _cluster/settings
+{
+  "persistent": {
+    "search.max_buckets": 20000,
+    "search.allow_expensive_queries": false
+  }
+}
+----
+// TEST[s/^/PUT my-index\n/]
+
+**Prevent mapping explosions**
+
+Defining too many fields or nesting fields too deeply can lead to
+<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
+To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
+settings>> to limit the number of field mappings.
+
+**Spread out bulk requests**
+
+While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
+or <<search-multi-search,multi-search>> requests can still create high JVM
+memory pressure. If possible, submit smaller requests and allow more time
+between them.
+
+**Upgrade node memory**
+
+Heavy indexing and search loads can cause high JVM memory pressure. To better
+handle heavy workloads, upgrade your nodes to increase their memory capacity.