From 5beb6a10f6a04c3f824648f6ec4c51b1ccee27d3 Mon Sep 17 00:00:00 2001
From: Luiz Aoqui <luiz@hashicorp.com>
Date: Sat, 18 Jun 2022 00:36:58 -0400
Subject: [PATCH] docs: document plan rejection tracking

---
 website/content/docs/configuration/server.mdx | 27 +++++++++++++++++++
 .../docs/operations/metrics-reference.mdx     |  3 +--
 .../docs/operations/monitoring-nomad.mdx      | 22 ++++++++++++++-
 3 files changed, 49 insertions(+), 3 deletions(-)
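
Note for reviewers: a minimal sketch of the server configuration these docs
describe. The stanza and parameter names are taken from the documentation
added in this patch; the values `30` and `"30m"` are illustrative assumptions,
not defaults or recommendations.

```hcl
server {
  enabled = true

  # Hypothetical tuning values, for illustration only. The defaults
  # documented in this patch are node_threshold = 15 and node_window = "10m".
  plan_rejection_tracker {
    node_threshold = 30
    node_window    = "30m"
  }
}
```

Raising `node_threshold` makes the tracker more tolerant of occasional
rejections, while a longer `node_window` counts rejections over more history.
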
diff --git a/website/content/docs/configuration/server.mdx b/website/content/docs/configuration/server.mdx
index 7872913a064..f9e107b17b0 100644
--- a/website/content/docs/configuration/server.mdx
+++ b/website/content/docs/configuration/server.mdx
@@ -156,6 +156,10 @@ server {
   disallow this server from making any scheduling decisions. This defaults to
   the number of CPU cores.
 
+- `plan_rejection_tracker` <code>([PlanRejectionTracker](#plan_rejection_tracker-parameters))</code> -
+  Configuration for the plan rejection tracker that the Nomad leader uses to
+  track the history of plan rejections.
+
 - `raft_boltdb` - This is a nested object that allows configuring options for
   Raft's BoltDB based log store.
     - `no_freelist_sync` - Setting this to `true` will disable syncing the BoltDB
@@ -238,6 +242,28 @@ server {
   section for more information on the format of the string. This field is
   deprecated in favor of the [server_join stanza][server-join].
 
+### `plan_rejection_tracker` Parameters
+
+The leader plan rejection tracker can be adjusted to prevent evaluations from
+getting stuck due to always being scheduled to a client that may have an
+unexpected and undetected issue. Refer to [Monitoring
+Nomad][monitoring_nomad_progress] for more details.
+
+- `node_threshold` `(int: 15)` - The number of plan rejections for a node
+  within the `node_window` that causes the client to be set as ineligible.
+
+- `node_window` `(string: "10m")` - The time window during which plan
+  rejections for a node are counted.
+
+If you observe too many false positives (clients being marked as ineligible
+even if they don't present any problem), you may want to increase
+`node_threshold`.
+
+Conversely, if jobs are not being scheduled due to plan rejections for the
+same `node_id` and the client is not being set as ineligible, you can try
+increasing the `node_window` so more historical rejections are taken into
+account.
+
 ## `server` Examples
 
 ### Common Setup
@@ -331,5 +357,6 @@ server {
 [update-scheduler-config]: /api-docs/operator/scheduler#update-scheduler-configuration 'Scheduler Config'
 [bootstrapping a cluster]: /docs/faq#bootstrapping
 [rfc4648]: https://tools.ietf.org/html/rfc4648#section-5
+[monitoring_nomad_progress]: /docs/operations/monitoring-nomad#progress
 [`nomad operator keygen`]: /docs/commands/operator/keygen
 [search]: /docs/configuration/search
diff --git a/website/content/docs/operations/metrics-reference.mdx b/website/content/docs/operations/metrics-reference.mdx
index edc1286a857..8c29e159215 100644
--- a/website/content/docs/operations/metrics-reference.mdx
+++ b/website/content/docs/operations/metrics-reference.mdx
@@ -394,6 +394,7 @@ those listed in [Key Metrics](#key-metrics) above.
 | `nomad.nomad.plan.apply`                             | Time elapsed to apply a plan                                                   | Nanoseconds          | Summary | host                                                    |
 | `nomad.nomad.plan.evaluate`                          | Time elapsed to evaluate a plan                                                | Nanoseconds          | Summary | host                                                    |
 | `nomad.nomad.plan.node_rejected`                     | Number of times a node has had a plan rejected                                 | Integer              | Counter | host, node_id                                           |
+| `nomad.nomad.plan.rejection_tracker.node_score`      | Number of times a node has had a plan rejected within the tracker window       | Integer              | Gauge   | host, node_id                                           |
 | `nomad.nomad.plan.queue_depth`                       | Count of evals in the plan queue                                               | Integer              | Gauge   | host                                                    |
 | `nomad.nomad.plan.submit`                            | Time elapsed for `Plan.Submit` RPC call                                        | Nanoseconds          | Summary | host                                                    |
 | `nomad.nomad.plan.wait_for_index`                    | Time elapsed that planner waits for the raft index of the plan to be processed | Nanoseconds          | Summary | host                                                    |
@@ -481,5 +482,3 @@ Raft database metrics are emitted by the `raft-boltdb` library.
 
 [tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
 [s_port_plan_failure]: /s/port-plan-failure
-
-
diff --git a/website/content/docs/operations/monitoring-nomad.mdx b/website/content/docs/operations/monitoring-nomad.mdx
index 0dccbbefd49..dc053d7defb 100644
--- a/website/content/docs/operations/monitoring-nomad.mdx
+++ b/website/content/docs/operations/monitoring-nomad.mdx
@@ -149,10 +149,29 @@ While it is possible for these log lines to occur infrequently due to normal
 cluster conditions, they should not appear repeatedly and prevent the job from
 eventually running (look up the evaluation ID logged to find the job).
 
-If this log *does* appear repeatedly with the same `node_id` referenced, try
+Nomad tracks the history of plan rejections per client and marks a client as
+ineligible for scheduling if the number of rejections goes above a given
+threshold within a time window. When this happens, the following node event is
+registered:
+
+```
+Node marked as ineligible for scheduling due to multiple plan rejections
+```
+
+Along with the log line:
+
+```
+[WARN]  nomad.state_store: marking node as ineligible due to multiple plan rejections: node_id=67af2541-5e96-6f54-9095-11089d627626
+```
+
+If a client is marked as ineligible due to repeated plan rejections, try
 [draining] the node and shutting it down. Misconfigurations not caught by
 validation can cause nodes to enter this state: [#11830][gh-11830].
 
+If the `plan for node rejected` log *does* appear repeatedly with the same
+`node_id` referenced but the client is not being set as ineligible, you can try
+adjusting the [`plan_rejection_tracker`] configuration of servers.
+
 ### Performance
 
 The following metrics allow observing changes in throughput at the various
@@ -278,6 +297,7 @@ latency and packet loss for the [Serf] address.
 [metric-types]: /docs/telemetry/metrics#metric-types
 [metrics-api-endpoint]: /api-docs/metrics
 [prometheus-telem]: /docs/configuration/telemetry#prometheus
+[`plan_rejection_tracker`]: /docs/configuration/server#plan_rejection_tracker
 [serf]: /docs/configuration#serf-1
 [statsd-exporter]: https://github.com/prometheus/statsd_exporter
 [statsd-telem]: /docs/configuration/telemetry#statsd