Track plan rejection history and automatically mark clients as ineligible #13421
Conversation
Force-pushed from cf278ec to 3bd41ea
Force-pushed from 3bd41ea to 1456a23
Force-pushed from 1456a23 to a1a8bb9
Force-pushed from a1a8bb9 to 33c4b29
NodeThreshold int `hcl:"node_threshold"`

// NodeWindow is the time window used to track active plan rejections for
// nodes.
NodeWindow    time.Duration
NodeWindowHCL string `hcl:"node_window" json:"-"`
I'm not sure about these names, but I wanted to hedge against us potentially needing to track plan rejections for other things, like evals or jobs. Feel free to modify them.
command/agent/config.go
Outdated
PlanRejectionTracker: &PlanRejectionTracker{
	NodeThreshold: 15,
	NodeWindow:    10 * time.Minute,
},
I'm also not sure about these defaults. The workers will retry the plan with exponential back-off, so depending on the cluster load and eval queue size it may take a while for the node to reach the threshold.
So I picked these values based on a threshold number that is likely to indicate a problem (it could maybe even be a little higher?), not just normal plan rejections, and a time window that encompasses a significant part of the rejection history but that will probably have expired by the time operators have detected, drained, and restarted the client.
A short time window risks the threshold never being hit, while a large one risks a normal plan rejection triggering ineligibility since the node score can still be high.
Agreed. Let's try to dig up some real world nomad.nomad.plan.node_rejected metrics to guide this. Better to be a bit conservative.
If we moved this into SchedulerConfig it would ease online updating and we could be very conservative with the defaults. Not sure how aggressively we should jam things into there, but this is at least a scheduler-related parameter.
From our out of band discussion: let's leave this in the config file. If you're fiddling with it live, you can just mark bad nodes ineligible with existing commands and fiddle with these settings after your cluster is healthy.
Hopefully we can find good enough defaults (not to mention root cause fixes!) that no one needs to worry about this anyway.
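For reference, here is a sketch of how this could look in the agent configuration file, assuming the HCL block mirrors the PlanRejectionTracker struct above (block and attribute names follow the hcl tags; the values are the ones under discussion, not settled defaults):

```hcl
server {
  enabled = true

  # Opt-in plan rejection tracking; thresholds here are illustrative.
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 15
    node_window    = "10m"
  }
}
```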
Force-pushed from 33c4b29 to 2db5edc
@@ -11403,8 +11419,9 @@ type PlanResult struct {

 // IsNoOp checks if this plan result would do nothing
 func (p *PlanResult) IsNoOp() bool {
-	return len(p.NodeUpdate) == 0 && len(p.NodeAllocation) == 0 &&
-		len(p.DeploymentUpdates) == 0 && p.Deployment == nil
+	return len(p.IneligibleNodes) == 0 && len(p.NodeUpdate) == 0 &&
This new check will cause the no-op fast-path to be skipped in cases where it previously wouldn't, so it may be worth double-checking that nothing will break.
Reviewed all but the core algorithm. Will get that done ASAP.
This is looking great @lgfa29! I've left a few suggestions that could tighten it up.
@@ -149,10 +149,29 @@ While it is possible for these log lines to occur infrequently due to normal
 cluster conditions, they should not appear repeatedly and prevent the job from
 eventually running (look up the evaluation ID logged to find the job).

-If this log *does* appear repeatedly with the same `node_id` referenced, try
+Nomad tracks the history of plan rejections per client and will mark it as
We don't do this by default, so we should probably note that you can specifically enable this feature here.
Ah yes, the enabled = false was added later and I forgot to update the docs. Thanks!
nomad/plan_apply_node_tracker.go
Outdated
// NoopBadNodeTracker is a no-op implementation of bad node tracker that is
// used when tracking is disabled.
type NoopBadNodeTracker struct{}

func (n *NoopBadNodeTracker) Add(string)                               {}
func (n *NoopBadNodeTracker) EmitStats(time.Duration, <-chan struct{}) {}
func (n *NoopBadNodeTracker) IsBad(string) bool {
	return false
}
I love that we have a "can't possibly break anything" version of the struct for folks who haven't opted to enable it.
For the first implementation I was just setting the rate limiter to zero, but there was still quite a bit of code running in a critical path, so I chose this no-op approach instead. Can't have bugs if there is no code to run 😄
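For reference, the interface both implementations presumably satisfy, inferred from the no-op snippet above (the Add/IsBad split gets simplified later in this review, so the real code may differ):

```go
// BadNodeTracker is the interface shared by the cached tracker and the
// NoopBadNodeTracker above; the method set is inferred from that snippet and
// may not match the final code exactly.
type BadNodeTracker interface {
	Add(nodeID string)
	IsBad(nodeID string) bool
	EmitStats(period time.Duration, stopCh <-chan struct{})
}
```

When tracking is disabled the plan applier is handed the no-op implementation, so the disabled path has essentially no tracking code to run.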
nomad/plan_apply_node_tracker.go
Outdated
// Limit the number of nodes we report as bad to avoid mass assigning nodes
// as ineligible, but do it after Get to keep the cache entry fresh.
if !c.limiter.Allow() {
	logger.Info("returning false due to rate limiting")
Nitpick: info is kind of a weird log level because it's not clear how actionable it is. I try to reserve it for the top-level HTTP API and the agent startup. This strikes me as something that should be a warning.
I also wonder if we want to put the rate limit here or after we score? Checking the score seems ok to do (it's all cheap and in-memory and doesn't mutate state).
Yeah, good point. I think I will actually downgrade to Debug along with the others; there really isn't much value in knowing this since it should be a temporary state. With nomad monitor it's easy to consume the leader log at a lower log level if needed.
And nice catch on moving the rate limit. Calling Allow here will consume a token even if the score is below the threshold, which ultimately wastes it.
nomad/plan_apply_node_tracker.go
Outdated
// the time window.
func (c *CachedBadNodeTracker) IsBad(nodeID string) bool {
	logger := c.logger.With("node_id", nodeID)
	logger.Debug("checking if node is bad")
Having a debug log line for every branch of this method seems really noisy, and on a busy cluster they'll get interleaved with other logs. Maybe one log line at the caller would reduce the noise?
Makes sense. I reduced the number of log messages and dropped their level to Trace since this should just be doing its job 😅 If something is supposed to be happening but isn't, we can monitor the leader.
nomad/plan_apply_node_tracker.go
Outdated
logger.Debug("node not in cache") | ||
return false |
It looks like we should never hit this because calling Add first is an invariant of the design. We could return an error here instead, or, even better, change the construction of Add and IsBad so that we both add and check in the same call in plan_apply.go.
Yeah that makes sense. I modified the interface so it only exposes Add and EmitMetrics. isBad now takes the object in the cache that Add created or retrieved, so there's no need for this check anymore 👍
nomad/plan_apply_node_tracker.go
Outdated
// record adds a new entry to the stats history and returns the new score.
func (s *badNodeStats) record() int {
	now := time.Now()
	s.history = append(s.history, now)
	return s.score()
}
On large clusters I'd expect the LRU would age-out most of the tracked nodes eventually. But I think there's a pathological case here where a stable cluster with 50 or fewer nodes will continue to grow the size of the badNodeStats.history forever, so long as none of the nodes get plan rejections more frequently than the window allows.
It's a small bit of data (24 bytes each rejection), but we could eliminate it entirely if we dropped the old entries either when we called record() or when we called score().
Good catch! I made it so score will remove entries that happened before the window start, so they will never be considered anymore.
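Pulling together the suggestions from this part of the review (merging Add and IsBad, checking the score before consuming a rate limiter token, and pruning expired history inside score), a rough self-contained sketch could look like the following. A plain map stands in for the LRU cache, timestamps are still read internally with time.Now() (a later comment suggests passing them in explicitly), and the field names are assumptions rather than the actual implementation:

```go
package nomad

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// badNodeStats tracks plan rejection timestamps for a single node.
type badNodeStats struct {
	history []time.Time
	window  time.Duration
}

func newBadNodeStats(window time.Duration) *badNodeStats {
	return &badNodeStats{window: window}
}

// record adds a new rejection at the current time and returns the new score.
func (s *badNodeStats) record() int {
	s.history = append(s.history, time.Now())
	return s.score()
}

// score counts rejections inside the window and drops expired entries so the
// history cannot grow without bound.
func (s *badNodeStats) score() int {
	windowStart := time.Now().Add(-s.window)
	recent := s.history[:0]
	for _, ts := range s.history {
		if ts.After(windowStart) {
			recent = append(recent, ts)
		}
	}
	s.history = recent
	return len(s.history)
}

// CachedBadNodeTracker sketch: a plain map stands in for the LRU cache used
// in the PR, and the field names are assumptions.
type CachedBadNodeTracker struct {
	mu        sync.Mutex
	stats     map[string]*badNodeStats
	window    time.Duration
	threshold int
	limiter   *rate.Limiter
}

// Add records a plan rejection for the node and reports whether it should be
// marked ineligible. The rate limiter is only consulted after the score
// check, so nodes below the threshold never consume a token.
func (c *CachedBadNodeTracker) Add(nodeID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	s, ok := c.stats[nodeID]
	if !ok {
		s = newBadNodeStats(c.window)
		c.stats[nodeID] = s
	}

	if s.record() < c.threshold {
		return false
	}
	// Cap how many nodes can be reported as bad at once to avoid mass
	// ineligibility across the cluster.
	return c.limiter.Allow()
}
```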
func TestBadNodeStats_score(t *testing.T) {
	ci.Parallel(t)

	window := time.Duration(testutil.TestMultiplier()) * time.Second
	stats := newBadNodeStats(window)

	require.Equal(t, 0, stats.score())

	stats.record()
	stats.record()
	stats.record()
	require.Equal(t, 3, stats.score())

	time.Sleep(window / 2)
	stats.record()
	require.Equal(t, 4, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 1, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 0, stats.score())
}
Something that can make tests faster and more reliable is to have the score() and record() methods take a timestamp as their argument. The caller in Add/IsBad can pass it time.Now() so it works the same way, but it's testable by constructing timestamps in the test without needing to sleep.
Nice! I updated what I could, but I didn't find a way to do something like this with the rate limiter, so it still sleeps for 1s.
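For illustration, assuming record and score were changed to take an explicit time.Time (record(t time.Time) int and score(t time.Time) int on the badNodeStats type sketched above), the test could construct its own timestamps and avoid sleeping entirely:

```go
package nomad

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

func TestBadNodeStats_score_noSleep(t *testing.T) {
	window := time.Minute
	stats := newBadNodeStats(window)

	t0 := time.Now()
	require.Equal(t, 0, stats.score(t0))

	// Three rejections at t0 all fall inside the window.
	stats.record(t0)
	stats.record(t0)
	stats.record(t0)
	require.Equal(t, 3, stats.score(t0))

	// Half a window later a fourth rejection still sees the first three.
	t1 := t0.Add(window / 2)
	stats.record(t1)
	require.Equal(t, 4, stats.score(t1))

	// Just past a full window from t0, only the rejection at t1 remains.
	t2 := t0.Add(window + time.Second)
	require.Equal(t, 1, stats.score(t2))
}
```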
Simplify the interface for `BadNodeTracker` by merging the methods `Add` and `IsBad` since they are always called in tandem, and reduce the number and level of log messages generated. Also clean up expired records to avoid infinite growth when the cache entry never expires. Take an explicit timestamp to make tests faster and more reliable.
LGTM 👍
command/agent/config.go
Outdated
// tracker.
type PlanRejectionTracker struct {
	// Enabled controls if the plan rejection tracker is active or not.
	Enabled bool `hcl:"enabled"`
If we ever want to default this to true we'll have to change the type to a pointer (*bool) to differentiate unset vs explicitly-false. That shouldn't be a problem though, since unlike objects sent over RPCs we never have to worry about an agent on the new version sending a nil to an agent still expecting a true or false.
Good point, I may as well do the work now 😄
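A small sketch of why the pointer helps (the helper name is hypothetical): an unset value can take whatever default is chosen later, while an explicit enabled = false is always honored.

```go
// trackerEnabled resolves the effective value of the Enabled field.
// Hypothetical helper for illustration only.
func trackerEnabled(enabled *bool) bool {
	if enabled == nil {
		return false // current default; could flip to true later without breaking opt-outs
	}
	return *enabled
}
```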
f5936f0
nomad/state/state_store.go
Outdated
@@ -359,6 +370,22 @@ func (s *StateStore) UpsertPlanResults(msgType structs.MessageType, index uint64
 	txn := s.db.WriteTxnMsgT(msgType, index)
 	defer txn.Abort()

+	// Mark nodes as ineligible.
+	now := time.Now().Unix()
Since UpsertPlanResults is called by the FSM we cannot check wallclock time here, as it will differ on each server and create skew in their state stores. Since it's only informational there shouldn't be a negative functional impact, but it's still preferable to keep our FSMs deterministic on every server.
Fixed in fb2e761
Set the timestamp for a plan apply operation at request time to avoid non-deterministic operations in the FSM.
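A sketch of the idea with a stand-in request type (the real struct and field names differ): the wallclock read happens once, on the leader, when the Raft request is built, and the FSM only copies that value, so every server's state store ends up with an identical timestamp.

```go
// planApplyRequest stands in for the real Raft request type.
type planApplyRequest struct {
	IneligibleNodes []string
	UpdatedAt       int64
}

// newPlanApplyRequest stamps the timestamp at request time. The FSM never
// calls time.Now() itself; it only persists req.UpdatedAt, keeping the apply
// deterministic across servers.
func newPlanApplyRequest(ineligible []string) *planApplyRequest {
	return &planApplyRequest{
		IneligibleNodes: ineligible,
		UpdatedAt:       time.Now().Unix(),
	}
}
```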
Using a pointer allows us to differentiate between a non-set value and an explicit `false` if we decide to use `true` by default.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Plan rejections occur when the scheduler workers and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. By the time the final plan reaches the leader plan applier, it may no longer be valid due to concurrent scheduling taking up the intended resources. In these situations the plan applier notifies the worker that the plan was rejected and that it should refresh its state before trying again.
In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even though it is always rejected.
While the root cause is still unknown, this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration.
Closes #13017
Closes #12920
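To tie the pieces together, a rough sketch of the flow described above, reusing the CachedBadNodeTracker sketched earlier in this conversation (the function name and plumbing are illustrative; the real plan applier code differs):

```go
// trackRejections records each rejected node and collects the ones that
// crossed the threshold into the plan result, so they can be marked
// ineligible when the result is applied.
func trackRejections(tracker *CachedBadNodeTracker, result *structs.PlanResult, rejected []string) {
	for _, nodeID := range rejected {
		if tracker.Add(nodeID) {
			result.IneligibleNodes = append(result.IneligibleNodes, nodeID)
		}
	}
}
```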
Note for reviewers: since we can't yet reliably reproduce this bug, the way I tested this was by applying this patch, which causes a plan to be rejected if it is evaluated by a server running with the CRASH env var set and the plan is for a client whose name starts with crash.
So, after applying the patch, start a 3-server cluster with one of the servers having the CRASH env var set and make sure that server becomes the leader. Start a client with a name starting with crash and run a job.
Monitoring the logs you should see the plan rejection messages and, after a few minutes, the client will become ineligible. You can then start a client without crash in the name to verify that job scheduling proceeds to the new client.