
Track plan rejection history and automatically mark clients as ineligible #13421

Merged
merged 14 commits · Jul 12, 2022

Conversation

lgfa29
Contributor

@lgfa29 lgfa29 commented Jun 17, 2022

Plan rejections occur when a scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to concurrent scheduling taking up the intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown, this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.
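
For reference, enabling and tuning this in the server configuration would look roughly like the sketch below; the keys come from the `hcl` tags added in this PR, the block name is assumed to mirror the `PlanRejectionTracker` struct, and the values are illustrative rather than the shipped defaults:

```hcl
server {
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 15
    node_window    = "10m"
  }
}
```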

Closes #13017
Closes #12920


Note for reviewers: since we can't yet reliably reproduce this bug, the way I tested this was by applying a patch that causes a plan to be rejected if it is evaluated by a server running with the env var CRASH set and the plan is for a client whose name starts with crash.

So, after applying the patch, start a 3-server cluster with the CRASH env var set on one of the servers and make sure that server becomes the leader. Start a client whose name starts with crash and run a job.

Monitoring the logs, you should see the plan rejection messages and, after a few minutes, the client will become ineligible. You can then start a client without crash in its name to verify that job scheduling proceeds on the new client.

Comment on lines +560 to +565
	NodeThreshold int `hcl:"node_threshold"`

	// NodeWindow is the time window used to track active plan rejections for
	// nodes.
	NodeWindow    time.Duration
	NodeWindowHCL string `hcl:"node_window" json:"-"`
Contributor Author

I'm not sure about these names, but I wanted to hedge against us potentially needing to track plan rejections for other things, like evals or jobs. Feel free to modify them.
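
As background for the naming discussion, the `NodeWindow`/`NodeWindowHCL` pair follows the usual pattern of decoding the HCL string first and parsing it into a `time.Duration` afterwards. A minimal self-contained sketch of that pattern (not the actual `config_parse.go` code):

```go
package main

import (
	"fmt"
	"time"
)

// PlanRejectionTracker mirrors the fields shown above: the HCL string is
// decoded into NodeWindowHCL and then parsed into NodeWindow.
type PlanRejectionTracker struct {
	NodeThreshold int
	NodeWindow    time.Duration
	NodeWindowHCL string
}

// finalize parses the human-readable window (e.g. "10m") into a duration.
func (t *PlanRejectionTracker) finalize() error {
	if t.NodeWindowHCL == "" {
		return nil
	}
	d, err := time.ParseDuration(t.NodeWindowHCL)
	if err != nil {
		return fmt.Errorf("invalid node_window %q: %w", t.NodeWindowHCL, err)
	}
	t.NodeWindow = d
	return nil
}

func main() {
	t := &PlanRejectionTracker{NodeThreshold: 15, NodeWindowHCL: "10m"}
	if err := t.finalize(); err != nil {
		panic(err)
	}
	fmt.Println(t.NodeWindow) // 10m0s
}
```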

Comment on lines 1032 to 1035
PlanRejectionTracker: &PlanRejectionTracker{
	NodeThreshold: 15,
	NodeWindow:    10 * time.Minute,
},
Contributor Author

I'm also not sure about these defaults. The workers will retry the plan with exponential back-off, so depending on the cluster load and eval queue size it may take a while for the node to reach the threshold.

So I picked these values based on a threshold number that is likely to indicate a problem (it could maybe even be a little higher?), not just normal plan rejections, and a time window that encompasses a significant part of the rejection history but that will probably have expired by the time operators have detected, drained, and restarted the client.

A short time window risks the threshold never being hit, and a large one risks a normal plan rejection triggering ineligibility since the node's score can remain high.

Member

Agreed. Let's try to dig up some real world nomad.nomad.plan.node_rejected metrics to guide this. Better to be a bit conservative.

If we moved this into SchedulerConfig it would ease online updating and we could be very conservative with the defaults. Not sure how aggressively we should jam things into there, but this is at least a scheduler-related parameter.

Member

@schmichael schmichael Jun 28, 2022

From our out-of-band discussion: let's leave this in the config file. If you're fiddling with it live, you can just mark bad nodes ineligible with existing commands and fiddle with these settings after your cluster is healthy.

Hopefully we can find good enough defaults (not to mention root cause fixes!) that no one needs to worry about this anyway.

@lgfa29 lgfa29 requested a review from schmichael June 18, 2022 06:10
@lgfa29 lgfa29 marked this pull request as ready for review June 18, 2022 06:10
@lgfa29 lgfa29 self-assigned this Jun 18, 2022
lgfa29 added 4 commits June 18, 2022 03:01
@@ -11403,8 +11419,9 @@ type PlanResult struct {

 // IsNoOp checks if this plan result would do nothing
 func (p *PlanResult) IsNoOp() bool {
-	return len(p.NodeUpdate) == 0 && len(p.NodeAllocation) == 0 &&
-		len(p.DeploymentUpdates) == 0 && p.Deployment == nil
+	return len(p.IneligibleNodes) == 0 && len(p.NodeUpdate) == 0 &&
Contributor Author

This new check will cause the no-op fast-path to be skipped in cases where it previously wouldn't, so it may be worth double-checking that nothing will break.

@lgfa29 lgfa29 added the backport/1.3.x backport to 1.3.x release line label Jun 18, 2022
Member

@schmichael schmichael left a comment

Reviewed all but the core algorithm. Will get that done ASAP.

Member

@tgross tgross left a comment

This is looking great @lgfa29! I've left a few suggestions that could tighten it up.

@@ -149,10 +149,29 @@ While it is possible for these log lines to occur infrequently due to normal
cluster conditions, they should not appear repeatedly and prevent the job from
eventually running (look up the evaluation ID logged to find the job).

-If this log *does* appear repeatedly with the same `node_id` referenced, try
+Nomad tracks the history of plan rejections per client and will mark it as
Member

We don't do this by default, so we should probably note that you can specifically enable this feature here.

Contributor Author

Ah yes, the enabled = false was added later and I forgot to update the docs. Thanks!

Comment on lines 20 to 28
// NoopBadNodeTracker is a no-op implementation of bad node tracker that is
// used when tracking is disabled.
type NoopBadNodeTracker struct{}

func (n *NoopBadNodeTracker) Add(string) {}
func (n *NoopBadNodeTracker) EmitStats(time.Duration, <-chan struct{}) {}
func (n *NoopBadNodeTracker) IsBad(string) bool {
	return false
}
Member

I love that we have a "can't possibly break anything" version of the struct for folks who haven't opted to enable it.

Contributor Author

For the first implementation I was just setting the rate limiter to zero, but there was still quite a bit of code running in a critical path, so I chose this no-op approach instead. Can't have bugs if there is no code to run 😄
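
As an aside, a minimal sketch of the pattern being described here: an interface with a no-op implementation that is selected when the feature is disabled, so the disabled path runs essentially no code (the interface shape is an assumption based on the methods shown above):

```go
package main

import (
	"fmt"
	"time"
)

// BadNodeTracker captures the behavior the plan applier depends on. The
// method names follow the snippet above; the exact signatures are assumptions.
type BadNodeTracker interface {
	Add(nodeID string)
	IsBad(nodeID string) bool
	EmitStats(period time.Duration, stopCh <-chan struct{})
}

// NoopBadNodeTracker does nothing, so the disabled path adds no work (and no
// bugs) to the plan applier's critical path.
type NoopBadNodeTracker struct{}

func (NoopBadNodeTracker) Add(string)                               {}
func (NoopBadNodeTracker) IsBad(string) bool                        { return false }
func (NoopBadNodeTracker) EmitStats(time.Duration, <-chan struct{}) {}

func main() {
	// When plan rejection tracking is disabled, the server would hold a
	// NoopBadNodeTracker behind the interface; when enabled, the
	// CachedBadNodeTracker from this PR would be used instead.
	var tracker BadNodeTracker = NoopBadNodeTracker{}

	tracker.Add("node-1")
	fmt.Println(tracker.IsBad("node-1")) // always false: tracking disabled
}
```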

	// Limit the number of nodes we report as bad to avoid mass assigning nodes
	// as ineligible, but do it after Get to keep the cache entry fresh.
	if !c.limiter.Allow() {
		logger.Info("returning false due to rate limiting")
Member

Nitpick: info is kind of a weird log level because it's not clear how actionable it is. I try to reserve it for the top-level HTTP API and the agent startup. This strikes me as something that should be a warning.

I also wonder if we want to put the rate limit here or after we score? Checking the score seems ok to do (it's all cheap and in-memory and doesn't mutate state).

Contributor Author

Yeah, good point. I think I will actually downgrade to Debug along with the others; there really isn't much value in knowing this since it should be a temporary state. With nomad monitor it's easy to consume the leader log at a lower log level if needed.

And nice catch on moving the rate limit. Calling Allow here will consume a token even if the score is below the threshold, which ultimately wastes it.
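
A sketch of the ordering agreed on here, using golang.org/x/time/rate: compute the score first and only consume a rate-limiter token once the node is actually over the threshold (function name and values are illustrative):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// isBad is a simplified stand-in for the tracker check: score first, then
// rate-limit, so tokens are only spent on nodes that are actually over the
// threshold.
func isBad(score, threshold int, limiter *rate.Limiter) bool {
	if score < threshold {
		return false // no token consumed for healthy nodes
	}
	// Cap how many nodes can be marked ineligible per unit of time.
	return limiter.Allow()
}

func main() {
	// At most one node marked bad per 10 minutes, with a burst of 1.
	limiter := rate.NewLimiter(rate.Every(10*time.Minute), 1)

	fmt.Println(isBad(3, 15, limiter))  // false: under threshold, token kept
	fmt.Println(isBad(20, 15, limiter)) // true: over threshold, token spent
	fmt.Println(isBad(20, 15, limiter)) // false: rate limited
}
```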

// the time window.
func (c *CachedBadNodeTracker) IsBad(nodeID string) bool {
	logger := c.logger.With("node_id", nodeID)
	logger.Debug("checking if node is bad")
Member

Having a debug log line for every branch of this method seems really noisy, and on a busy cluster they'll get interleaved with other logs. Maybe one log line at the caller would reduce the noise?

Contributor Author

Makes sense. I reduced the number of log messages and dropped their level to Trace since this should just be doing its job 😅

If something is supposed to be happening but isn't, we can monitor the leader.

Comment on lines 99 to 100
logger.Debug("node not in cache")
return false
Member

It looks like we should never hit this because calling Add first is an invariant of the design. We could return an error here instead, or, even better, change the construction of Add and IsBad so that we both add and check in the same call in plan_apply.go.

Contributor Author

Yeah that makes sense. I modified the interface so it only exposes Add and EmitMetrics. isBad now takes the object in the cache that Add created or retrieved, so there's no need for this check anymore 👍
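
Roughly the call shape after merging the two methods: a single Add both records the rejection and answers whether the node should be marked ineligible. A simplified, self-contained sketch (the real CachedBadNodeTracker uses an LRU cache and a rate limiter; a plain map stands in here):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker sketches the merged interface: one Add call records the plan
// rejection and reports whether the node is now over the threshold.
type tracker struct {
	mu         sync.Mutex
	rejections map[string][]time.Time
	window     time.Duration
	threshold  int
}

// Add records a plan rejection for nodeID and returns true if the number of
// rejections inside the window crossed the threshold.
func (t *tracker) Add(nodeID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	now := time.Now()
	t.rejections[nodeID] = append(t.rejections[nodeID], now)

	count := 0
	for _, ts := range t.rejections[nodeID] {
		if now.Sub(ts) <= t.window {
			count++
		}
	}
	return count >= t.threshold
}

func main() {
	tr := &tracker{
		rejections: map[string][]time.Time{},
		window:     10 * time.Minute,
		threshold:  3,
	}
	fmt.Println(tr.Add("node-1")) // false
	fmt.Println(tr.Add("node-1")) // false
	fmt.Println(tr.Add("node-1")) // true: third rejection inside the window
}
```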

Comment on lines 198 to 203
// record adds a new entry to the stats history and returns the new score.
func (s *badNodeStats) record() int {
	now := time.Now()
	s.history = append(s.history, now)
	return s.score()
}
Member

On large clusters I'd expect the LRU would age-out most of the tracked nodes eventually. But I think there's a pathological case here where a stable cluster with 50 or fewer nodes will continue to grow the size of the badNodeStats.history forever, so long as none of the nodes get plan rejections more frequently than the window allows.

It's a small bit of data (24 bytes each rejection), but we could eliminate it entirely if we dropped the old entries either when we called record() or when we called score().

Contributor Author

Good catch! I made it so score will remove entries that happened before the window start, so they will never be considered any more.
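
A sketch of the pruning described: score drops entries older than the window while counting, so the per-node history cannot grow without bound (field and method names follow the discussion; the exact code differs):

```go
package main

import (
	"fmt"
	"time"
)

// badNodeStats is a sketch of the per-node rejection history; the window
// field and the method signatures here are assumptions.
type badNodeStats struct {
	history []time.Time
	window  time.Duration
}

// record adds a rejection at time t.
func (s *badNodeStats) record(t time.Time) {
	s.history = append(s.history, t)
}

// score returns how many rejections fall inside the window ending at now,
// dropping expired entries so the history stays bounded.
func (s *badNodeStats) score(now time.Time) int {
	windowStart := now.Add(-s.window)

	active := s.history[:0]
	for _, t := range s.history {
		if t.After(windowStart) {
			active = append(active, t)
		}
	}
	s.history = active
	return len(active)
}

func main() {
	s := &badNodeStats{window: time.Minute}
	now := time.Now()
	s.record(now.Add(-2 * time.Minute)) // outside the window
	s.record(now.Add(-30 * time.Second))
	s.record(now)
	fmt.Println(s.score(now)) // 2; the expired entry is also pruned
}
```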

Comment on lines 133 to 155
func TestBadNodeStats_score(t *testing.T) {
	ci.Parallel(t)

	window := time.Duration(testutil.TestMultiplier()) * time.Second
	stats := newBadNodeStats(window)

	require.Equal(t, 0, stats.score())

	stats.record()
	stats.record()
	stats.record()
	require.Equal(t, 3, stats.score())

	time.Sleep(window / 2)
	stats.record()
	require.Equal(t, 4, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 1, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 0, stats.score())
}
Member

Something that can make tests faster and more reliable is to have the score() and record() methods take a timestamp as their argument. The caller in Add/IsBad can pass it time.Now() so it works the same way, but it's testable by constructing timestamps in the test without needing to sleep.

Contributor Author

Nice! I updated what I could, but I didn't find a way to do something like this with the rate limiter, so it still sleeps for 1s.
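
Reusing the badNodeStats sketch above (placed in the same package), a test along the lines suggested can construct timestamps instead of sleeping; this is a hypothetical test, not the one in this PR:

```go
package main

import (
	"testing"
	"time"
)

func TestBadNodeStats_scoreWithoutSleep(t *testing.T) {
	stats := &badNodeStats{window: time.Minute}

	t0 := time.Date(2022, 7, 12, 0, 0, 0, 0, time.UTC)
	stats.record(t0)
	stats.record(t0.Add(10 * time.Second))
	stats.record(t0.Add(20 * time.Second))

	// All three records fall inside the one-minute window.
	if got := stats.score(t0.Add(30 * time.Second)); got != 3 {
		t.Fatalf("expected score 3, got %d", got)
	}
	// 65s after t0 the first record has aged out of the window.
	if got := stats.score(t0.Add(65 * time.Second)); got != 2 {
		t.Fatalf("expected score 2, got %d", got)
	}
	// Far enough in the future, everything has expired.
	if got := stats.score(t0.Add(time.Hour)); got != 0 {
		t.Fatalf("expected score 0, got %d", got)
	}
}
```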

lgfa29 added 3 commits July 8, 2022 14:39
Simplify the interface for `BadNodeTracker` by merging the methods `Add`
and `IsBad` since they are always called in tandem, and reduce the number
and level of log messages generated. Also clean up expired records to
avoid infinite growth, since the cache entry never expires.

Take explicit timestamp to make tests faster and more reliable.
Member

@tgross tgross left a comment

LGTM 👍

// tracker.
type PlanRejectionTracker struct {
	// Enabled controls if the plan rejection tracker is active or not.
	Enabled bool `hcl:"enabled"`
Member

If we ever want to default this to true we'll have to change the type to a pointer (*bool) to differentiate unset vs explicitly-false. That shouldn't be a problem though since unlike objects sent over RPCs we never have to worry about an agent on the new version sending a nil to an agent still expecting a true or false.

Contributor Author

Good point, I may as well do the work now 😄
f5936f0
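
A small sketch of the *bool approach referenced here: nil means the operator never set the field, so a default can be applied without clobbering an explicit false (the helper and method names are illustrative, not the PR's):

```go
package main

import "fmt"

// PlanRejectionTracker with Enabled as a pointer so "unset" (nil) can be told
// apart from an explicit false.
type PlanRejectionTracker struct {
	Enabled *bool `hcl:"enabled"`
}

func boolPtr(b bool) *bool { return &b }

// withDefaults fills in the default only when the operator did not set the field.
func (t *PlanRejectionTracker) withDefaults(defaultEnabled bool) {
	if t.Enabled == nil {
		t.Enabled = boolPtr(defaultEnabled)
	}
}

func main() {
	unset := &PlanRejectionTracker{}
	unset.withDefaults(false)
	fmt.Println(*unset.Enabled) // false (default applied)

	explicit := &PlanRejectionTracker{Enabled: boolPtr(false)}
	explicit.withDefaults(true)
	fmt.Println(*explicit.Enabled) // false (explicit value preserved)
}
```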

@@ -359,6 +370,22 @@ func (s *StateStore) UpsertPlanResults(msgType structs.MessageType, index uint64
	txn := s.db.WriteTxnMsgT(msgType, index)
	defer txn.Abort()

	// Mark nodes as ineligible.
	now := time.Now().Unix()
Member

Since UpsertPlanResults is called by the FSM we cannot check wall-clock time here, as it will differ on each server and create skew in their state stores. Since it's only informational there shouldn't be a negative functional impact, but it's still preferable to keep our FSMs deterministic on every server.

Contributor Author

Fixed in fb2e761
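
A sketch of the fix described: read the wall clock once on the leader when building the Raft request, and have the FSM consume only the timestamp carried in the request so every server applies the same value (the type and field names here are illustrative, not Nomad's actual structs):

```go
package main

import (
	"fmt"
	"time"
)

// ApplyPlanResultsRequest stands in for the replicated Raft payload; the
// timestamp travels inside it.
type ApplyPlanResultsRequest struct {
	IneligibleNodes []string
	UpdatedAt       int64 // unix seconds, set by the leader at request time
}

// buildRequest runs on the leader's plan applier, outside the FSM, so calling
// time.Now() here is safe.
func buildRequest(ineligible []string) ApplyPlanResultsRequest {
	return ApplyPlanResultsRequest{
		IneligibleNodes: ineligible,
		UpdatedAt:       time.Now().Unix(),
	}
}

// applyPlanResults models the FSM/state-store side: it stays deterministic
// because it only uses data from the request.
func applyPlanResults(req ApplyPlanResultsRequest) {
	for _, nodeID := range req.IneligibleNodes {
		fmt.Printf("marking node %s ineligible at %d\n", nodeID, req.UpdatedAt)
	}
}

func main() {
	req := buildRequest([]string{"node-1"})
	applyPlanResults(req) // every server applying this entry writes the same timestamp
}
```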

lgfa29 added 2 commits July 12, 2022 15:33
Set the timestamp for a plan apply operation at request time to avoid
non-deterministic operations in the FSM.
Using a pointer allows us to differentiate between an unset value and an
explicit `false` if we decide to use `true` by default.
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022
Labels
backport/1.3.x backport to 1.3.x release line