Track plan rejection history and automatically mark clients as ineligible #13421
Conversation
Force-pushed from cf278ec to 3bd41ea
Force-pushed from 3bd41ea to 1456a23
Force-pushed from 1456a23 to a1a8bb9
Force-pushed from a1a8bb9 to 33c4b29
NodeThreshold int `hcl:"node_threshold"`

// NodeWindow is the time window used to track active plan rejections for
// nodes.
NodeWindow    time.Duration
NodeWindowHCL string `hcl:"node_window" json:"-"`
I'm not sure about these names, but I wanted to hedge against us potentially needing to track plan rejections for other things, like evals or jobs. Feel free to modify them.
command/agent/config.go
Outdated
PlanRejectionTracker: &PlanRejectionTracker{
	NodeThreshold: 15,
	NodeWindow:    10 * time.Minute,
},
I'm also not sure about these defaults. The workers will retry the plan with exponential back-off, so depending on the cluster load and eval queue size it may take a while for the node to reach the threshold.
So I picked these values based on a threshold number that is likely to indicate a problem (it could maybe even be a little higher?), not just normal plan rejections, and a time window that encompasses a significant part of the rejection history but that will probably have expired by the time operators have detected, drained, and restarted the client.
A short time window risks the threshold never being hit, while a large one risks a normal plan rejection triggering ineligibility since the node score can still be high.
Agreed. Let's try to dig up some real world nomad.nomad.plan.node_rejected metrics to guide this. Better to be a bit conservative.
If we moved this into SchedulerConfig it would ease online updating and we could be very conservative with the defaults. Not sure how aggressively we should jam things into there, but this is at least a scheduler-related parameter.
From our out of band discussion: let's leave this in the config file. If you're fiddling with it live, you can just mark bad nodes ineligible with existing commands and fiddle with these settings after your cluster is healthy.
Hopefully we can find good enough defaults (not to mention root cause fixes!) that no one needs to worry about this anyway.
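For reference, here is a sketch of how this could look in the agent configuration file, assuming the HCL block mirrors the PlanRejectionTracker struct above (block and attribute names follow the hcl tags; the values are the ones under discussion, not settled defaults):

```hcl
server {
  enabled = true

  # Opt-in plan rejection tracking; thresholds here are illustrative.
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 15
    node_window    = "10m"
  }
}
```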
Force-pushed from 33c4b29 to 2db5edc
@@ -11403,8 +11419,9 @@ type PlanResult struct {

 // IsNoOp checks if this plan result would do nothing
 func (p *PlanResult) IsNoOp() bool {
-	return len(p.NodeUpdate) == 0 && len(p.NodeAllocation) == 0 &&
-		len(p.DeploymentUpdates) == 0 && p.Deployment == nil
+	return len(p.IneligibleNodes) == 0 && len(p.NodeUpdate) == 0 &&
This new check will cause the no-op fast-path to be skipped in cases where it previously wouldn't, so it may be worth double-checking that nothing will break.
Reviewed all but the core algorithm. Will get that done ASAP.
This is looking great @lgfa29! I've left a few suggestions that could tighten it up.
@@ -149,10 +149,29 @@ While it is possible for these log lines to occur infrequently due to normal
 cluster conditions, they should not appear repeatedly and prevent the job from
 eventually running (look up the evaluation ID logged to find the job).

-If this log *does* appear repeatedly with the same `node_id` referenced, try
+Nomad tracks the history of plan rejections per client and will mark it as
We don't do this by default, so we should probably note that you can specifically enable this feature here.
Ah yes, the enabled = false was added later and I forgot to update the docs. Thanks!
nomad/plan_apply_node_tracker.go
Outdated
// NoopBadNodeTracker is a no-op implementation of bad node tracker that is
// used when tracking is disabled.
type NoopBadNodeTracker struct{}

func (n *NoopBadNodeTracker) Add(string)                               {}
func (n *NoopBadNodeTracker) EmitStats(time.Duration, <-chan struct{}) {}
func (n *NoopBadNodeTracker) IsBad(string) bool {
	return false
}
I love that we have a "can't possibly break anything" version of the struct for folks who haven't opted to enable it.
For the first implementation I was just setting the rate limiter to zero, but there was still quite a bit of code running in a critical path, so I chose this no-op approach instead. Can't have bugs if there is no code to run 😄
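For reference, the interface both implementations presumably satisfy, inferred from the no-op snippet above (the Add/IsBad split gets simplified later in this review, so the real code may differ):

```go
// BadNodeTracker is the interface shared by the cached tracker and the
// NoopBadNodeTracker above; the method set is inferred from that snippet and
// may not match the final code exactly.
type BadNodeTracker interface {
	Add(nodeID string)
	IsBad(nodeID string) bool
	EmitStats(period time.Duration, stopCh <-chan struct{})
}
```

When tracking is disabled the plan applier is handed the no-op implementation, so the disabled path has essentially no tracking code to run.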
nomad/plan_apply_node_tracker.go
Outdated
// Limit the number of nodes we report as bad to avoid mass assigning nodes
// as ineligible, but do it after Get to keep the cache entry fresh.
if !c.limiter.Allow() {
	logger.Info("returning false due to rate limiting")
Nitpick: info is kind of a weird log level because it's not clear how actionable it is. I try to reserve it for the top-level HTTP API and the agent startup. This strikes me as something that should be a warning.
I also wonder if we want to put the rate limit here or after we score? Checking the score seems ok to do (it's all cheap and in-memory and doesn't mutate state).
Yeah, good point. I think I will actually downgrade to Debug along with the others; there really isn't much value in knowing this since it should be a temporary state. With nomad monitor it's easy to consume the leader log at a lower log level if needed.
And nice catch on moving the rate limit. Calling Allow here will consume a token even if the score is below the threshold, which ultimately wastes it.
nomad/plan_apply_node_tracker.go
Outdated
// the time window.
func (c *CachedBadNodeTracker) IsBad(nodeID string) bool {
	logger := c.logger.With("node_id", nodeID)
	logger.Debug("checking if node is bad")
Having a debug log line for every branch of this method seems really noisy, and on a busy cluster they'll get interleaved with other logs. Maybe one log line at the caller would reduce the noise?
Makes sense. I reduced the number of log messages and dropped their level to Trace since this should just be doing its job 😅 If something is supposed to be happening but isn't, we can monitor the leader.
nomad/plan_apply_node_tracker.go
Outdated
logger.Debug("node not in cache") | ||
return false |
It looks like we should never hit this because calling Add first is an invariant of the design. We could return an error here instead, or, even better, change the construction of Add and IsBad so that we both add and check in the same call in plan_apply.go.
Yeah that makes sense. I modified the interface so it only exposes Add and EmitMetrics. isBad now takes the object in the cache that Add created or retrieved, so there's no need for this check anymore 👍
nomad/plan_apply_node_tracker.go
Outdated
// record adds a new entry to the stats history and returns the new score.
func (s *badNodeStats) record() int {
	now := time.Now()
	s.history = append(s.history, now)
	return s.score()
}
On large clusters I'd expect the LRU would age-out most of the tracked nodes eventually. But I think there's a pathological case here where a stable cluster with 50 or fewer nodes will continue to grow the size of the badNodeStats.history forever, so long as none of the nodes get plan rejections more frequently than the window allows.
It's a small bit of data (24 bytes each rejection), but we could eliminate it entirely if we dropped the old entries either when we called record() or when we called score().
Good catch! I made it so score will remove entries that happened before the window start, so they will never be considered anymore.
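Pulling together the suggestions from this part of the review (merging Add and IsBad, checking the score before consuming a rate limiter token, and pruning expired history inside score), a rough self-contained sketch could look like the following. A plain map stands in for the LRU cache, timestamps are still read internally with time.Now() (a later comment suggests passing them in explicitly), and the field names are assumptions rather than the actual implementation:

```go
package nomad

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// badNodeStats tracks plan rejection timestamps for a single node.
type badNodeStats struct {
	history []time.Time
	window  time.Duration
}

func newBadNodeStats(window time.Duration) *badNodeStats {
	return &badNodeStats{window: window}
}

// record adds a new rejection at the current time and returns the new score.
func (s *badNodeStats) record() int {
	s.history = append(s.history, time.Now())
	return s.score()
}

// score counts rejections inside the window and drops expired entries so the
// history cannot grow without bound.
func (s *badNodeStats) score() int {
	windowStart := time.Now().Add(-s.window)
	recent := s.history[:0]
	for _, ts := range s.history {
		if ts.After(windowStart) {
			recent = append(recent, ts)
		}
	}
	s.history = recent
	return len(s.history)
}

// CachedBadNodeTracker sketch: a plain map stands in for the LRU cache used
// in the PR, and the field names are assumptions.
type CachedBadNodeTracker struct {
	mu        sync.Mutex
	stats     map[string]*badNodeStats
	window    time.Duration
	threshold int
	limiter   *rate.Limiter
}

// Add records a plan rejection for the node and reports whether it should be
// marked ineligible. The rate limiter is only consulted after the score
// check, so nodes below the threshold never consume a token.
func (c *CachedBadNodeTracker) Add(nodeID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	s, ok := c.stats[nodeID]
	if !ok {
		s = newBadNodeStats(c.window)
		c.stats[nodeID] = s
	}

	if s.record() < c.threshold {
		return false
	}
	// Cap how many nodes can be reported as bad at once to avoid mass
	// ineligibility across the cluster.
	return c.limiter.Allow()
}
```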
func TestBadNodeStats_score(t *testing.T) {
	ci.Parallel(t)

	window := time.Duration(testutil.TestMultiplier()) * time.Second
	stats := newBadNodeStats(window)

	require.Equal(t, 0, stats.score())

	stats.record()
	stats.record()
	stats.record()
	require.Equal(t, 3, stats.score())

	time.Sleep(window / 2)
	stats.record()
	require.Equal(t, 4, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 1, stats.score())

	time.Sleep(window / 2)
	require.Equal(t, 0, stats.score())
}
Something that can make tests faster and more reliable is to have the score() and record() methods take a timestamp as their argument. The caller in Add/IsBad can pass it time.Now() so it works the same way, but it's testable by constructing timestamps in the test without needing to sleep.
Nice! I updated what I could, but I didn't find a way to do something like this with the rate limiter, so it still sleeps for 1s.
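For illustration, assuming record and score were changed to take an explicit time.Time (record(t time.Time) int and score(t time.Time) int on the badNodeStats type sketched above), the test could construct its own timestamps and avoid sleeping entirely:

```go
package nomad

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

func TestBadNodeStats_score_noSleep(t *testing.T) {
	window := time.Minute
	stats := newBadNodeStats(window)

	t0 := time.Now()
	require.Equal(t, 0, stats.score(t0))

	// Three rejections at t0 all fall inside the window.
	stats.record(t0)
	stats.record(t0)
	stats.record(t0)
	require.Equal(t, 3, stats.score(t0))

	// Half a window later a fourth rejection still sees the first three.
	t1 := t0.Add(window / 2)
	stats.record(t1)
	require.Equal(t, 4, stats.score(t1))

	// Just past a full window from t0, only the rejection at t1 remains.
	t2 := t0.Add(window + time.Second)
	require.Equal(t, 1, stats.score(t2))
}
```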
Simplify the interface for `BadNodeTracker` by merging the methods `Add` and `IsBad` since they are always called in tandem, and reduce the number and level of log messages generated. Also clean up expired records to avoid infinite growth when the cache entry never expires. Take an explicit timestamp to make tests faster and more reliable.
LGTM 👍
command/agent/config.go
Outdated
// tracker.
type PlanRejectionTracker struct {
	// Enabled controls if the plan rejection tracker is active or not.
	Enabled bool `hcl:"enabled"`
If we ever want to default this to true we'll have to change the type to a pointer (*bool) to differentiate unset vs explicitly-false. That shouldn't be a problem though, since unlike objects sent over RPCs we never have to worry about an agent on the new version sending a nil to an agent still expecting a true or false.
Good point, I may as well do the work now 😄
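A small sketch of why the pointer helps (the helper name is hypothetical): an unset value can take whatever default is chosen later, while an explicit enabled = false is always honored.

```go
// trackerEnabled resolves the effective value of the Enabled field.
// Hypothetical helper for illustration only.
func trackerEnabled(enabled *bool) bool {
	if enabled == nil {
		return false // current default; could flip to true later without breaking opt-outs
	}
	return *enabled
}
```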
f5936f0
nomad/state/state_store.go
Outdated
@@ -359,6 +370,22 @@ func (s *StateStore) UpsertPlanResults(msgType structs.MessageType, index uint64
 	txn := s.db.WriteTxnMsgT(msgType, index)
 	defer txn.Abort()

+	// Mark nodes as ineligible.
+	now := time.Now().Unix()
Since UpsertPlanResults is called by the FSM we cannot check wallclock time here, as it will differ on each server and create skew in their state stores. Since it's only informational there shouldn't be a negative functional impact, but it's still preferable to keep our FSMs deterministic on every server.
Fixed in fb2e761
Set the timestamp for a plan apply operation at request time to avoid non-deterministic operations in the FSM.
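A sketch of the idea with a stand-in request type (the real struct and field names differ): the wallclock read happens once, on the leader, when the Raft request is built, and the FSM only copies that value, so every server's state store ends up with an identical timestamp.

```go
// planApplyRequest stands in for the real Raft request type.
type planApplyRequest struct {
	IneligibleNodes []string
	UpdatedAt       int64
}

// newPlanApplyRequest stamps the timestamp at request time. The FSM never
// calls time.Now() itself; it only persists req.UpdatedAt, keeping the apply
// deterministic across servers.
func newPlanApplyRequest(ineligible []string) *planApplyRequest {
	return &planApplyRequest{
		IneligibleNodes: ineligible,
		UpdatedAt:       time.Now().Unix(),
	}
}
```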
Using a pointer allows us to differentiate between a non-set value and an explicit `false` if we decide to use `true` by default.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Plan rejections occur when the scheduler workers and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. By the time the final plan reaches the leader plan applier, it may no longer be valid due to concurrent scheduling taking up the intended resources. In these situations the plan applier notifies the worker that the plan was rejected and that it should refresh its state before trying again.
In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even though it is always rejected.
While the root cause is still unknown, this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration.
Closes #13017
Closes #12920
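To tie the pieces together, a rough sketch of the flow described above, reusing the CachedBadNodeTracker sketched earlier in this conversation (the function name and plumbing are illustrative; the real plan applier code differs):

```go
// trackRejections records each rejected node and collects the ones that
// crossed the threshold into the plan result, so they can be marked
// ineligible when the result is applied.
func trackRejections(tracker *CachedBadNodeTracker, result *structs.PlanResult, rejected []string) {
	for _, nodeID := range rejected {
		if tracker.Add(nodeID) {
			result.IneligibleNodes = append(result.IneligibleNodes, nodeID)
		}
	}
}
```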
Note for reviewers: since we can't yet reliably reproduce this bug, the way I tested this was by applying this patch, which causes a plan to be rejected if it is evaluated by a server running with the CRASH env var set and the plan is for a client whose name starts with crash.
So, after applying the patch, start a 3-server cluster with one of the servers having the CRASH env var set and make sure that server becomes the leader. Start a client with a name starting with crash and run a job.
Monitoring the logs you should see the plan rejection messages and, after a few minutes, the client will become ineligible. You can then start a client without crash in the name to verify that job scheduling proceeds to the new client.