scheduler: fix reconciliation of reconnecting allocs #16609
Conversation
When a disconnected client reconnects, the `allocReconciler` must find the allocations that were created to replace the original disconnected allocations.

This process was only being done on a subset of non-terminal untainted allocations, meaning that if the replacement allocations were not in this state, the reconciler didn't stop them, leaving the job in an inconsistent state. This inconsistency is only resolved by a future job evaluation, but at that point the allocation is considered reconnected, so the specific reconnection logic is not applied, leading to unexpected outcomes.

This commit fixes the problem by running the reconnecting allocation reconciliation logic earlier in the process, leaving the rest of the reconciler oblivious to reconnecting allocations. It also uses the full set of allocations to search for replacements, stopping them even if they are not in the `untainted` set.

The `SystemScheduler` is not affected by this bug because disconnected clients don't trigger replacements: every eligible client is already running an allocation.

Closes #15483
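As a rough illustration of why searching the full set matters, here is a minimal self-contained sketch; the types and names are stand-ins, not the actual reconciler code:

```go
package main

import "fmt"

// Minimal stand-ins for illustration; the real types live in nomad/structs
// and the real logic in scheduler/reconcile.go.
type alloc struct {
	id       string
	previous string // ID of the allocation this one replaced
	terminal bool
}

// findReplacements searches the FULL set of allocations for replacements of
// `original`, which is the shape of the fix: a terminal replacement is still
// found and can be stopped, whereas an untainted-only search would miss it.
func findReplacements(original alloc, all []alloc) []alloc {
	var replacements []alloc
	for _, a := range all {
		if a.previous == original.id {
			replacements = append(replacements, a)
		}
	}
	return replacements
}

func main() {
	original := alloc{id: "orig"}
	all := []alloc{
		original,
		{id: "r1", previous: "orig", terminal: true}, // missed by an untainted-only search
		{id: "r2", previous: "orig", terminal: false},
	}
	for _, r := range findReplacements(original, all) {
		fmt.Println(r.id, r.terminal) // r1 true, r2 false -- both found
	}
}
```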
```go
	desiredChanges.Stop += uint64(len(stop))
	untainted = untainted.difference(stop)

	// Validate and add reconnecting allocs to the plan so that they will be logged.
	a.computeReconnecting(reconnecting)
	desiredChanges.Ignore += uint64(len(a.result.reconnectUpdates))
```
I haven't tracked down exactly why, but pulling this line to the reconnecting `if` block above causes a miscount of `Ignore`.
I've spent an hour looking at it and can't figure it out either. You mentioned there was a bug around that `remove` variable decrementing, maybe that was resulting in a change to the ignore set?
Oh, I see it now. `computeUpdates` finds allocs to ignore from the `untainted` set. Before, the reconnecting allocs were not included there, but now they are (if not stopped), so if we increment `Ignore` during the reconnect reconciliation we will double-count the reconnecting allocs.
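A toy example with assumed counts to make the double-count concrete:

```go
package main

import "fmt"

func main() {
	// Hypothetical counts for illustration only.
	keptReconnecting := 3 // reconnecting allocs that are kept, so they stay in untainted
	otherIgnored := 2     // unrelated untainted allocs that computeUpdates ignores

	// computeUpdates derives Ignore from the untainted set, which now
	// includes the kept reconnecting allocs.
	ignore := otherIgnored + keptReconnecting // 5: each alloc counted once

	// Incrementing Ignore again inside the reconnecting block would add the
	// same allocs a second time.
	doubleCounted := ignore + keptReconnecting
	fmt.Println(ignore, doubleCounted) // 5 8 -- the extra 3 is the double count
}
```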
```go
	remove--
	// if we've removed all we need to, stop iterating and return.
	if remove == 0 {
		return remove
	}
```
I think this may have been a bug as well, where `remove` was miscalculated so we were only stopping a fixed number of replacements, not all of them. But I didn't investigate further since we don't do this anymore.
Good call: removing this logic lets `computeStop` do its job to reduce the total count if we need to, without any spooky action at a distance.
```go
	replacementHasBetterScore := originalMaxScoreMeta == nil && replacementMaxScoreMeta != nil ||
		(originalMaxScoreMeta != nil && replacementMaxScoreMeta != nil &&
			replacementMaxScoreMeta.NormScore > originalMaxScoreMeta.NormScore)
```
The existing implementation skips allocations without metrics, which I think may also be a source of inconsistent state. I don't know when `MaxNormScore` could be `nil`, but I turned those errors into a "best effort" check by preferring the original alloc unless enough metrics are available.
Because these allocations are always existing allocations that have been through the whole scheduler once already, I suspect `NormScore` is only ever `nil` in test scenarios, and that check was put in to catch panics caused by bad test setup. I don't think it makes much of a difference, and this code seems a little nicer. 👍
```go
	// Check if the replacement has better client status.
	// Even with a better placement score make sure we don't replace a running
	// allocation with one that is not.
	replacementIsRunning := replacement.ClientStatus == structs.AllocClientStatusRunning
	originalNotRunning := original.ClientStatus != structs.AllocClientStatusRunning
```
This is a new rule I added to prevent a replacement when the original alloc is still running but the replacement is not, even if the replacement has a better placement metric.
I first added it as an attempt to fix the original bug, but it turned out the problem was something else. So I don't think this is strictly necessary, but it kind of makes sense to not replace a running alloc with one that is not, no?
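A self-contained sketch of how this guard plausibly composes with the score comparison; the function name and shape are assumptions, not the actual code:

```go
package main

import "fmt"

const running = "running"

// pickAfterScore: a better-scored replacement only wins when choosing it
// would not trade a running allocation for a non-running one.
func pickAfterScore(originalStatus, replacementStatus string, replacementHasBetterScore bool) string {
	replacementIsRunning := replacementStatus == running
	originalNotRunning := originalStatus != running
	if replacementHasBetterScore && (replacementIsRunning || originalNotRunning) {
		return "replacement"
	}
	return "original"
}

func main() {
	// Original running, replacement pending but scored higher: keep the original.
	fmt.Println(pickAfterScore(running, "pending", true)) // original
	// Both running, replacement scored higher: the score decides.
	fmt.Println(pickAfterScore(running, running, true)) // replacement
}
```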
There are three sets of client states to worry about here: `pending` (for the replacement only), `running`, and terminal. We also have three checks in the caller that influence whether we ever even call this function: the `reconnecting.filterByFailedReconnect()` check (line 1082R), the `stopReconnecting` check (line 1090R), and the `replacementAlloc.ServerTerminalStatus()` check (line 1131R). That gives us the following table:
| original client status | replacement client status | replacement server status | what to do? | done in this PR? |
|---|---|---|---|---|
| running | running | run | keep highest score, stop the other | ✔️ |
| running | running | stop | keep original, stop replacement | ✔️ |
| running | pending | run | keep highest score, stop the other | ✔️ |
| running | pending | stop | keep original, stop replacement | ✔️ |
| running | terminal | run | keep original, stop replacement | ✔️ |
| running | terminal | stop | keep original, stop replacement | ✔️ |
| terminal | running | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | running | stop | stop original, reconcile replacement | ✔️ |
| terminal | pending | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | pending | stop | stop original, reconcile replacement | ✔️ |
| terminal | terminal | run | stop original, reconcile replacement | ✔️ |
| terminal | terminal | stop | stop original, reconcile replacement | ✔️ |
So it looks like the logic is correct except for maybe the case where the original is filtered by `filterByFailedReconnect` and the replacement is non-terminal. I'm not sure what happens to the replacement in that case -- does it just fall through to the rest of the reconciler? In any case, I had to build a big ol' truth table to verify that for myself because the logic is split between this method and its caller. It's kinda hard to follow. Maybe we should move all these client status checks into the caller, so that we only ever call this method if we have two legitimate options?
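To make the table mechanically checkable, here is a small self-contained encoding of a few of its rows; the `decide` function is a hypothetical condensation of the table, not Nomad's actual code:

```go
package main

import "fmt"

// One row of the truth table above, plus the outcome it expects.
type row struct {
	originalClient    string
	replacementClient string
	replacementServer string
	want              string
}

// decide condenses the table: a terminal original is always stopped; a
// server-stopped or client-terminal replacement is always stopped; only
// when both are live does the score comparison matter.
func decide(r row) string {
	switch {
	case r.originalClient == "terminal":
		return "stop original, reconcile replacement"
	case r.replacementServer == "stop" || r.replacementClient == "terminal":
		return "keep original, stop replacement"
	default:
		return "keep highest score, stop the other"
	}
}

func main() {
	rows := []row{
		{"running", "running", "run", "keep highest score, stop the other"},
		{"running", "pending", "stop", "keep original, stop replacement"},
		{"running", "terminal", "run", "keep original, stop replacement"},
		{"terminal", "running", "run", "stop original, reconcile replacement"},
	}
	for _, r := range rows {
		fmt.Println(decide(r) == r.want) // true for every row
	}
}
```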
> I had to build a big ol' truth table to verify that for myself

Yup, I had to do the same thing, which should've been a signal that this logic is confusing 😅
But I looked into it again, and I'm not sure if we can move this `ClientStatus` check to the caller because it's an awkward conditional tie-breaker.
The way I approached this was using a process of elimination. There are three states in which we can safely stop the reconnecting allocation without needing to compare it with its replacements:

- desired status not `run` (so `stop` or `evict`): handled by `stopReconnecting`, but now I'm wondering if we even need to stop in this case as opposed to just skipping these allocs 🤷
- client status `failed`: handled by `filterByFailedReconnect`, but we probably don't need this anymore and can just handle it in the main loop.
That leaves five possible client statuses for the original alloc:

- `lost` and `unknown`: I don't think these are possible for a reconnecting alloc? I think `lost` is only used for groups without `max_client_disconnect` (which is by definition not the case for allocations that are reconnecting), and the allocation would not be in the `reconnecting` set if it was `unknown`, so we can probably ignore these.
- `complete`: probably a no-op? Or maybe we still need to check and stop replacements? But only if not batch? Not sure, this seems a bit undefined 😅
- `pending` and `running`: we need to compare with the replacements. If the replacement is newer we keep it, done. If the replacement has a better score we need to check the client status (see the sketch at the end of this comment).
And this is where the awkwardness lives. Outside this function we can't make a decision before comparing them, and we don't know whether the replacement was picked because it had a higher version or a better placement (well, we could check again, but that would be kind of redundant).
> I'm not sure what happens to the replacement in that case -- does it just fall through to the rest of the reconciler?

That's right. My assumption here is the idea of merging timelines. When the client reconnects, we assume the replacement allocations were being handled by the reconciler like any others, so if we don't need to stop them we just let the reconciler keep managing them as it has been.
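A rough self-contained sketch of the elimination order above, with assumed names and string statuses standing in for the structs constants:

```go
package main

import "fmt"

// handleReconnecting walks the elimination order: early-stop cases first,
// then the statuses that require comparing against replacements.
func handleReconnecting(desiredStatus, clientStatus string) string {
	switch {
	case desiredStatus != "run": // stop or evict: no comparison needed
		return "stop (was stopReconnecting)"
	case clientStatus == "failed": // no comparison needed either
		return "stop (was filterByFailedReconnect)"
	case clientStatus == "lost", clientStatus == "unknown":
		return "shouldn't be reachable for a reconnecting alloc"
	case clientStatus == "complete":
		return "probably a no-op? a bit undefined"
	default: // pending or running
		return "compare with replacements: job version, then client status, then score"
	}
}

func main() {
	fmt.Println(handleReconnecting("run", "running"))
	fmt.Println(handleReconnecting("stop", "running"))
	fmt.Println(handleReconnecting("run", "failed"))
}
```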
> client status `failed`: handled by `filterByFailedReconnect`, but we probably don't need this anymore and can just handle it in the main loop.
The `filterByFailedReconnect` should pick up the `complete` case as well because that's terminal.
> `pending` and `running`: we need to compare with the replacements. If the replacement is newer we keep it, done. If the replacement has a better score we need to check the client status.
I probably could have explained it better, but I think we only need to compare if the replacement is running. So in other words, the check we're doing here could come before we compare scores and return `original` early. (And at that point, it could probably be hoisted up to the caller instead.)
> The `filterByFailedReconnect` should pick up the `complete` case as well because that's terminal.

It only looks for not-server-terminal and `failed` explicitly, so I think `complete` is not handled:
nomad/scheduler/reconcile_util.go, line 529 in 9a2fdb5:

```go
	if !alloc.ServerTerminalStatus() && alloc.ClientStatus == structs.AllocClientStatusFailed {
```
> I think we only need to compare if the replacement is running

...and has the same job version. A new job version always wins because we don't want to keep outdated allocs, even if they are the original ones. So the client status check can be done before the score check, but it can't be done before the job version check.
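A self-contained sketch of the resulting check ordering, under the assumption that it goes job version, then client status, then score; the names and shapes are illustrative, not the actual implementation:

```go
package main

import "fmt"

type candidate struct {
	jobVersion uint64
	running    bool
	score      float64
}

// pick: a newer job version always wins; on a tie, don't let score demote a
// running alloc to a non-running one; only then does the score decide.
func pick(original, replacement candidate) string {
	if replacement.jobVersion > original.jobVersion {
		return "replacement" // newer job version always wins
	}
	if original.jobVersion > replacement.jobVersion {
		return "original"
	}
	if original.running && !replacement.running {
		return "original" // same version: keep the running alloc
	}
	if replacement.score > original.score {
		return "replacement"
	}
	return "original"
}

func main() {
	// A newer replacement wins even if it's not running yet.
	fmt.Println(pick(candidate{1, true, 0.9}, candidate{2, false, 0.1})) // replacement
	// Same version: a better-scored but non-running replacement loses.
	fmt.Println(pick(candidate{1, true, 0.5}, candidate{1, false, 0.9})) // original
}
```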
Ah, good catch. Yeah it seems like there's no way to hoist this up.
```diff
@@ -5490,14 +5535,15 @@ func TestReconciler_Disconnected_Client(t *testing.T) {
 			},
 		},
 		{
-			name: "stop-original-alloc-with-old-job-version-and-failed-replacements",
+			name: "stop-original-alloc-with-old-job-version-and-failed-replacements-replaced",
```
I thought this test was redundant because, in the end, the replacements of the replacements were running, but it exercises the path where a reconnecting alloc has multiple replacements.
```go
	nextReplacement.PreviousAllocation = replacement.ID

	replacement.NextAllocation = nextReplacement.ID
	replacement.DesiredStatus = structs.AllocDesiredStatusStop
```
This was missing but I'm pretty sure we stop the previous allocation once it has been replaced 🤔
```go
	// Find replacement allocations and decide which one to stop. A
	// reconnecting allocation may have multiple replacements.
	for _, replacementAlloc := range others {
```
I think I also need to check that `replacementAlloc.CreateIndex > reconnectingAlloc.CreateIndex`, otherwise in a scenario where both the original and one of its replacements are reconnecting, and they tie on all selection criteria, I think one alloc will stop the other because they will both think they are the original.
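A minimal sketch of the suggested guard, assuming a simplified allocation shape (the real check would use the `CreateIndex` field on `structs.Allocation`):

```go
package main

import "fmt"

type alloc struct {
	id          string
	createIndex uint64
}

// replacementsOf only treats newer allocations as replacements, so when two
// reconnecting allocs tie on every selection criterion, only the older one
// can stop the newer -- never the other way around.
func replacementsOf(reconnecting alloc, others []alloc) []alloc {
	var out []alloc
	for _, o := range others {
		if o.createIndex <= reconnecting.createIndex {
			continue // older or same-age: not a replacement of this alloc
		}
		out = append(out, o)
	}
	return out
}

func main() {
	orig := alloc{"orig", 10}
	repl := alloc{"repl", 20}
	fmt.Println(replacementsOf(orig, []alloc{repl})) // [{repl 20}]
	fmt.Println(replacementsOf(repl, []alloc{orig})) // [] -- orig never counts as repl's replacement
}
```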
Yup! Fixed in 0b97e34.
This looks great overall. I've left a large and rambling comment about the legibility of the status checks we're doing that you might want to consider.
If the replacement for a reconnecting allocation is also reconnecting, we need to make sure we only compare the original with the replacement, and not the other way around; otherwise the replacement may stop the original if they tie on the selection criteria.
LGTM, nice work @lgfa29!
Since we are now already iterating over the reconnecting allocations in a specific method, having a separate loop to find failed allocations is unnecessary.
Doing some more edge case testing I thought I broke canary deployments, but it turned out to be an existing bug, so I filed #16644.