
scheduler: fix reconciliation of reconnecting allocs #16609

Merged (7 commits) on Mar 24, 2023

Conversation

lgfa29 (Contributor) commented Mar 22, 2023:

When a disconnected client reconnects, the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.

This search was being done over only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
that set, the reconciler didn't stop them, leaving the job in an
inconsistent state.

This inconsistency was only resolved in a later job evaluation, but by
that point the allocation was already considered reconnected, so the
reconnection-specific logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running the reconnecting-allocation
reconciliation logic earlier in the process, leaving the rest of the
reconciler oblivious to reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.

The `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.

Closes #15483
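
As a rough illustration of that last point, here is a minimal, self-contained sketch (a hypothetical helper and a simplified stand-in for the allocation struct, not the reconciler's actual code) of searching the full allocation set for replacements by following the `PreviousAllocation` link:

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for an allocation; the real reconciler
// works over Nomad's *structs.Allocation.
type alloc struct {
	ID                 string
	PreviousAllocation string
	Untainted          bool
}

// findReplacements searches the full set of allocations, not only the
// untainted subset, so a failed replacement can still be found and stopped.
func findReplacements(reconnectingID string, all []alloc) []alloc {
	var replacements []alloc
	for _, a := range all {
		if a.PreviousAllocation == reconnectingID {
			replacements = append(replacements, a)
		}
	}
	return replacements
}

func main() {
	all := []alloc{
		{ID: "original"},
		{ID: "failed-replacement", PreviousAllocation: "original", Untainted: false},
	}
	// Found even though the replacement is not in the untainted set.
	fmt.Println(len(findReplacements("original", all))) // 1
}
```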

lgfa29 changed the title from "scheduler: reconnect alloc with an failed replacement" to "scheduler: fix reconciliation of reconnecting allocs" on Mar 22, 2023
lgfa29 added the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels on Mar 22, 2023
lgfa29 requested review from schmichael, jrasell, and tgross on Mar 22, 2023
lgfa29 added this to the 1.5.x milestone on Mar 22, 2023
lgfa29 added the hcc/cst label on Mar 22, 2023
lgfa29 marked this pull request as ready for review on Mar 22, 2023
desiredChanges.Stop += uint64(len(stop))
untainted = untainted.difference(stop)

// Validate and add reconnecting allocs to the plan so that they will be logged.
a.computeReconnecting(reconnecting)
desiredChanges.Ignore += uint64(len(a.result.reconnectUpdates))
lgfa29 (Contributor, Author):

I haven't tracked down exactly why, but pulling this line up to the reconnecting `if` block above causes a miscount of `Ignore`.

Member:

I've spent an hour looking at it and can't figure it out either. You mentioned there was a bug around that `remove` variable decrementing; maybe that was resulting in a change to the ignore set?

lgfa29 (Contributor, Author):

Oh, I see it now. `computeUpdates` finds allocs to ignore from the untainted set. Previously the reconnecting allocs were not included there, but now they are (if not stopped), so if we increment `Ignore` during the reconnect reconciliation we will double-count the reconnecting allocs.
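
A tiny runnable sketch of that double-count (hypothetical names, not the reconciler's code): once kept reconnecting allocs remain in the untainted set, counting them during reconnect handling and again in the later untainted pass inflates the total:

```go
package main

import "fmt"

// Sketch of the miscount: kept reconnecting allocs now remain in the
// untainted set, and the computeUpdates-style pass already counts every
// untainted alloc it leaves alone as Ignore, so incrementing Ignore for the
// same allocs during reconnect handling counts them twice.
func main() {
	untainted := []string{"reconnecting-alloc", "other-alloc"} // reconnecting alloc kept in untainted
	reconnectKept := []string{"reconnecting-alloc"}

	ignore := uint64(0)
	ignore += uint64(len(reconnectKept)) // premature increment during reconnect handling
	ignore += uint64(len(untainted))     // later count over the untainted set

	fmt.Println(ignore) // 3, even though only 2 allocs are actually ignored
}
```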

Comment on lines -1079 to -1083
remove--
// if we've removed all we need to, stop iterating and return.
if remove == 0 {
return remove
}
lgfa29 (Contributor, Author):

I think this may have been a bug as well, where `remove` was miscalculated so we were only stopping a fixed number of replacements, not all of them. But I didn't investigate further since we don't do this anymore.

Member:

Good call: removing this logic lets computeStop do its job to reduce the total count if we need to, without any spooky action at a distance.

Comment on lines +1188 to +1190
replacementHasBetterScore := originalMaxScoreMeta == nil && replacementMaxScoreMeta != nil ||
(originalMaxScoreMeta != nil && replacementMaxScoreMeta != nil &&
replacementMaxScoreMeta.NormScore > originalMaxScoreMeta.NormScore)
lgfa29 (Contributor, Author):

The existing implementation skips allocations without metrics, which I think may also be a source of inconsistent state. I don't know when MaxNormScore could be nil, but I turned those errors into a "best effort" check by preferring the original alloc unless enough metrics are available.
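
For reference, a self-contained version of that best-effort comparison (with a minimal stand-in for the score metadata, not Nomad's actual types) behaves like this:

```go
package main

import "fmt"

// Minimal stand-in for the node score metadata; nil means "no metrics".
type scoreMeta struct{ NormScore float64 }

// replacementHasBetterScore mirrors the best-effort check: the replacement
// only wins when there are enough metrics to compare (or the original has
// none at all); in every other case we fall back to keeping the original.
func replacementHasBetterScore(original, replacement *scoreMeta) bool {
	return original == nil && replacement != nil ||
		(original != nil && replacement != nil &&
			replacement.NormScore > original.NormScore)
}

func main() {
	fmt.Println(replacementHasBetterScore(nil, &scoreMeta{NormScore: 0.8}))                      // true
	fmt.Println(replacementHasBetterScore(&scoreMeta{NormScore: 0.9}, nil))                      // false: keep original
	fmt.Println(replacementHasBetterScore(&scoreMeta{NormScore: 0.5}, &scoreMeta{NormScore: 1})) // true
}
```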

Member:

Because these allocations are always existing allocations that have been through the whole scheduler once already, I suspect `NormScore` is only ever nil in test scenarios and that was put in to catch panics caused by bad test setup. I don't think it makes much of a difference and this code seems a little nicer. 👍

Comment on lines +1192 to +1196
// Check if the replacement has better client status.
// Even with a better placement score make sure we don't replace a running
// allocation with one that is not.
replacementIsRunning := replacement.ClientStatus == structs.AllocClientStatusRunning
originalNotRunning := original.ClientStatus != structs.AllocClientStatusRunning
lgfa29 (Contributor, Author):

This is a new rule I added to prevent a replacement if the original alloc is still running but the replacement is not, even if it has a better placement metric.

I first added it as an attempt to fix the original bug, but it turned out the problem was something else. So I don't think this is strictly necessary, but it kind of makes sense to not replace a running alloc with one that isn't, no?

Member:

There are three client states to worry about here: pending (for the replacement only), running, and terminal. We also have three checks in the caller that influence whether we ever even call this function: the `reconnecting.filterByFailedReconnect()` check (line 1082R), the `stopReconnecting` check (line 1090R), and the `replacementAlloc.ServerTerminalStatus()` check (line 1131R). That gives us the following table:

| original client status | replacement client status | replacement server status | what to do? | done in this PR? |
|---|---|---|---|---|
| running | running | run | keep highest score, stop the other | ✔️ |
| running | running | stop | keep original, stop replacement | ✔️ |
| running | pending | run | keep highest score, stop the other | ✔️ |
| running | pending | stop | keep original, stop replacement | ✔️ |
| running | terminal | run | keep original, stop replacement | ✔️ |
| running | terminal | stop | keep original, stop replacement | ✔️ |
| terminal | running | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | running | stop | stop original, reconcile replacement | ✔️ |
| terminal | pending | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | pending | stop | stop original, reconcile replacement | ✔️ |
| terminal | terminal | run | stop original, reconcile replacement | ✔️ |
| terminal | terminal | stop | stop original, reconcile replacement | ✔️ |

So it looks like the logic is correct except for maybe the case where the original is filtered by `filterByFailedReconnect` and the replacement is non-terminal. I'm not sure what happens to the replacement in that case -- does it just fall through to the rest of the reconciler? In any case, I had to build a big ol' truth table to verify that for myself because the logic is split between this method and its caller. It's kinda hard to follow. Maybe we should move all these client status checks into the caller, so that we only ever call this method if we have two legitimate options?
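
If it helps, the caller-level part of this table can be sketched as a small decision function (illustrative names only, not the reconciler's code):

```go
package main

import "fmt"

// Illustrative encoding of the table above, capturing only the caller-level
// filtering: a client-terminal original is always stopped, a terminal or
// server-stopped replacement never wins, and everything else falls through
// to the score comparison.
type decision string

const (
	keepByScore  decision = "keep highest score, stop the other"
	keepOriginal decision = "keep original, stop replacement"
	stopOriginal decision = "stop original, reconcile replacement"
)

func decide(originalClientTerminal, replacementClientTerminal, replacementServerStop bool) decision {
	switch {
	case originalClientTerminal:
		return stopOriginal
	case replacementClientTerminal || replacementServerStop:
		return keepOriginal
	default:
		return keepByScore
	}
}

func main() {
	fmt.Println(decide(false, false, false)) // running original vs running/pending replacement
	fmt.Println(decide(false, true, false))  // replacement already terminal on the client
	fmt.Println(decide(true, false, false))  // original failed to reconnect
}
```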

lgfa29 (Contributor, Author):

> I had to build a big ol' truth table to verify that for myself

Yup, I had to do the same thing, which should've been a signal that this logic is confusing 😅

But I looked into it again, and I'm not sure if we can move this ClientStatus check to the caller because it's an awkward conditional tie-breaker.

The way I approached this was using a process of elimination. There are three states in which we can safely stop the reconnecting allocation without needing to compare it with its replacements:

  • desired status not run (so stop or evict): handled by stopReconnecting, but now I'm wondering if we even need to stop in this case as opposed to just skip these allocs 🤷
  • client status failed: handled by filterByFailedReconnect, but we probably don't need this anymore and just handle it in the main loop.

That leaves five possible client statuses for the original alloc:

  • lost and unknown: I don't think these are possible for a reconnecting alloc? I think lost is only used for groups without max_client_disconnect (which is by definition not the case for allocations that are reconnecting), and the allocation would not be in the reconnecting set if it was unknown, so we can probably ignore these.
  • complete: probably a no-op? Or maybe we still need to check and stop replacements? But only if not batch? Not sure, this seems a bit undefined 😅
  • pending and running: we need to compare with its replacements. If the replacement is newer we keep it, done. If the replacement has better score we need to check the client status.

And this is where the awkwardness lives. Outside this function we can't make a decision before comparing them and we don't know if the replacement was picked because it had a higher version or better placement (well, we can check again, but that would be kind of redundant).

> I'm not sure what happens to the replacement in that case -- does it just fall through to the rest of the reconciler?

That's right. My assumption here is the idea of merging timelines. When the client reconnects we assume the replacement allocations were being handled by the reconciler like any other allocation, so if we don't need to stop them we just let the reconciler keep managing them as it has been.

tgross (Member) commented Mar 23, 2023:

> client status failed: handled by filterByFailedReconnect, but we probably don't need this anymore and just handle it in the main loop.

The filterByFailedReconnect should pick up the complete case as well because that's terminal.

> pending and running: we need to compare with its replacements. If the replacement is newer we keep it, done. If the replacement has better score we need to check the client status.

I probably could have explained it better, but I think we only need to compare if the replacement is running. So in other words, the check we're doing here could come before we compare scores and return the original early. (And at that point, it could probably be hoisted up to the caller instead.)

lgfa29 (Contributor, Author):

> The filterByFailedReconnect should pick up the complete case as well because that's terminal.

It only looks for allocations that are not server-terminal and are explicitly failed, so I think complete is not handled.

if !alloc.ServerTerminalStatus() && alloc.ClientStatus == structs.AllocClientStatusFailed {

> I think we only need to compare if the replacement is running

...and has the same job version.

A new job version always wins because we don't want to keep outdated allocs, even if they are the original ones. So the client status check can be done before the score, but it can't be done before the job version check.
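
In other words, the ordering being described is roughly the following (a hedged sketch with simplified stand-in fields, not Nomad's structs.Allocation):

```go
package main

import "fmt"

// Simplified stand-in for the fields that matter in this discussion.
type alloc struct {
	JobVersion uint64
	Running    bool
	NormScore  float64
}

// pick sketches the tie-breaking order: a newer job version always wins,
// then a running original is never traded for a non-running replacement,
// and only then does the placement score decide.
func pick(original, replacement alloc) alloc {
	if replacement.JobVersion > original.JobVersion {
		return replacement
	}
	if original.Running && !replacement.Running {
		return original
	}
	if replacement.NormScore > original.NormScore {
		return replacement
	}
	return original
}

func main() {
	original := alloc{JobVersion: 1, Running: true, NormScore: 0.5}
	replacement := alloc{JobVersion: 1, Running: false, NormScore: 0.9}
	// The score never beats a running original with the same job version.
	fmt.Println(pick(original, replacement) == original) // true
}
```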

Member:

Ah, good catch. Yeah it seems like there's no way to hoist this up.

@@ -5490,14 +5535,15 @@ func TestReconciler_Disconnected_Client(t *testing.T) {
},
},
{
name: "stop-original-alloc-with-old-job-version-and-failed-replacements",
name: "stop-original-alloc-with-old-job-version-and-failed-replacements-replaced",
lgfa29 (Contributor, Author):

I thought this test was redundant because, in the end, the replacement of the replacement was running, but it exercises the path where a reconnecting alloc has multiple replacements.

nextReplacement.PreviousAllocation = replacement.ID

replacement.NextAllocation = nextReplacement.ID
replacement.DesiredStatus = structs.AllocDesiredStatusStop
lgfa29 (Contributor, Author):

This was missing but I'm pretty sure we stop the previous allocation once it has been replaced 🤔


// Find replacement allocations and decide which one to stop. A
// reconnecting allocation may have multiple replacements.
for _, replacementAlloc := range others {
lgfa29 (Contributor, Author):

I think I also need to check that `replacementAlloc.CreateIndex > reconnectingAlloc.CreateIndex`; otherwise, in a scenario where both the original and one of its replacements are reconnecting and they tie in all selection criteria, one alloc will stop the other because they will both think they are the original.
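
Something like the following guard, sketched here with illustrative local types rather than the real *structs.Allocation, would keep two reconnecting allocs from each treating the other as their replacement:

```go
package main

import "fmt"

// Illustrative local type; the real loop works over Nomad's *structs.Allocation.
type alloc struct {
	ID          string
	CreateIndex uint64
}

// replacementsOf only treats allocations created after the reconnecting one
// as its replacements, so two reconnecting allocs that tie on every other
// criterion can never each decide to stop the other.
func replacementsOf(reconnecting alloc, others []alloc) []alloc {
	var out []alloc
	for _, other := range others {
		if other.CreateIndex <= reconnecting.CreateIndex {
			continue // created before (or with) the reconnecting alloc: not a replacement
		}
		out = append(out, other)
	}
	return out
}

func main() {
	original := alloc{ID: "original", CreateIndex: 10}
	replacement := alloc{ID: "replacement", CreateIndex: 20}

	fmt.Println(len(replacementsOf(original, []alloc{replacement}))) // 1
	fmt.Println(len(replacementsOf(replacement, []alloc{original}))) // 0
}
```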

lgfa29 (Contributor, Author):

Yup! Fixed in 0b97e34.

tgross (Member) left a comment:

This looks great overall. I've left a large and rambling comment about the legibility of the status checks we're doing that you might want to consider.

If the replacement for a reconnecting allocation is also reconnecting, we
need to make sure we only compare the original with the replacement, and
not the other way around; otherwise the replacement may stop the
original if they tie in the selection criteria.
tgross (Member) left a comment:

LGTM, nice work @lgfa29!

Since we are now already iterating over the reconnecting allocations in
a specific method, having a separate loop to find failed allocations is
unnecessary.
lgfa29 (Contributor, Author) commented Mar 24, 2023:

Doing some more edge case testing, I thought I had broken canary deployments, but it turned out to be an existing bug, so I filed #16644.

I also removed `filterByFailedReconnect` in 129dda0 to keep the allocation selection logic closer together, as discussed in #16609 (comment).

As for allocations that are complete on reconnect, the `pickReconnectingAlloc` logic is applied, which I think is the right thing to do. If you have a batch job with `max_client_disconnect`, you should expect it to run multiple times due to replacement allocations.

Successfully merging this pull request may close these issues.

New behavior when max_client_disconnect is used and the worker node moves from disconnected to ready