Add a new parameter to avoid starting a replacement for lost allocs #19101
Conversation
Leaving some early comments here, I will need some extra brain juice to reason about the reconciler part 😄
nomad/structs/structs.go (outdated):

```go
if a.Job != nil {
	tg := a.Job.LookupTaskGroup(a.TaskGroup)
	if tg != nil {
		return tg.AvoidRescheduleOnLost
	}
}

return false
```
Have you hit a panic where `a.Job` was `nil`? We denormalize the allocation in some parts of the scheduler and remove `Job` from the allocation. I suspect this may be the case for `isValidForLostNode()` because it acts on the plan, so we may be incorrectly returning `false` here.
I have not seen it yet; it returns the default value. I'm not sure what to do if the alloc does not have the job.
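For reference, here is a minimal sketch of that lookup with the nil-`Job` fallback written out as a helper. The method name and placement are illustrative rather than the exact code in this PR, and the field name follows the final `PreventRescheduleOnLost` naming used later in the diff:

```go
// Illustrative helper, not the exact code in this PR: report whether the
// alloc's task group opted out of replacements for lost allocs, falling
// back to the default (allow a replacement) when the allocation has been
// denormalized and Job is nil.
func (a *Allocation) PreventRescheduleOnLost() bool {
	if a.Job == nil {
		// Job may have been stripped during denormalization in the
		// scheduler; keep today's behaviour and allow a replacement.
		return false
	}
	if tg := a.Job.LookupTaskGroup(a.TaskGroup); tg != nil {
		return tg.PreventRescheduleOnLost
	}
	return false
}
```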
Great job getting this implemented, it's such a tricky part of the code 😅
I left a few docs suggestions and also some food for thought for potentially rethinking and refactoring this part of the code and the intended behaviours we want, but not something we should do here.
One thing to double-check is that a block like this will create replacements even with `attempts = 0`, but it seems like the intention is to disallow the use of this new feature in this situation?

```hcl
reschedule {
  attempts  = 0
  unlimited = true
}
```
```go
if tg.MaxClientDisconnect != nil &&
	tg.ReschedulePolicy.Attempts > 0 &&
	tg.PreventRescheduleOnLost {
	err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be enabled when reschedule.attempts > 0")
	mErr.Errors = append(mErr.Errors, err)
}
```
I'm still not sure about this validation, as it seems to mix two scenarios. For example, a group configured like this:

```hcl
group "..." {
  max_client_disconnect      = "12h"
  prevent_reschedule_on_lost = true

  reschedule {
    attempts = 10
  }
}
```
As a user, I think my intention in writing this job would be something like this:

- Try to reschedule my allocations at least 10 times if they fail.
- If a node running an alloc for this group misses heartbeats, consider it `disconnected` instead of `down` for 12h as it may come back up, but don't create a replacement allocation. After the 12h, consider the allocation `lost` and create a replacement.

So in a scenario where this group has 3 allocs, if a node misses its heartbeats I would expect there to still be 3 allocs, 2 `running` and 1 `unknown`. After 12h, if the node doesn't reconnect, it will be considered `down` and the alloc `lost`, and a replacement is created, resulting in 4 allocs: 3 `running` and 1 `lost`.
This is consistent with the documented behaviour of the `reschedule` block:

> Nomad will attempt to schedule the allocation on another node if any of its task statuses become `failed`.

Since the alloc is `unknown`, none of the tasks should be in the `failed` state, so the `reschedule` block doesn't apply. By forcing `reschedule.attempts` to be 0 we're preventing users from being able to handle allocation failures.

In hindsight, I think the confusion started with `max_client_disconnect`, where the docs mention:

> Replacement allocations will be scheduled according to the allocations' reschedule policy until the disconnected client reconnects.

This conflates cluster state (a node misses heartbeats) with allocation state (a task fails), making it impossible to properly handle both cases, as any combination of them is valid.
If we were to refactor this part of the code, I think I would create a new block to disentangle these two scenarios: `reschedule` is applied when allocations fail, and a new `disconnected` (or a better name 😅) block handles nodes missing heartbeats.

This new block could then have configurations like these (a rough Go struct sketch for such a block follows the list):

- Create a replacement for `unknown` allocs but try to reconnect the original alloc if the node comes back within 12h (equivalent to today's `max_client_disconnect`).

  ```hcl
  disconnected {
    replace_unknown     = true
    max_disconnect_time = "12h"
  }
  ```

- Never create a replacement for `unknown` allocs (I think this is what we're trying to accomplish here with `prevent_reschedule_on_lost`).

  ```hcl
  disconnected {
    replace_unknown = false
  }
  ```

- Wait 12h before creating a replacement for `unknown` allocs (I don't think this is currently possible to express, even with this PR, and I think it's a valid use case).

  ```hcl
  disconnected {
    replace_unknown     = false
    max_disconnect_time = "12h"
  }
  ```
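As a rough illustration only, this is how such a block might be represented as a Go struct in the jobspec; the type name, field names, and tags are hypothetical, mirror the keys proposed above, and do not exist in Nomad today:

```go
// DisconnectedConfig is a hypothetical struct for the proposed `disconnected`
// block; nothing here is existing Nomad code.
type DisconnectedConfig struct {
	// ReplaceUnknown controls whether a replacement is created for an
	// allocation that becomes unknown when its node misses heartbeats.
	ReplaceUnknown bool `hcl:"replace_unknown,optional"`

	// MaxDisconnectTime bounds how long the node may stay disconnected
	// before its allocations are considered lost (requires the "time" import).
	MaxDisconnectTime time.Duration `hcl:"max_disconnect_time,optional"`
}
```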
Parallel to all of these configs, I could set different `reschedule` policies to handle allocations that fail. Regardless of client status, I probably do want to reschedule them (or maybe I don't 😄).
This is a lot of change that would require proper planning and scoping, but until then I think the relationship between all of these flags can get confusing.

For this PR specifically, if the intention is to prevent the use of `max_client_disconnect` and `prevent_reschedule_on_lost` with a `reschedule` policy that allows new allocations to be created, then I think we also need to check the value of `reschedule.unlimited`, as it's often used with `reschedule.attempts = 0` but does create replacements.
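A minimal sketch of what that extra check could look like, following the field names used in the validation snippet above and assuming `ReschedulePolicy` is non-nil at that point:

```go
// Sketch: reject the combination whenever the reschedule policy can create
// replacements, either through attempts > 0 or through unlimited = true.
if tg.MaxClientDisconnect != nil &&
	tg.PreventRescheduleOnLost &&
	(tg.ReschedulePolicy.Attempts > 0 || tg.ReschedulePolicy.Unlimited) {
	err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be used with a reschedule policy that allows replacements")
	mErr.Errors = append(mErr.Errors, err)
}
```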
To modify the allocation behaviour on the client, see
[`stop_after_client_disconnect`](#stop_after_client_disconnect).

Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
Suggested change: `prevent_reschedule_on_lost = true` (add spaces around `=`).
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
Suggested change:

If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost = true`, allocations on disconnected nodes remain with status
`unknown` until the `max_client_disconnect` window expires, at which point
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
Suggested change:

the node transitions from `disconnected` to `down`. The allocation
remains in the `unknown` status and is not rescheduled.
Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
time requires that [rescheduling is disabled entirely][].

If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
This interaction between `prevent_reschedule_on_lost` and `max_client_disconnect` should probably be documented in both places. So perhaps we should move this part to a separate section and link to there from both configs?
This PR introduces the parameter `preventRescheduleOnLost`, which indicates that the task group can't afford to have multiple instances running at the same time. If a node goes down, its allocations will be registered as `unknown`, but no replacements will be scheduled. If the lost node comes back up, the allocs will reconnect and continue to run.

If `max_client_disconnect` is also enabled and there is a reschedule policy, an error will be returned.

Implements issue #10366