Add a new parameter to avoid starting a replacement for lost allocs #19101

Merged
merged 50 commits on Dec 6, 2023

Conversation

@Juanadelacuesta (Member) commented on Nov 16, 2023

This PR introduces the parameter preventRescheduleOnLost, which indicates that the task group cannot afford to have multiple instances running at the same time. When a node goes down, its allocations are registered as unknown, but no replacements are rescheduled. If the lost node comes back up, the allocs reconnect and continue to run.

If max_client_disconnect is also enabled and the group has a reschedule policy, an error is returned.
Implements issue #10366

@Juanadelacuesta Juanadelacuesta marked this pull request as draft November 16, 2023 11:02
@Juanadelacuesta Juanadelacuesta changed the title func: add new configuration to consider task as failed if lost [gh-103660] Nov 27, 2023
@lgfa29 (Contributor) left a comment

Leaving some early comments here; I will need some extra brain juice to reason about the reconciler part 😄

nomad/plan_apply.go (outdated comment, resolved)
nomad/structs/structs.go (outdated comment, resolved)
Comment on lines 11038 to 11045
if a.Job != nil {
tg := a.Job.LookupTaskGroup(a.TaskGroup)
if tg != nil {
return tg.AvoidRescheduleOnLost
}
}

return false
Contributor

Have you hit a panic where a.Job was nil?

We denormalize the allocation in some parts of the scheduler and remove Job from the allocation. I suspect this may be the case for isValidForLostNode() because it acts on the plan, so we may be incorrectly returning false here.

Member Author

I have not seen it yet; it returns the default value. I'm not sure what to do if the alloc does not have the job.
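
For illustration, one option would be to surface the denormalized case to the caller instead of silently falling back to the default. The following is a minimal sketch, not this PR's implementation: the method signature and error messages are hypothetical, while the field and lookup names follow the snippet quoted above.

func (a *Allocation) avoidRescheduleOnLost() (bool, error) {
    // The plan applier can denormalize allocations and strip the Job,
    // so a nil Job does not necessarily mean the group opted out.
    if a.Job == nil {
        return false, fmt.Errorf("alloc %q has no job attached", a.ID)
    }

    tg := a.Job.LookupTaskGroup(a.TaskGroup)
    if tg == nil {
        return false, fmt.Errorf("task group %q not found in job %q", a.TaskGroup, a.Job.ID)
    }

    return tg.AvoidRescheduleOnLost, nil
}

The caller would then have to decide explicitly what a missing job means in its context, rather than inheriting a silent false.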

@lgfa29 (Contributor) left a comment

Great job getting this implemented, it's such a tricky part of the code 😅

I left a few docs suggestions and also some food for thought on potentially rethinking and refactoring this part of the code and the behaviours we intend, but that's not something we should do here.

One thing to double-check: a block like this will create replacements even with attempts = 0, but it seems like the intention is to disallow the use of this new feature in this situation?

reschedule {
  attempts  = 0
  unlimited = true
}

nomad/structs/structs.go (outdated comment, resolved)
nomad/structs/structs.go (outdated comment, resolved)
Comment on lines +4676 to +4681
if tg.MaxClientDisconnect != nil &&
tg.ReschedulePolicy.Attempts > 0 &&
tg.PreventRescheduleOnLost {
err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be enabled when reschedule.attempts > 0")
mErr.Errors = append(mErr.Errors, err)
}
Contributor

I'm still not sure about this validation, as it seems to mix two scenarios. For example, a group configured like this:

group "..." {
  max_client_disconnect      = "12h"
  prevent_reschedule_on_lost = true
  
  reschedule {
    attempts = 10
  }
}

As a user, I think my intention in writing this job would be something like this:

  • Try to reschedule my allocations at least 10 times if they fail.
  • If a node running an alloc for this group misses heartbeats, consider it disconnected instead of down for 12h as it may come back up, but don't create a replacement allocation. After the 12h consider the allocation as lost and create a replacement.

So in a scenario where this group has 3 allocs, if a node misses its heartbeats I would expect there to still be 3 allocs, 2 running and 1 unknown. After 12h, if the node doesn't reconnect, it will be considered down and the alloc lost, and a replacement is created, resulting in 4 allocs: 3 running and 1 lost.

This is consistent with the documented behaviour of the reschedule block:

Nomad will attempt to schedule the allocation on another node if any of its task statuses become failed.

Since the alloc is unknown, none of the tasks should be in failed state, so the reschedule block doesn't apply. By forcing reschedule.attempts to be 0 we're preventing users from being able to handle allocation failures.

In hindsight, I think the confusion started with max_client_disconnect, where the docs mention:

Replacement allocations will be scheduled according to the allocations' reschedule policy until the disconnected client reconnects.

This conflates cluster state (a node misses heartbeat) with allocation state (a task fails), making it impossible to properly handle both cases as any combination of them is valid.

If we were to refactor this part of the code I think I would create a new block to disentangle these two scenarios. reschedule is applied when allocations fail, and a new disconnected (or a better name 😅) block to handle nodes missing heartbeats.

This new block could then have configurations like these:

  • Create a replacement for unknown allocs but try to reconnect the original alloc if the node comes back within 12h (equivalent to today's max_client_disconnect).

    disconnected {
      replace_unknown     = true
      max_disconnect_time = "12h"
    }
  • Never create a replacement for unknown allocs (I think this is what we're trying to accomplish here with prevent_reschedule_on_lost).

    disconnected {
      replace_unknown = false
    }
  • Wait 12h before creating a replacement for unknown allocs (I don't think this is currently possible to express, even with this PR, and I think it's a valid use case).

    disconnected {
      replace_unknown     = false
      max_disconnect_time = "12h"
    }

Parallel to all of these configs, I could set different reschedule policies to handle allocations that fail. Regardless of client status, I probably do want to reschedule them (or maybe I don't 😄).

This is a lot to change and would require proper planning and scoping, but until then I think the relationship between all of these flags can get confusing.

For this PR specifically, if the intention is to prevent the use of max_client_disconnect and prevent_reschedule_on_lost with a reschedule policy that allows new allocations to be created, then I think we also need to check the value of reschedule.unlimited, as it's often used with reschedule.attempts = 0 but does create replacements.
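
A rough sketch of what that extended check could look like, assuming the field names from the validation snippet quoted above (and adding a nil guard on the policy); this is an illustration of the suggestion, not the PR's final code:

if tg.MaxClientDisconnect != nil &&
    tg.PreventRescheduleOnLost &&
    tg.ReschedulePolicy != nil &&
    (tg.ReschedulePolicy.Attempts > 0 || tg.ReschedulePolicy.Unlimited) {
    // Treat the policy as "creates replacements" when it has attempts left
    // or is unlimited, and reject the combination in both cases.
    err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be enabled when the reschedule policy allows replacements")
    mErr.Errors = append(mErr.Errors, err)
}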

scheduler/util_test.go (outdated comment, resolved)
website/content/docs/job-specification/group.mdx (outdated comment, resolved)
To modify the allocation behaviour on the client, see
[`stop_after_client_disconnect`](#stop_after_client_disconnect) .

Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
Contributor

Suggested change
Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the same

Comment on lines 78 to 80
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
Contributor

Suggested change
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost = true`, allocations on disconnected nodes remain with status
`unknown` until the `max_client_disconnect` window expires, at which point

Comment on lines 81 to 82
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
Contributor

Suggested change
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
the node transitions from `disconnected` to `down`. The allocation
remains in the `unknown` status and is not rescheduled.

website/content/docs/job-specification/group.mdx (outdated comment, resolved)
Comment on lines 76 to 82
Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
time requires that [rescheduling is disabled entirely][].
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
Contributor

This interaction between prevent_reschedule_on_lost and max_client_disconnect should probably be documented in both places. So perhaps we should move this part to a separate section and link to it from both configs?
