Add a new parameter to avoid starting a replacement for lost allocs #19101

Merged
merged 50 commits on Dec 6, 2023

Conversation

@Juanadelacuesta (Member) commented on Nov 16, 2023

This PR introduces the parameter preventRescheduleOnLost, which indicates that the task group cannot afford to have multiple instances running at the same time. When a node goes down, its allocations are registered as unknown, but no replacements are rescheduled. If the lost node comes back up, the allocs reconnect and continue to run.

If max_client_disconnect is also enabled and the group has a reschedule policy, an error is returned.
Implements issue #10366

@Juanadelacuesta Juanadelacuesta marked this pull request as draft November 16, 2023 11:02
@Juanadelacuesta Juanadelacuesta changed the title func: add new configuration to consider task as failed if lost [gh-103660] Nov 27, 2023
@lgfa29 (Contributor) left a comment

Leaving some early comments here; I will need some extra brain juice to reason about the reconciler part 😄

nomad/plan_apply.go (outdated comment, resolved)
nomad/structs/structs.go (outdated comment, resolved)
Comment on lines 11038 to 11045
if a.Job != nil {
tg := a.Job.LookupTaskGroup(a.TaskGroup)
if tg != nil {
return tg.AvoidRescheduleOnLost
}
}

return false
Contributor

Have you hit a panic where a.Job was nil?

We denormalize the allocation in some parts of the scheduler and remove Job from the allocation. I suspect this may be the case for isValidForLostNode() because it acts on the plan, so we may be incorrectly returning false here.

Member Author

I have not seen it yet; it returns the default value. I'm not sure what to do if the alloc does not have the job.
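
For illustration, one option would be to surface the denormalized case to the caller instead of silently falling back to the default. The following is a minimal sketch, not this PR's implementation: the method signature and error messages are hypothetical, while the field and lookup names follow the snippet quoted above.

func (a *Allocation) avoidRescheduleOnLost() (bool, error) {
    // The plan applier can denormalize allocations and strip the Job,
    // so a nil Job does not necessarily mean the group opted out.
    if a.Job == nil {
        return false, fmt.Errorf("alloc %q has no job attached", a.ID)
    }

    tg := a.Job.LookupTaskGroup(a.TaskGroup)
    if tg == nil {
        return false, fmt.Errorf("task group %q not found in job %q", a.TaskGroup, a.Job.ID)
    }

    return tg.AvoidRescheduleOnLost, nil
}

The caller would then have to decide explicitly what a missing job means in its context, rather than inheriting a silent false.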

@lgfa29 (Contributor) left a comment

Great job getting this implemented, it's such a tricky part of the code 😅

I left a few docs suggestions and also some food for thought on potentially rethinking and refactoring this part of the code and the behaviours we intend, but that's not something we should do here.

One thing to double-check: a block like this will create replacements even with attempts = 0, but it seems like the intention is to disallow the use of this new feature in this situation?

reschedule {
  attempts  = 0
  unlimited = true
}

nomad/structs/structs.go (outdated comment, resolved)
nomad/structs/structs.go (outdated comment, resolved)
Comment on lines +4676 to +4681
if tg.MaxClientDisconnect != nil &&
tg.ReschedulePolicy.Attempts > 0 &&
tg.PreventRescheduleOnLost {
err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be enabled when reschedule.attempts > 0")
mErr.Errors = append(mErr.Errors, err)
}
Contributor

I'm still not sure about this validation, as it seems to mix two scenarios. For example, a group configured like this:

group "..." {
  max_client_disconnect      = "12h"
  prevent_reschedule_on_lost = true
  
  reschedule {
    attempts = 10
  }
}

As a user, I think my intention in writing this job would be something like this:

  • Try to reschedule my allocations at least 10 times if they fail.
  • If a node running an alloc for this group misses heartbeats, consider it disconnected instead of down for 12h as it may come back up, but don't create a replacement allocation. After the 12h consider the allocation as lost and create a replacement.

So in a scenario where this group has 3 allocs, if a node misses its heartbeats I would expect there to still be 3 allocs, 2 running and 1 unknown. After 12h, if the node doesn't reconnect, it will be considered down and the alloc lost, and a replacement is created, resulting in 4 allocs: 3 running and 1 lost.

This is consistent with the documented behaviour of the reschedule block:

Nomad will attempt to schedule the allocation on another node if any of its task statuses become failed.

Since the alloc is unknown, none of the tasks should be in failed state, so the reschedule block doesn't apply. By forcing reschedule.attempts to be 0 we're preventing users from being able to handle allocation failures.

In hindsight, I think the confusion started with max_client_disconnect, where the docs mention:

Replacement allocations will be scheduled according to the allocations' reschedule policy until the disconnected client reconnects.

This conflates cluster state (a node misses heartbeat) with allocation state (a task fails), making it impossible to properly handle both cases as any combination of them is valid.

If we were to refactor this part of the code I think I would create a new block to disentangle these two scenarios. reschedule is applied when allocations fail, and a new disconnected (or a better name 😅) block to handle nodes missing heartbeats.

This new block could then have configurations like these:

  • Create a replacement for unknown allocs but try to reconnect the original alloc if the node comes back within 12h (equivalent to today's max_client_disconnect).

    disconnected {
      replace_unknown     = true
      max_disconnect_time = "12h"
    }
  • Never create a replacement for unknown allocs (I think this is what we're trying to accomplish here with prevent_reschedule_on_lost).

    disconnected {
      replace_unknown = false
    }
  • Wait 12h before creating a replacement for unknown allocs (I don't think this is currently possible to express, even with this PR, and I think it's a valid use case).

    disconnected {
      replace_unknown     = false
      max_disconnect_time = "12h"
    }

Parallel to all of these configs, I could set different reschedule policies to handle allocations that fail. Regardless of client status, I probably do want to reschedule them (or maybe I don't 😄).

This is a lot to change and would require proper planning and scoping, but until then I think the relationship between all of these flags can get confusing.

For this PR specifically, if the intention is to prevent the use of max_client_disconnect and prevent_reschedule_on_lost with a reschedule policy that allows new allocations to be created, then I think we also need to check the value of reschedule.unlimited, as it's often used with reschedule.attempts = 0 but does create replacements.
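
A rough sketch of what that extended check could look like, assuming the field names from the validation snippet quoted above (and adding a nil guard on the policy); this is an illustration of the suggestion, not the PR's final code:

if tg.MaxClientDisconnect != nil &&
    tg.PreventRescheduleOnLost &&
    tg.ReschedulePolicy != nil &&
    (tg.ReschedulePolicy.Attempts > 0 || tg.ReschedulePolicy.Unlimited) {
    // Treat the policy as "creates replacements" when it has attempts left
    // or is unlimited, and reject the combination in both cases.
    err := fmt.Errorf("max_client_disconnect and prevent_reschedule_on_lost cannot be enabled when the reschedule policy allows replacements")
    mErr.Errors = append(mErr.Errors, err)
}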

scheduler/util_test.go (outdated comment, resolved)
website/content/docs/job-specification/group.mdx (outdated comment, resolved)
To modify the allocation behaviour on the client, see
[`stop_after_client_disconnect`](#stop_after_client_disconnect) .

Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
Contributor

Suggested change
Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the same

Comment on lines 78 to 80
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
Contributor

Suggested change
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost = true`, allocations on disconnected nodes remain with status
`unknown` until the `max_client_disconnect` window expires, at which point

Comment on lines 81 to 82
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
Contributor

Suggested change
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
the node transitions from `disconnected` to `down`. The allocation
remains in the `unknown` status and is not rescheduled.

website/content/docs/job-specification/group.mdx (outdated comment, resolved)
Comment on lines 76 to 82
Setting `max_client_disconnect` and `prevent_reschedule_on_lost=true` at the same
time requires that [rescheduling is disabled entirely][].
If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost=true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
Contributor

This interaction between prevent_reschedule_on_lost and max_client_disconnect should probably be documented in both places. So perhaps we should move this part to a separate section and link to it from both configs?
