health: detect failing tasks #7383
// check health should always be false if tasks are unhealthy
// as checks might be missing from unhealthy tasks
t.checksHealthy = healthy && t.tasksHealthy
This calls for a state machine to represent these dependent states. I would appreciate suggestions, especially ones that don't require a significant rewrite.
Your logic seems sound even if the structure of this package is sub-optimal. 👍 to your approach.
2e0c0dd to 2cbca7b
Add tests to check for failing or missing service checks in consul update.
Fixes a bug where an allocation is considered healthy if some of the tasks are being restarted and, as such, their checks aren't tracked by the Consul agent client.

Here, we fix the immediate case by ensuring that an alloc is healthy only if tasks are running and the registered checks at the time are healthy.

Previously, the health tracker tracked task "health" independently from checks, which led to problems when a task restarted. Consider the following series of events:

1. all tasks start running -> `tracker.tasksHealthy` is true
2. one task has unhealthy checks and gets restarted
3. remaining checks are healthy -> `tracker.checksHealthy` is true
4. propagate health status now that both `tracker.tasksHealthy` and `tracker.checksHealthy` are true

This change ensures that we accurately use the latest status of tasks and checks regardless of their status changes. It also ensures that we only consider check health after tasks are considered healthy; otherwise we risk trusting incomplete checks.

This approach accommodates task dependencies well. Service jobs can have short-lived prestart tasks that terminate before the main process runs. Dead tasks that completed successfully will not negate health status.
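The series of events above can be replayed against a minimal sketch of the gated flags. The `tracker` type here is a simplified stand-in, not the actual tracker from this package: the real `setTaskHealth` in the diff takes two arguments, and `setCheckHealth` and `allocHealthy` are names assumed for illustration.

```go
package main

import "fmt"

// tracker is a simplified stand-in for the health tracker discussed above.
// Only the two flags named in the commit message are modelled.
type tracker struct {
	tasksHealthy  bool
	checksHealthy bool
}

// setTaskHealth records the latest task health.
// (Assumption: the real method in the diff takes a second argument.)
func (t *tracker) setTaskHealth(healthy bool) {
	t.tasksHealthy = healthy
}

// setCheckHealth records check health, gated on task health: checks may be
// missing for tasks that are not running, so a "healthy" report is only
// trusted while tasks are healthy.
func (t *tracker) setCheckHealth(healthy bool) {
	t.checksHealthy = healthy && t.tasksHealthy
}

// allocHealthy reports whether both flags currently agree on healthy.
func (t *tracker) allocHealthy() bool {
	return t.tasksHealthy && t.checksHealthy
}

func main() {
	tr := &tracker{}
	tr.setTaskHealth(true)         // 1. all tasks start running
	tr.setTaskHealth(false)        // 2. one task is restarted
	tr.setCheckHealth(true)        // 3. remaining checks report healthy
	fmt.Println(tr.allocHealthy()) // false: the report was gated

	tr.setTaskHealth(true)         // the restarted task is running again
	fmt.Println(tr.allocHealthy()) // false: no fresh check result yet

	tr.setCheckHealth(true)        // checks reported with all tasks running
	fmt.Println(tr.allocHealthy()) // true
}
```

The point of the gate is that a check result received while a task is down is never carried over as a stale "healthy" once the task comes back.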
2cbca7b to 314f345
In service jobs, non-sidecar lifecycle tasks tweak the health logic a bit: they may terminate successfully without impacting alloc health, but they fail the alloc if they fail. Sidecars should be treated just like normal tasks.
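As a rough illustration of that rule, the sketch below classifies a single task. The `taskState` type, the state constants, and `taskHealthy` are local stand-ins assumed for this example; they mirror the `structs.TaskState*` names in the diff but are not the package's actual code.

```go
package main

import "fmt"

// Local stand-ins for the structs.TaskState* constants seen in the diff.
const (
	taskStatePending = "pending"
	taskStateRunning = "running"
	taskStateDead    = "dead"
)

// taskState is a hypothetical, reduced view of a task's runtime state.
type taskState struct {
	State  string
	Failed bool
}

// taskHealthy applies the rule from the commit message: a failed task is
// never healthy, a running task is healthy, and a dead task is healthy only
// if it is a non-sidecar lifecycle task that terminated successfully.
func taskHealthy(s taskState, nonSidecarHook bool) bool {
	if s.Failed {
		return false
	}
	switch s.State {
	case taskStateRunning:
		return true
	case taskStateDead:
		return nonSidecarHook
	default:
		// Pending (or unknown) tasks are not healthy yet.
		return false
	}
}

func main() {
	done := taskState{State: taskStateDead}
	fmt.Println(taskHealthy(done, true))  // true: completed prestart hook
	fmt.Println(taskHealthy(done, false)) // false: main task or sidecar died

	failed := taskState{State: taskStateDead, Failed: true}
	fmt.Println(taskHealthy(failed, true)) // false: a failed hook fails the alloc
}
```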
314f345 to 525623c
The code looks good; I have just the one question about task state progression. I don't have a lot of context on this code, so it would be good to get a second review.
	t.setTaskHealth(false, true)
	return
}

if state.State != structs.TaskStateRunning {
	if state.State == structs.TaskStatePending {
Probably missing context: do tasks controlled by a restart block always go back through the pending state on their way to running?
Yes. When a task fails, it moves to dead once it has exhausted its restart policy attempts; otherwise, it moves to pending until it is scheduled to run again after the restart policy delay.
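A deliberately simplified sketch of that transition, for readers without context: `stateAfterFailure` and its parameters are hypothetical, and the real restart policy has more inputs (delay, interval, mode) than are modelled here.

```go
package main

import "fmt"

// Local stand-ins for the structs.TaskState* constants seen in the diff.
const (
	taskStatePending = "pending"
	taskStateDead    = "dead"
)

// stateAfterFailure returns the state a failed task moves to, assuming the
// only input is how many restarts were already used versus how many the
// restart policy allows. This is an illustration, not Nomad's restart tracker.
func stateAfterFailure(restartsUsed, restartsAllowed int) string {
	if restartsUsed >= restartsAllowed {
		// Beyond the restart policy attempts: the task is dead for good.
		return taskStateDead
	}
	// Otherwise the task waits in pending until the restart delay elapses.
	return taskStatePending
}

func main() {
	fmt.Println(stateAfterFailure(0, 2)) // pending
	fmt.Println(stateAfterFailure(2, 2)) // dead
}
```

This is why the health tracker can treat pending as "not running yet" rather than as a terminal state.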
Looks good! My comments are just suggestions if you circle back to this code to write further tests or something. Not critical.
for _, task := range t.tg.Tasks {
	if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
We may want a helper like connect has for this as it's easy for people to forget the initial nil check and cause a panic: https://github.com/hashicorp/nomad/blob/v0.10.5/nomad/structs/services.go#L585
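One possible shape for such a helper is a method that tolerates nil receivers and nil fields, so callers cannot forget the check. The `Task` and `TaskLifecycle` types below are cut-down stand-ins, and `IsNonSidecarHook` is a hypothetical name, not an existing method in the codebase.

```go
package main

import "fmt"

// TaskLifecycle and Task are cut-down stand-ins for the job structs; only
// the fields used in the diff are included.
type TaskLifecycle struct {
	Hook    string
	Sidecar bool
}

type Task struct {
	Name      string
	Lifecycle *TaskLifecycle
}

// IsNonSidecarHook is a hypothetical nil-safe helper. Go allows calling a
// pointer-receiver method on a nil pointer, so the nil checks live here
// instead of at every call site.
func (t *Task) IsNonSidecarHook() bool {
	return t != nil && t.Lifecycle != nil && !t.Lifecycle.Sidecar
}

func main() {
	var missing *Task
	fmt.Println(missing.IsNonSidecarHook()) // false, no panic

	mainTask := &Task{Name: "web"}
	fmt.Println(mainTask.IsNonSidecarHook()) // false: no lifecycle block

	init := &Task{Name: "init", Lifecycle: &TaskLifecycle{Hook: "prestart"}}
	fmt.Println(init.IsNonSidecarHook()) // true

	proxy := &Task{Name: "proxy", Lifecycle: &TaskLifecycle{Hook: "prestart", Sidecar: true}}
	fmt.Println(proxy.IsNonSidecarHook()) // false: sidecars are excluded
}
```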
// lifecycleTasks is a set of tasks with lifecycle hook set and may
// terminate without affecting alloc health
lifecycleTasks map[string]bool
Let's document that sidecars are explicitly excluded from this (code suggestions are disabled, sorry)
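To make the exclusion concrete, here is a sketch of how such a set could be built from a task group, with the sidecar rule spelled out in the doc comment. The types and the `lifecycleTaskSet` function are assumptions for illustration; only the loop condition is taken from the diff.

```go
package main

import "fmt"

// TaskLifecycle and Task are cut-down stand-ins for the job structs.
type TaskLifecycle struct {
	Hook    string
	Sidecar bool
}

type Task struct {
	Name      string
	Lifecycle *TaskLifecycle
}

// lifecycleTaskSet returns the names of tasks that have a lifecycle hook set
// and may terminate without affecting alloc health. Sidecars are explicitly
// excluded: they are expected to keep running like any main task.
func lifecycleTaskSet(tasks []*Task) map[string]bool {
	set := make(map[string]bool)
	for _, task := range tasks {
		if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
			set[task.Name] = true
		}
	}
	return set
}

func main() {
	tasks := []*Task{
		{Name: "web"},
		{Name: "init", Lifecycle: &TaskLifecycle{Hook: "prestart"}},
		{Name: "proxy", Lifecycle: &TaskLifecycle{Hook: "prestart", Sidecar: true}},
	}
	set := lifecycleTaskSet(tasks)
	fmt.Println(set["init"], set["proxy"], set["web"]) // true false false
}
```

Looking up a name that is not in the map yields false, so main tasks and sidecars fall through to the normal health rules without extra checks.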
case structs.TaskStatePending:
	return "Task not running by deadline", true
case structs.TaskStateDead:
	// hook tasks are healthy when dead successfully
Could maybe get a bit more descriptive here? Just an idea:
// non-sidecar lifecycle tasks are expected to terminate and therefore healthy when dead
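For context, a self-contained sketch of a switch with this shape, using the suggested comment wording. Only the pending message comes from the diff excerpt; the function name `unhealthyAtDeadline`, its parameters, and the message returned for dead tasks are placeholders assumed for this example.

```go
package main

import "fmt"

// Local stand-ins for the structs.TaskState* constants seen in the diff.
const (
	taskStatePending = "pending"
	taskStateRunning = "running"
	taskStateDead    = "dead"
)

// unhealthyAtDeadline reports whether a task in the given state should be
// reported as unhealthy once the health deadline passes, and why.
func unhealthyAtDeadline(state string, nonSidecarHook bool) (string, bool) {
	switch state {
	case taskStatePending:
		return "Task not running by deadline", true
	case taskStateDead:
		// non-sidecar lifecycle tasks are expected to terminate and are
		// therefore healthy when dead
		if nonSidecarHook {
			return "", false
		}
		// Placeholder message, not taken from the diff.
		return "Task dead by deadline", true
	default:
		return "", false
	}
}

func main() {
	reason, unhealthy := unhealthyAtDeadline(taskStatePending, false)
	fmt.Println(reason, unhealthy) // Task not running by deadline true

	_, unhealthy = unhealthyAtDeadline(taskStateDead, true)
	fmt.Println(unhealthy) // false
}
```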
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Alternative to #7366 that is more task dependencies friendly.
Fixes a bug where an allocation is considered healthy if some of the
tasks are being restarted and as such, their checks aren't tracked by
consul agent client.
Here, we fix the immediate case by ensuring that an alloc is healthy
only if tasks are running and the registered checks at the time are
healthy.
Previously, the health tracker tracked task "health" independently from
checks, which led to problems when a task restarted. Consider the
following series of events:
1. all tasks start running -> `tracker.tasksHealthy` is true
2. one task has unhealthy checks and gets restarted
3. remaining checks are healthy -> `tracker.checksHealthy` is true
4. propagate health status now that both `tracker.tasksHealthy` and `tracker.checksHealthy` are true

This change ensures that we accurately use the latest status of tasks
and checks regardless of their status changes.
It also ensures that we only consider check health after tasks are
considered healthy; otherwise we risk trusting incomplete checks.
This approach accommodates task dependencies well. Service jobs can have
prestart short-lived tasks that will terminate before main process runs.
These dead tasks that complete successfully will not negate health
status.
I also included some fixes here targeting task dependencies (fyi/ @jazzyfresh ), so that successfully completed lifecycle non-sidecar tasks don't impact the allocation health.
Fixes #7320
Closes #7375