health: detect failing tasks #7383

Merged: 4 commits merged from b-health-detect-failing-tasks into master on Mar 25, 2020
Conversation

@notnoop (Contributor) commented on Mar 18, 2020:

Alternative to #7366 that is more task dependencies friendly.

Fixes a bug where an allocation is considered healthy if some of its
tasks are being restarted and, as such, their checks aren't tracked by
the Consul agent client.

Here, we fix the immediate case by ensuring that an alloc is healthy
only if tasks are running and the registered checks at the time are
healthy.

Previously, the health tracker tracked task "health" independently from
checks, which leads to problems when a task restarts. Consider the
following series of events:

  1. All tasks start running -> tracker.tasksHealthy is true.
  2. One task has unhealthy checks and gets restarted.
  3. The remaining checks are healthy -> tracker.checksHealthy is true.
  4. The tracker propagates a healthy status because tracker.tasksHealthy and
     tracker.checksHealthy are both set, even though a task is still restarting.

This change ensures that we accurately use the latest status of tasks
and checks regardless of their status changes.

It also ensures that we only consider check health after tasks are
considered healthy; otherwise we risk trusting incomplete checks.
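
A minimal sketch of that ordering constraint, using the field names from the diff hunks quoted in the review below (the Tracker struct and setCheckHealth signature here are simplified assumptions, not the full type from the client's health tracker):

```go
package sketch

// Tracker is trimmed down to the two flags discussed above; the real tracker
// carries much more state and locking.
type Tracker struct {
	tasksHealthy  bool
	checksHealthy bool
}

// setCheckHealth records Consul check health, but never lets checks mark the
// alloc healthy while tasks are unhealthy, since a restarting task's checks
// may simply be missing from the Consul agent.
func (t *Tracker) setCheckHealth(healthy bool) {
	// check health should always be false if tasks are unhealthy
	// as checks might be missing from unhealthy tasks
	t.checksHealthy = healthy && t.tasksHealthy
}
```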

This approach accommodates task dependencies well. Service jobs can have
short-lived prestart tasks that terminate before the main process runs.
These dead tasks that complete successfully will not negate health
status.

I also included some fixes here targeting task dependencies (fyi @jazzyfresh), so that successfully completed non-sidecar lifecycle tasks don't impact the allocation health.

Fixes #7320
Closes #7375


    // check health should always be false if tasks are unhealthy
    // as checks might be missing from unhealthy tasks
    t.checksHealthy = healthy && t.tasksHealthy
@notnoop (Contributor, author) commented:

This calls out for having a state machine to represent these dependent states. I would appreciate suggestions, especially ones that don't require a significant rewrite.

@schmichael (Member) replied:

Your logic seems sound even if the structure of this package is sub-optimal. 👍 to your approach.

Mahmood Ali added 2 commits on March 22, 2020:

1. Add tests to check for failing or missing service checks in consul update.
2. Fix a bug where an allocation is considered healthy if some of its tasks are being restarted and, as such, their checks aren't tracked by the Consul agent client.
@notnoop force-pushed the b-health-detect-failing-tasks branch from 2cbca7b to 314f345 on March 22, 2020 16:32
In service jobs, non-sidecar lifecycle tasks tweak the health logic a bit:
they may terminate successfully without impacting alloc health, but they fail
the alloc if they fail.

Sidecars should be treated just like normal tasks.
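
A rough sketch of how those non-sidecar lifecycle tasks can be collected into a set; the helper name and package wrapper are assumptions for illustration, while the loop condition mirrors the hunk quoted in the review below:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// lifecycleTaskNames returns the names of non-sidecar lifecycle (hook) tasks
// in a task group. Sidecars are deliberately excluded so they are tracked
// like normal tasks.
func lifecycleTaskNames(tg *structs.TaskGroup) map[string]bool {
	set := map[string]bool{}
	for _, task := range tg.Tasks {
		if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
			set[task.Name] = true
		}
	}
	return set
}
```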
@langmartin (Contributor) left a comment:

The code looks good; just the one question about task state progression. I don't have a lot of context in this code, so it would be good to get a second review.

        t.setTaskHealth(false, true)
        return
    }

    if state.State != structs.TaskStateRunning {
        if state.State == structs.TaskStatePending {
@langmartin (Contributor) commented:

Probably missing context: do tasks controlled by a restart block always go back through the pending state on their way to running?

@notnoop (Contributor, author) replied:

Yes. When a task fails, it moves to dead once it's beyond its restart policy attempts; otherwise it moves to pending until it's scheduled to run again after the restart policy delay.
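
For context, a condensed sketch of how those states can be handled at the health deadline, stitched together from the hunks quoted in this review (the function name, the dead-task reason string, and the use of state.Failed are assumptions rather than the PR's exact code):

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// taskUnhealthyAtDeadline reports whether a task's state should mark the
// alloc unhealthy. A restarted task passes back through pending, so pending
// still counts as "not running by deadline"; dead is acceptable only for a
// non-sidecar lifecycle task that did not fail.
func taskUnhealthyAtDeadline(state *structs.TaskState, isLifecycleTask bool) (reason string, unhealthy bool) {
	switch state.State {
	case structs.TaskStateRunning:
		return "", false
	case structs.TaskStatePending:
		return "Task not running by deadline", true
	case structs.TaskStateDead:
		// non-sidecar lifecycle tasks are expected to terminate and are
		// therefore healthy when dead, unless they failed
		if isLifecycleTask && !state.Failed {
			return "", false
		}
		return "Unhealthy because of dead task", true
	default:
		return "", false
	}
}
```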

@notnoop merged commit 4a27cdd into master on Mar 25, 2020
@notnoop deleted the b-health-detect-failing-tasks branch on March 25, 2020 10:30
@schmichael (Member) left a comment:

Looks good! My comments are just suggestions if you circle back to this code to write further tests or something. Not critical.


    for _, task := range t.tg.Tasks {
        if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
@schmichael (Member) commented:

We may want a helper like Connect has for this, as it's easy for people to forget the initial nil check and cause a panic: https://github.com/hashicorp/nomad/blob/v0.10.5/nomad/structs/services.go#L585
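
A sketch of what such a nil-safe helper could look like; the name is invented for illustration and is not an existing method in nomad/structs:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// IsNonSidecarLifecycle reports whether a task has a lifecycle hook and is
// not a sidecar, so callers don't have to remember the nil check on
// Lifecycle themselves.
func IsNonSidecarLifecycle(t *structs.Task) bool {
	return t != nil && t.Lifecycle != nil && !t.Lifecycle.Sidecar
}
```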


Comment on lines +69 to +71

    // lifecycleTasks is a set of tasks with lifecycle hook set and may
    // terminate without affecting alloc health
    lifecycleTasks map[string]bool
@schmichael (Member) commented:

Let's document that sidecars are explicitly excluded from this (code suggestions are disabled, sorry)
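
One way the suggested doc comment could read (a sketch only; the other Tracker fields are omitted and the committed wording may differ):

```go
type Tracker struct {
	// lifecycleTasks is the set of tasks with a lifecycle hook that may
	// terminate without affecting alloc health. Sidecar lifecycle tasks are
	// explicitly excluded; they are tracked like normal tasks.
	lifecycleTasks map[string]bool

	// other fields omitted
}
```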

    case structs.TaskStatePending:
        return "Task not running by deadline", true
    case structs.TaskStateDead:
        // hook tasks are healthy when dead successfully
@schmichael (Member) commented:

Could maybe get a bit more descriptive here? Just an idea:

    // non-sidecar lifecycle tasks are expected to terminate and therefore healthy when dead

@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked this pull request as resolved and limited conversation to collaborators on Jan 13, 2023