deploymentwatcher: fail early whenever possible #17341

nicoche · 2023-05-29T10:21:06Z

Given a deployment that has a progress_deadline, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the progress_deadline is reached.

See #17260

hashicorp-cla · 2023-05-29T10:21:10Z

All committers have signed the CLA.

tgross

This is looking great so far @nicoche! A few things we'll want before it's ready to merge:

* Run make cl to add a changelog entry
* Tests: we should make sure the existing tests work (or are altered as needed) of course, but we should probably include a test that exercises these specific behaviors.

nicoche · 2023-06-05T15:18:00Z

Hey @tgross !

Thanks for your comments. I incorporated them into the PR. I haven't squashed the commits because I figured it would be easier for you to check the changes that way. All of your comments should now be addressed.

Btw, did you see #17260 (comment)? Especially the part about how an update block behaves on a first deployment?

tgross

Hi @nicoche this is looking pretty good. I've pulled it down locally to do some testing and the typical case works fine. But I'll need to do some more testing of the cases around unlimited reschedule attempts and canaries, especially given the note I left around the unlimited case. Do you have a Nomad job you're using as your smoke test that you could share?

nicoche · 2023-06-14T16:21:35Z

Hello @tgross

Thanks for the feedback, you caught a lot of blunders that I made 🤦

> especially given the note I left around the unlimited case

I applied your suggestion. I think that it mostly solves the issue.

> Do you have a Nomad job you're using as your smoke test that you could share?

I'll try the whole thing with this new version and get back to you with the specs that I'm using

nicoche · 2023-06-14T21:20:13Z

Here are a few manifests that I tried which had the expected behavior:

A job with unlimited reschedules and a progress deadline keeps rescheduling until it hits the progress deadline:

job "broken-unlimited-reschedules" {
  type = "service"

  update {
    max_parallel      = 1
    auto_promote      = true
    canary            = 1
    progress_deadline = "180s"
    healthy_deadline  = "179s"
  }

  reschedule {
    attempts       = 0
    interval       = 0
    delay          = "5s"
    delay_function = "constant"
    unlimited      = true
  }

  group "some-broken-group" {
    count = 1

    restart {
      delay = "1s"
      mode  = "fail"
    }

    task "some-broken-task" {
      driver = "docker"

      config {
        image = "redis:zzzzzzzzzz"
      }
    }
  }
}

A job with limited reschedules and a progress deadline fails its deployment as soon as it runs out of reschedule attempts, instead of waiting for the progress deadline:

job "broken-finite-reschedules" {
  type = "service"

  update {
    max_parallel      = 1
    auto_promote      = true
    canary            = 1
    progress_deadline = "15m"
  }

  reschedule {
    attempts       = 1
    interval       = "24h"
    delay          = "20s"
    delay_function = "constant"
    unlimited      = false
  }

  group "some-broken-group" {
    count = 1

    restart {
      delay = "1s"
      mode  = "fail"
    }

    task "some-broken-task" {
      driver = "docker"

      config {
        image = "redis:qqqqqqqqqq"
      }
    }
  }
}

I just tested canaries the following way:

1. I deployed a working redis job; everything was OK.
2. I deployed a new version with update.canary=1, update.progress_deadline=180s, reschedule.unlimited=true, and changed the image version to one that does not exist.
3. I validated that the deployment ended in Error and that the job was still running the latest stable version.

Then I did the same with update.canary=1, update.progress_deadline=15m, reschedule.unlimited=false, reschedule.attempts=1 and validated the same thing, except that the deployment failed after running out of reschedule attempts.

nicoche · 2023-06-26T09:10:51Z

Hey @tgross !

Is there any way I can help move this forward? I understand that you might want to run more tests. Let me know if I can perform them on your behalf or add specific automated tests to the PR.

tgross · 2023-06-26T14:20:17Z

Hi @nicoche! Sorry about that, I've been busy trying to get a few things landed for the 1.6.0 beta. I'd like to include this PR as well so I'll try to give it another pass today so we can get it merged.

tgross

Thanks for your patience on this @nicoche. I've run thru a bunch of bench testing just to see if we've missed any edge cases and it looks like you've got us well covered here. I'm going to merge this and it'll ship in the Nomad 1.6.0-beta, which should be shipping very soon.

Thanks for the PR!

nicoche · 2023-06-27T08:54:43Z

Hey @tgross ! Thanks for the thorough review and for pushing this through 🙂

nicoche force-pushed the 17260/fail-early-deployments branch 2 times, most recently from 5c967a8 to 7a639a9 Compare May 29, 2023 10:25

tgross self-requested a review June 1, 2023 14:35

tgross reviewed Jun 2, 2023

nicoche added 2 commits June 5, 2023 17:02

deploymentwatcher: fail early whenever possible

234c47c

Given a deployment that has a progress_deadline, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the progress_deadline is reached. See hashicorp#17260

deploymentwatcher: add test for early failing

0f5f6b9

nicoche force-pushed the 17260/fail-early-deployments branch from 7a639a9 to 698fa19 Compare June 5, 2023 15:05

nicoche added 2 commits June 5, 2023 17:09

Move RescheduleEligible logic to RescheduleTracker

4eec0eb

deploymentwatcher: move fast fail code in function

d2f6b1d

nicoche force-pushed the 17260/fail-early-deployments branch from 698fa19 to ce81550 Compare June 5, 2023 15:09

Add changelog entry

f0726c0

nicoche force-pushed the 17260/fail-early-deployments branch from ce81550 to f0726c0 Compare June 5, 2023 15:10

nicoche marked this pull request as ready for review June 5, 2023 15:14

nicoche requested a review from tgross June 12, 2023 07:40

tgross requested changes Jun 12, 2023

nicoche added 4 commits June 14, 2023 17:30

deploymentwatcher: fix early fail test

e69c268

* Use shoenig/test instead of stretchr/testify
* Run gofmt

deploymentwatcher: fix unnecessary indirection

b8cebae

deploymentwatcher: fix logging

ea4b792

Fix Alloc.RescheduleEligible call

f4c35b7

nicoche requested a review from tgross June 14, 2023 16:21

tgross added this to the 1.6.0 milestone Jun 26, 2023

tgross approved these changes Jun 26, 2023

tgross merged commit a9135bc into hashicorp:main Jun 26, 2023

tgross added theme/deployments type/enhancement labels Jun 26, 2023

nicoche deleted the 17260/fail-early-deployments branch June 27, 2023 08:52
