deploymentwatcher: fail early whenever possible #17341

Merged: 9 commits merged into hashicorp:main from 17260/fail-early-deployments on Jun 26, 2023

Conversation

@nicoche (Contributor) commented on May 29, 2023

Given a deployment that has a progress_deadline, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the progress_deadline is reached.

See #17260
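
To make the intent concrete, the check roughly boils down to the sketch below. The type and function names are hypothetical stand-ins for illustration only; the actual change lives in the deploymentwatcher and structs packages reviewed further down.

package main

import (
	"fmt"
	"time"
)

// reschedulePolicy mirrors the fields of a job's reschedule block that matter
// here (hypothetical stand-in, not Nomad's real struct).
type reschedulePolicy struct {
	Unlimited bool
	Attempts  int
}

// shouldFailEarly is an illustrative sketch: once a failed task group has no
// reschedule attempts left, the deployment can be failed immediately instead
// of waiting for progress_deadline to expire.
func shouldFailEarly(policy reschedulePolicy, attemptsUsed int) bool {
	if policy.Unlimited {
		// With unlimited reschedules the group may still recover, so keep
		// waiting until the progress deadline is reached.
		return false
	}
	return attemptsUsed >= policy.Attempts
}

func main() {
	progressDeadline := 15 * time.Minute
	limited := reschedulePolicy{Unlimited: false, Attempts: 1}

	// The single allowed reschedule has been consumed: fail now rather than
	// after the full 15-minute progress deadline.
	fmt.Printf("fail early: %v (instead of waiting up to %s)\n",
		shouldFailEarly(limited, 1), progressDeadline)
}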

@hashicorp-cla commented on May 29, 2023

CLA assistant check: all committers have signed the CLA.

nomad/structs/structs.go (outdated review thread, resolved)
@nicoche force-pushed the 17260/fail-early-deployments branch 2 times, most recently from 5c967a8 to 7a639a9 on May 29, 2023 10:25
@tgross (Member) left a comment

This is looking great so far @nicoche! A few things we'll want before it's ready to merge:

  • Run make cl to add a changelog entry
  • Tests: we should make sure the existing tests work (or are altered as needed) of course, but we should probably include a test that exercises these specific behaviors.

nomad/deploymentwatcher/deployment_watcher.go (outdated review thread, resolved)
nomad/structs/structs.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployment_watcher.go (outdated review thread, resolved)
nicoche added 2 commits on June 5, 2023 17:02
Given a deployment that has a progress_deadline, if a task group runs
out of reschedule attempts, allow it to fail at this time instead of
waiting until the progress_deadline is reached.

See hashicorp#17260
@nicoche force-pushed the 17260/fail-early-deployments branch from 7a639a9 to 698fa19 on June 5, 2023 15:05
@nicoche force-pushed the 17260/fail-early-deployments branch from 698fa19 to ce81550 on June 5, 2023 15:09
@nicoche force-pushed the 17260/fail-early-deployments branch from ce81550 to f0726c0 on June 5, 2023 15:10
@nicoche marked this pull request as ready for review on June 5, 2023 15:14
@nicoche (Contributor, Author) commented on Jun 5, 2023

Hey @tgross !

Thanks for your comments. I incorporated them into the PR. I haven't squashed the commits because I figured it would make it easier for you to check the changes. All of your comments should have been addressed.

By the way, did you see #17260 (comment)? Especially the part about the behavior of an update block for a first deployment?

@tgross (Member) left a comment

Hi @nicoche this is looking pretty good. I've pulled it down locally to do some testing and the typical case works fine. But I'll need to do some more testing of the cases around unlimited reschedule attempts and canaries, especially given the note I left around the unlimited case. Do you have a Nomad job you're using as your smoke test that you could share?

nomad/deploymentwatcher/deployments_watcher_test.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployments_watcher_test.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployments_watcher_test.go (outdated review thread, resolved)
nomad/structs/structs.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployment_watcher.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployment_watcher.go (outdated review thread, resolved)
nomad/deploymentwatcher/deployment_watcher.go (outdated review thread, resolved)
@nicoche (Contributor, Author) commented on Jun 14, 2023

Hello @tgross

Thanks for the feedback; you caught a lot of blunders on my part 🤦

> especially given the note I left around the unlimited case

I applied your suggestion. I think that it mostly solves the issue.

> Do you have a Nomad job you're using as your smoke test that you could share?

I'll try the whole thing with this new version and get back to you with the specs that I'm using.

@nicoche requested a review from tgross on June 14, 2023 16:21
@nicoche (Contributor, Author) commented on Jun 14, 2023

Here are a few manifests that I tried which had the expected behavior:

A job with unlimited reschedules and a progress deadline will reschedule many times until hitting the progress deadline:
job "broken-unlimited-reschedules" {
  type = "service"

  update {
    max_parallel      = 1
    auto_promote      = true
    canary            = 1
    progress_deadline = "180s"
    healthy_deadline  = "179s"
  }

  reschedule {
    attempts       = 0
    interval       = 0
    delay          = "5s"
    delay_function = "constant"
    unlimited      = true
  }

  group "some-broken-group" {
    count = 1

    restart {
      delay = "1s"
      mode  = "fail"
    }

    task "some-broken-task" {
      driver = "docker"

      config {
        image = "redis:zzzzzzzzzz"
      }
    }
  }
}
A job with limited reschedules and a progress deadline will be stopped after hitting the max number of reschedules:
job "broken-finite-reschedules" {
  type = "service"

  update {
    max_parallel      = 1
    auto_promote      = true
    canary            = 1
    progress_deadline = "15m"
  }

  reschedule {
    attempts       = 1
    interval       = "24h"
    delay          = "20s"
    delay_function = "constant"
    unlimited      = false
  }

  group "some-broken-group" {
    count = 1

    restart {
      delay = "1s"
      mode  = "fail"
    }

    task "some-broken-task" {
      driver = "docker"

      config {
        image = "redis:qqqqqqqqqq"
      }
    }
  }
}

I just tested canaries the following way:

  • I deployed a working redis job, everything OK
  • I deployed a new version with update.canary=1, update.progress_deadline=180s, reschedule.unlimited=true and edited the image version to something that does not exist
  • I validated that the deployment ended in Error and the job was still running the latest stable version

Then, I did the same but with update.canary=1, update.progress_deadline=15m, reschedule.unlimited=false, reschedule.attempts=1, and validated the same thing, except that the deployment failed after running out of reschedule attempts.
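
As a rough sanity check on the unlimited case: with the constant 5s reschedule delay and the 180s progress_deadline from the first job above, the broken canary gets at most on the order of 180 / 5 = 36 reschedule attempts before the deadline finally fails the deployment. A back-of-the-envelope sketch (it ignores the time spent placing allocations and pulling the image, so real runs see fewer attempts):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the broken-unlimited-reschedules job above.
	progressDeadline := 180 * time.Second
	rescheduleDelay := 5 * time.Second // delay_function = "constant"

	// Upper bound on reschedule attempts before the progress deadline fails
	// the deployment; real runs see fewer because each attempt also spends
	// time placing the allocation and failing the task.
	maxAttempts := int(progressDeadline / rescheduleDelay)
	fmt.Printf("at most ~%d reschedules before the %s deadline\n",
		maxAttempts, progressDeadline)
}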

@nicoche (Contributor, Author) commented on Jun 26, 2023

Hey @tgross !

Is there any way I can help move this forward? I understand that you might want to do more tests; let me know if I can perform them on your behalf or add specific automated tests to the PR.

@tgross (Member) commented on Jun 26, 2023

Hi @nicoche! Sorry about that; I've been busy trying to get a few things landed for the 1.6.0 beta. I'd like to include this PR as well, so I'll try to give it another pass today so we can get it merged.

@tgross added this to the 1.6.0 milestone on Jun 26, 2023
@tgross (Member) left a comment

Thanks for your patience on this @nicoche. I've run through a bunch of bench testing just to see if we've missed any edge cases, and it looks like you've got us well covered here. I'm going to merge this, and it'll ship in the Nomad 1.6.0-beta, which should be shipping very soon.

Thanks for the PR!

@tgross merged commit a9135bc into hashicorp:main on Jun 26, 2023
@nicoche deleted the 17260/fail-early-deployments branch on June 27, 2023 08:52
@nicoche (Contributor, Author) commented on Jun 27, 2023

Hey @tgross ! Thanks for the thorough review and for pushing this through 🙂
