-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deploymentwatcher: fail early whenever possible #17341
Conversation
5c967a8
to
7a639a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great so far @nicoche! A few things we'll want before it's ready to merge:
- Run
make cl
to add a changelog entry - Tests: we should make sure the existing tests work (or are altered as needed) of course, but we should probably include a test that exercises these specific behaviors.
Given a deployment that has a progress_deadline, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the progress_deadline is reached. See hashicorp#17260
7a639a9
to
698fa19
Compare
698fa19
to
ce81550
Compare
ce81550
to
f0726c0
Compare
Hey @tgross ! Thanks for your comments. I incorporated them into the PR. I haven't squashed the commits because I figured out that it would make things easier for you if you wanted to check the changes. They should have all been addressed. Btw, did you see #17260 (comment) ? Especially the part regarding the behavior of an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nicoche this is looking pretty good. I've pulled it down locally to do some testing and the typical case works fine. But I'll need to do some more testing of the cases around unlimited reschedule attempts and canaries, especially given the note I left around the unlimited case. Do you have a Nomad job you're using as your smoke test that you could share?
* Use shoenig/test instead of stretchr/testify * Run gofmt
Hello @tgross Thanks for the feedbacks, you caught a lot of blunders that I made 🤦
I applied your suggestion. I think that it mostly solves the issue.
I'll try the whole thing with this new version and get back to you with the specs that I'm using |
Here are a few manifests that I tried which had the expected behavior: A job with a unlimited reschedules and a progress deadline will reschedule many times until hitting the progress deadline
A job with limited reschedules and a progress deadline will be stopped after hitting the max number of reschedules
I just tested canaries the following way:
Then, I did the same but with |
Hey @tgross ! Is there any way I can help to move this forward? I hear that you might want to do more tests. Let me know if I can perform them on your behalf or add specific, automated tests to the PR |
Hi @nicoche! Sorry about that, I've been busy trying to get a few things landed for the 1.6.0 beta. I'd like to include this PR as well so I'll try to give it another pass today so we can get it merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience on this @nicoche. I've run thru a bunch of bench testing just to see if we've missed any edge cases and it looks like you've got us well covered here. I'm going to merge this and it'll ship in the Nomad 1.6.0-beta, which should be shipping very soon.
Thanks for the PR!
Hey @tgross ! Thanks for the thorough review and for pushing this through 🙂 |
Given a deployment that has a progress_deadline, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the progress_deadline is reached.
See #17260