Stop already rescheduled but somehow running allocs #8886

Merged: 2 commits, Sep 15, 2020

@notnoop notnoop commented Sep 15, 2020

This is a band-aid fix for a case where an alloc has been rescheduled but is somehow left in a running state. In that case, the alloc currently keeps running uninterrupted because the scheduler removes it from consideration. It stays running even after a new job version is pushed, resulting in a mixed fleet. Today, operators must manually force-stop these "leaked" allocations.

This PR fixes the issue by having the scheduler reconsider an alloc that has been rescheduled but is still running. Previously, the check in reconciler_util.go meant that once an allocation was rescheduled, it was never examined again, because it was removed from the untainted allocations.
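To make the before/after behavior concrete, here is a minimal stand-alone sketch. It is not the actual diff: `Alloc`, `terminal`, `skipOld`, `skipNew`, and `consider` are hypothetical names, the struct is a pared-down stand-in for the real allocation type, and the "skip only if terminal" condition is my assumption about what reconsidering a still-running alloc amounts to.

```go
package main

// Alloc is a hypothetical, pared-down stand-in for an allocation.
// Only the fields relevant to this discussion are modeled.
type Alloc struct {
	ID             string
	ClientStatus   string // assumed values: "pending", "running", "complete", "failed", "lost"
	DesiredStatus  string // assumed values: "run", "stop", "evict"
	NextAllocation string // ID of the replacement alloc; empty if never rescheduled
}

// terminal is a hypothetical helper: it reports whether the alloc is
// finished, either because it was asked to stop or because the client
// reported a final status.
func (a Alloc) terminal() bool {
	switch a.DesiredStatus {
	case "stop", "evict":
		return true
	}
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// skipOld models the behavior described above: every rescheduled alloc
// is dropped from consideration, whether or not it is still running.
func skipOld(a Alloc) bool {
	return a.NextAllocation != ""
}

// skipNew models the assumed fix: a rescheduled alloc is dropped only
// when it is also terminal, so a still-running one stays visible.
func skipNew(a Alloc) bool {
	return a.NextAllocation != "" && a.terminal()
}

// consider returns the IDs of the allocs that survive the skip check,
// in input order.
func consider(allocs []Alloc, skip func(Alloc) bool) []string {
	var ids []string
	for _, a := range allocs {
		if !skip(a) {
			ids = append(ids, a.ID)
		}
	}
	return ids
}
```

With one healthy alloc, one failed-and-replaced alloc, and one running alloc that has `NextAllocation` set, `skipOld` leaves only the healthy alloc in view, while `skipNew` also keeps the leaked one, so it can be stopped or replaced like any other running alloc.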

But how did it get here

It is very unclear how an alloc can get into this state. In all of my testing so far, only failed allocs can be rescheduled, and once they are rescheduled, alloc.DesiredStatus is set to stop. So theoretically, we should never see a running allocation with NextAllocation != "".
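The invariant above can be written down as a small check, which might help when hunting for affected allocations. This is only a sketch: `allocView`, `violatesInvariant`, and `leaked` are hypothetical names, and the `"running"` status string is an assumption about how the client status is represented.

```go
package main

// allocView is a hypothetical minimal view of an allocation, holding
// only the fields the invariant talks about.
type allocView struct {
	ID             string
	ClientStatus   string // assumed: "running" while the task is live on the client
	NextAllocation string // ID of the replacement alloc; empty if never rescheduled
}

// violatesInvariant reports whether an alloc is in the state that should
// be impossible: still running even though a replacement was recorded.
func violatesInvariant(a allocView) bool {
	return a.ClientStatus == "running" && a.NextAllocation != ""
}

// leaked returns the IDs of all allocs that violate the invariant,
// preserving input order.
func leaked(allocs []allocView) []string {
	var ids []string
	for _, a := range allocs {
		if violatesInvariant(a) {
			ids = append(ids, a.ID)
		}
	}
	return ids
}
```

A failed alloc with `NextAllocation` set is the expected case and is not flagged; only the running-and-replaced combination is.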

@cgbaker observed this issue in #5921 (comment). We've also had bugs in the past where finished allocations were re-run after a client restart, e.g. #6354, #5945.

So while we need to keep digging into understanding the underlying cause, I propose this "band-aid" to at least recover smoothly from the bad state.

@notnoop notnoop requested a review from cgbaker September 15, 2020 01:48
@cgbaker cgbaker left a comment

looks good to me...

@notnoop notnoop merged commit 45ffcc5 into master Sep 15, 2020
@notnoop notnoop deleted the b-running-next-allocation branch September 15, 2020 15:00
@notnoop notnoop added this to the 0.12.5 milestone Sep 16, 2020
notnoop pushed a commit that referenced this pull request Sep 16, 2020
Stop already rescheduled but somehow running allocs
teutat3s pushed a commit to teutat3s/nomad that referenced this pull request Oct 27, 2020
…ation

Stop already rescheduled but somehow running allocs
@github-actions
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2022