Stop already rescheduled but somehow running allocs #8886
Merged
This is a band-aid fix for a case where an alloc has been rescheduled but is somehow left in a running state. Currently, such an alloc is left running uninterrupted because the scheduler removes it from consideration. It will keep running even after a new job version is pushed, resulting in a mixed fleet, and operators must manually force-stop these "leaked" allocations.
This PR fixes the issue by reconsidering, for scheduling purposes, an alloc that has been rescheduled but is still running. The check in `reconcile_util.go` meant that once an allocation was rescheduled, it would never be examined again, as it was removed from the untainted allocations.
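A minimal sketch of the idea (not the exact diff): an allocation that already has a replacement (`NextAllocation != ""`) but is still desired and observed as running should be flagged for stopping rather than dropped from consideration. The stand-in `Allocation` type and the helper below are simplified assumptions for illustration; the real fields live in Nomad's `nomad/structs` package and the real check in the scheduler's `reconcile_util.go`.

```go
package main

import "fmt"

// Allocation is a simplified stand-in for the Nomad fields involved;
// the real type lives in nomad/structs. Assumption for illustration only.
type Allocation struct {
	ID             string
	DesiredStatus  string // e.g. "run" or "stop"
	ClientStatus   string // e.g. "running", "failed"
	NextAllocation string // ID of the replacement alloc; "" if none
}

// shouldStopAlreadyRescheduled captures the band-aid: an alloc that has
// already been replaced (NextAllocation != "") but is still running must
// be stopped instead of being removed from reconciliation entirely.
func shouldStopAlreadyRescheduled(a *Allocation) bool {
	return a.NextAllocation != "" &&
		a.DesiredStatus == "run" &&
		a.ClientStatus == "running"
}

func main() {
	leaked := &Allocation{
		ID:             "alloc-1",
		DesiredStatus:  "run",
		ClientStatus:   "running",
		NextAllocation: "alloc-2", // already rescheduled, yet still running
	}
	fmt.Println(shouldStopAlreadyRescheduled(leaked)) // true: stop it
}
```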
But how did it get here?
It is very unclear how an alloc can get into this state. In all of my testing so far, only failed allocs can be rescheduled, and once they are rescheduled, `alloc.DesiredStatus` is set to `stop`. So theoretically, we should never see a running allocation with `NextAllocation != ""`. @cgbaker observed this issue in #5921 (comment). We've also had bugs in the past where finished allocations get re-run upon a client restart, e.g. #6354, #5945.
So while we need to keep digging to understand the underlying cause, I propose this "band-aid" so we can at least recover smoothly from the bad state.