Drain allocation status is ignored when draining at capacity. #16117
Comments
Hi @stswidwinski! First some background: note that there's a difference between the […] For the case of […]
But you'd expect the drainer to respect the […]
This makes sense. Thank you! Looking forward to the patch :)
This issue is fixed by #14348, which will ship in the next regular patch release of Nomad.
I have just tested this against 1.5.5 and the bug as described still occurs in the same way. The repro remains the same, except now it's against 1.5.5.

@tgross, I think that your patch changes the handling of stopping allocations correctly in the case of non-blocked evaluations, but leaves the blocked evaluation case in the old state. Do you mind taking another look?
Re-opening.
Nomad version
Nomad v1.4.2 (039d70e)
However, this repros on v1.4.3 as well.

Operating system and Environment details
These do not matter. Unix/Linux.
Issue
When running a cluster at capacity, a drain of a node which has `service` allocations running on it will create an evaluation which is Pending. This Pending evaluation is solved for immediately as soon as more capacity is added, resulting in multiple allocations running for a single job at the same time, especially with large kill timeouts. Under normal circumstances we expect that the allocation which is being drained blocks the creation of any new allocation until it reaches a terminal state.
Reproduction steps
Let us begin with the local setup. We will want two clients and one server. The first server and client are created using the usual, boring setup. Please note however that we set the max kill timeout to something considerable, such as an hour:
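A minimal sketch of what the first agent's config could look like, assuming a combined server + client node; the file name, data_dir, and the one-hour ceiling are illustrative rather than taken verbatim from the original setup:

```hcl
# node1.hcl -- hypothetical name for the first (server + client) agent
data_dir = "/tmp/nomad-node1"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true

  # Raise the ceiling so that jobs may request an hour-long kill_timeout.
  max_kill_timeout = "1h"
}

# Enable the raw_exec driver used by the repro job.
plugin "raw_exec" {
  config {
    enabled = true
  }
}
```

Started with something like `nomad agent -config node1.hcl`.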
The second client is set up analogously, but we cannot use `nomad agent -dev` as easily. To avoid port conflicts we start it from a separate config file with its default ports moved; see the sketch below. After this setup we have two nodes with raw_exec enabled. Just as a sanity check, both nodes should show up as ready in `nomad node status`.
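Here is one way the second client's config could look; the port numbers, data_dir, and server address are assumptions chosen only to avoid clashing with the first agent's defaults:

```hcl
# node2.hcl -- hypothetical name for the second (client-only) agent
data_dir = "/tmp/nomad-node2"

# Move all default ports (4646/4647/4648) out of the first agent's way.
ports {
  http = 5656
  rpc  = 5657
  serf = 5658
}

client {
  enabled = true

  # Point the second client at the first agent's RPC port.
  servers = ["127.0.0.1:4647"]

  max_kill_timeout = "1h"
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}
```

```sh
nomad agent -config node2.hcl

# Sanity check: both nodes should be listed as ready.
nomad node status
```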
Then, start a job:
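A sketch of such a job; the job name, command, and the one-hour kill_timeout are assumptions consistent with the max_kill_timeout set above:

```hcl
# sleep.nomad -- hypothetical job spec
job "sleep" {
  datacenters = ["dc1"]
  type        = "service"

  group "sleep" {
    count = 1

    task "sleep" {
      driver = "raw_exec"

      # A long kill timeout keeps the drained allocation alive well after the drain starts.
      kill_timeout = "1h"

      config {
        command = "/bin/sleep"
        args    = ["86400"]
      }
    }
  }
}
```

```sh
nomad job run sleep.nomad
```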
Now, let us flip the node with no allocations to be unavailable. We want to simulate the situation in which we are running at full capacity:
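For example, using the standard node CLI (the placeholder ID is whichever node currently has no allocations, as shown by `nomad node status`):

```sh
# List the nodes, then mark the empty one as ineligible so the cluster
# is effectively running at full capacity.
nomad node status
nomad node eligibility -disable <empty-node-id>
```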
The job continued to run just fine. Now, let us drain the node on which the job is currently running and inspect the state of allocations:
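Something along these lines; `-detach` returns immediately instead of monitoring the drain, and the node/alloc IDs are placeholders:

```sh
# Drain the node currently running the allocation.
nomad node drain -enable -detach <busy-node-id>

# Inspect the job and its allocations: the old allocation should be
# desired "stop" but still running, and the replacement evaluation pending.
nomad job status sleep
nomad alloc status <alloc-id>
```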
Now, let us make the node that had nothing running on it eligible again.
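For example:

```sh
nomad node eligibility -enable <empty-node-id>
```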
And to our surprise, the job which should have just one allocation has... two! Both running.
Expected Result
The behavior should be consistent with a regular drain, in which we do not schedule additional allocations until the old allocation is in a terminal state.
Actual Result
We schedule extra allocations and ignore the state of the old ones.
The logs don't contain much insight into what happened.