Allocations with a ClientStatus="running" and a DesiredStatus="stop" are removed by garbage collection, even if the process managed by the allocation continues running.
If the garbage collection is forced on a server, the server is no longer aware of the allocation, but the client still seems to track it in some fashion.
If the garbage collection is done on a client, the server still seems to think the allocation exists, while the client gets into an inconsistent state. For example, the client tries to delete the allocation directory, but if the allocation's process is logging to that directory, the client deletes everything except the file that the process is using.
This seems to have some impact on job updates, since a replacement allocation is created with ClientStatus="pending" and DesiredStatus="run", but it does not actually start running until the allocation it is replacing reaches ClientStatus="complete" and DesiredStatus="stop". In my case, I'm seeing hundreds of allocations stuck in the pending/run state that never actually start up, so it occurred to me this could be related.
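To get a rough count of how many allocations sit in each state combination, something like the following sketch could help. It assumes the allocation list has already been fetched as JSON (for example from the `/v1/allocations` HTTP endpoint) and that each entry carries `ID`, `ClientStatus`, and `DesiredStatus` fields. The sample data and IDs are made up for illustration. Note that pending/run alone does not prove an allocation is stuck; it only narrows down the candidates.

```python
from collections import Counter


def count_by_state(allocations):
    """Count allocations per (ClientStatus, DesiredStatus) pair."""
    return Counter((a["ClientStatus"], a["DesiredStatus"]) for a in allocations)


def pending_run_ids(allocations):
    """Return IDs of allocations that are pending but desired to run."""
    return [
        a["ID"]
        for a in allocations
        if a["ClientStatus"] == "pending" and a["DesiredStatus"] == "run"
    ]


# Hypothetical sample data; real entries would come from the Nomad API.
SAMPLE_ALLOCATIONS = [
    {"ID": "alloc-a", "ClientStatus": "running", "DesiredStatus": "stop"},
    {"ID": "alloc-b", "ClientStatus": "pending", "DesiredStatus": "run"},
    {"ID": "alloc-c", "ClientStatus": "pending", "DesiredStatus": "run"},
    {"ID": "alloc-d", "ClientStatus": "complete", "DesiredStatus": "stop"},
]

counts = count_by_state(SAMPLE_ALLOCATIONS)
candidates = pending_run_ids(SAMPLE_ALLOCATIONS)
```

With the sample data, `counts` maps each state pair to its number of allocations, and `candidates` lists the allocations worth inspecting by hand.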
This issue may come down to the precise definition of when an allocation is in the "terminal state". The docs don't seem to define it exactly: is it when an allocation has ClientStatus="running" and DesiredStatus="stop" (as current behavior seems to indicate), or is it when an allocation has ClientStatus="complete" and DesiredStatus="stop" (which is what I had been expecting)?
It seems the expected behavior would be to never garbage collect an allocation while its ClientStatus is "running", except possibly after some configurable threshold. I don't think I'm hitting that kind of threshold, since I can reproduce it with the steps below within minutes after an allocation has been started.
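The two candidate definitions of "terminal" can be put side by side in a small sketch. This is not Nomad's implementation; the function names are hypothetical, and the status sets beyond "running", "pending", "complete", "run", and "stop" are assumptions added for completeness.

```python
# Assumed status sets; only "complete" and "stop" are taken from the issue text.
TERMINAL_CLIENT_STATUSES = {"complete", "failed", "lost"}
TERMINAL_DESIRED_STATUSES = {"stop", "evict"}


def terminal_by_either(client_status, desired_status):
    """Looser reading: terminal as soon as the server wants the allocation gone."""
    return (
        desired_status in TERMINAL_DESIRED_STATUSES
        or client_status in TERMINAL_CLIENT_STATUSES
    )


def terminal_by_client(client_status, desired_status):
    """Stricter reading: terminal only once the client reports it has finished."""
    return client_status in TERMINAL_CLIENT_STATUSES


# The case from this issue: the task is still draining after a stop request.
loose = terminal_by_either("running", "stop")
strict = terminal_by_client("running", "stop")
```

Under the looser reading the draining allocation counts as terminal and is eligible for garbage collection, which matches the observed behavior. Under the stricter reading it is kept until the client reports a finished state.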
Reproduction steps
Deploy a jobspec that runs a script that handles the Nomad stop signal but does not exit for minutes or hours (simulating a long-duration graceful draining operation).
Observe that the allocation changes to ClientStatus="running" and DesiredStatus="stop".
Force a garbage collection from the server.
Observe that the allocation disappears. Test with "nomad status "
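For the first step, a task along these lines could stand in for the long-draining script. It is a minimal sketch: the class name and parameters are hypothetical, and which signal actually arrives depends on the task driver, the job's kill signal setting, and the operating system, so the SIGINT/SIGTERM pair here is an assumption.

```python
import signal
import time


class DrainingTask:
    """Keeps running for drain_seconds after the first stop request."""

    def __init__(self, drain_seconds, clock=time.monotonic):
        self.drain_seconds = drain_seconds
        self.clock = clock
        self.stop_requested_at = None

    def handle_stop(self, signum=None, frame=None):
        # Record only the first stop request; later signals are ignored.
        if self.stop_requested_at is None:
            self.stop_requested_at = self.clock()

    def should_exit(self):
        if self.stop_requested_at is None:
            return False
        return self.clock() - self.stop_requested_at >= self.drain_seconds

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # Must be called from the main thread.
        for sig in signals:
            signal.signal(sig, self.handle_stop)

    def run(self, poll_seconds=1.0, sleep=time.sleep):
        # Loop until a stop was requested and the drain period has elapsed.
        while not self.should_exit():
            sleep(poll_seconds)
```

A script deployed by the job would create `DrainingTask(drain_seconds=3600)`, call `install()`, then `run()`. The job's kill timeout also has to be long enough that the task is not force-killed before the drain period ends.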
joshuaclausen changed the title from "Garbage collection removes allocations that are draining" to "Garbage collection removes allocations that are still running" on Nov 29, 2018
This PR fixes an edge case where we could GC an allocation that was in a
desired stop state but had not terminated yet. This can be hit if the
client hasn't shut down the allocation yet or if the allocation is still
shutting down (long kill_timeout).
Fixes #4940
Nomad version
Nomad v0.8.6 (ab54ebc+CHANGES)
Operating system and Environment details
Windows Server 2012r2 and 2016