Garbage collection removes allocations that are still running #4940

joshuaclausen · 2018-11-29T22:25:30Z

Nomad version

Nomad v0.8.6 (ab54ebc+CHANGES)

Operating system and Environment details

Windows Server 2012r2 and 2016

Issue

Allocations with a ClientStatus="running" and a DesiredStatus="stop" are removed by garbage collection, even if the process managed by the allocation continues running.

If the garbage collection is done on a server via a forced garbage collection, the server will no longer be aware of the allocation, but the client will still be aware of it in some fashion.

If the garbage collection is done on a client, then the server appears to still think the allocation exists, while the client seems to end up in an inconsistent state. The client will, for example, try to delete the allocation directory, but if the allocation's process is logging to that directory, then the client will delete everything except the file that is being used by the process.

This seems to have some impact on job updates, since a replacement allocation will be created with ClientStatus="pending" and DesiredStatus="run", but it will not actually start running until the allocation it is replacing goes into ClientStatus="complete" and DesiredStatus="stop". In my case, I'm seeing hundreds of allocations that get stuck in the pending->run state, never to actually start up, so it occurred to me this could be related.

This issue may come down to the precise definition of when an allocation is in the "terminal state". The docs don't seem to define it exactly - is it when an allocation has a ClientStatus="running" and a DesiredStatus="stop" (as current behavior seems to indicate), or is it when an allocation has a ClientStatus="complete" and a DesiredStatus="stop" (which is what I had been expecting)?

It seems the expected behavior would be to never garbage collect an allocation if its ClientStatus="running", except possibly after some configurable threshold. I don't think I'm hitting that kind of threshold, since I can reproduce it with the below steps within minutes after an allocation has been started.
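
To make the two candidate definitions concrete, here is a minimal sketch in Go. It is not Nomad's actual code: the `Alloc` type and both function names are hypothetical, and it only covers the status values named in this report.

```go
package main

import "fmt"

// Alloc is a hypothetical, stripped-down stand-in for an allocation.
// It is NOT Nomad's real struct; it only carries the two fields
// discussed in this issue.
type Alloc struct {
	ClientStatus  string
	DesiredStatus string
}

// gcEligibleObserved models the behavior described in this report:
// the allocation is treated as collectable as soon as it is desired
// to stop, regardless of what the client reports.
func gcEligibleObserved(a Alloc) bool {
	return a.DesiredStatus == "stop"
}

// gcEligibleExpected models the behavior I had been expecting:
// the allocation is collectable only once the client also reports
// that it has completed.
func gcEligibleExpected(a Alloc) bool {
	return a.DesiredStatus == "stop" && a.ClientStatus == "complete"
}

func main() {
	draining := Alloc{ClientStatus: "running", DesiredStatus: "stop"}
	finished := Alloc{ClientStatus: "complete", DesiredStatus: "stop"}

	// The two definitions disagree only for the draining allocation.
	fmt.Println(gcEligibleObserved(draining), gcEligibleExpected(draining))
	fmt.Println(gcEligibleObserved(finished), gcEligibleExpected(finished))
}
```

The two predicates differ only for an allocation that is still running after being asked to stop, which is exactly the case hit during a long graceful drain.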

Reproduction steps

1. Deploy a jobspec that runs a script that handles the Nomad stop signal but does not exit for minutes or hours (to simulate a long-duration graceful draining operation).
2. Observe that the allocation changes to having a ClientStatus="running" and a DesiredStatus="stop".
3. Force a garbage collection from the server.
4. Observe that the allocation disappears. Test with "nomad status "


This PR fixes an edge case where we could GC an allocation that was in a desired stop state but had not terminated yet. This can be hit if the client hasn't shut down the allocation yet or if the allocation is still shutting down (long kill_timeout). Fixes #4940

github-actions · 2022-11-27T02:24:05Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

joshuaclausen changed the title ~~Garbage collection removes allocations that are draining~~ Garbage collection removes allocations that are still running Nov 29, 2018

endocrimes added the theme/client label Nov 30, 2018

dadgar added the theme/core label Dec 4, 2018

dadgar mentioned this issue Dec 5, 2018

Don't GC running but desired stop allocations #4965

Merged

dadgar closed this as completed in #4965 Dec 7, 2018

github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2022
