Stuck allocation on dead job #806
Comments
@far-blue I could reproduce this, and thanks for reporting.
I ran into this exact issue when trying out 0.3.0-rc2. As far as I can tell, the only way to clear out the orphaned allocation is to clobber the Nomad servers and remove all existing state :/
@diptanu, is there someone actively working on this? If not, I would be willing to take a crack at it.
@dgshep Yes! We might be able to tackle this in the next release.
Very cool. BTW, congrats on the C1M project! Stellar stuff...
I am seeing this on Nomad v0.5.4. I had a job that no longer exists, with an allocation stuck on a node, trying to pull a container image that no longer exists and receiving a 400 from the registry. Is this a regression, or have I triggered something completely new for some reason?
I'm new to all this, so maybe I've just missed something, but I appear to have an orphaned allocation from a dead job that failed to start completely.
Context: running v0.3.0-rc1 in the dev environment created by the included Vagrantfile, with the agent in -dev mode (a single agent acting as both server and client).
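For reference, the environment was brought up roughly like this (a sketch; it assumes the stock Vagrantfile from the Nomad repo, and sudo may not be needed on every machine):

```sh
# Bring up the bundled dev VM and open a shell in it.
vagrant up
vagrant ssh

# Inside the VM: start a single agent acting as both server and client.
sudo nomad agent -dev
```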
I started with a modified version of the example.nomad file created by `nomad init`: I changed the task to run a mysql container and added a second task to run an apache container. I started the job with `nomad run`, but it failed to complete because I'd typo'd the apache container image name. At this point I had a mysql container running but no apache container.
So I edited the job to correct my typo and called `nomad run` again. My understanding was that it would evaluate the difference and just start the apache container (because the mysql container was already running). However, it actually re-evaluated the entire job and started both the apache container and a second mysql container, while leaving the original container running. Note that I have not changed the name of the job or the task group (I left them as example and cache, as per the original job config).
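To recap, the command sequence was roughly the following (a sketch; the editor and the exact image names are illustrative):

```sh
# Generate the stock example job, then edit it: change the existing task to
# run a mysql container and add a second task for an apache container
# (the apache image name contained a typo at this point).
nomad init
vi example.nomad

# First run: the apache task fails because of the typo'd image name,
# but the mysql container comes up.
nomad run example.nomad

# Fix the typo and resubmit. Expected: only the missing apache container
# starts. Observed: a second mysql container and an apache container start,
# and the original mysql container is left running.
vi example.nomad
nomad run example.nomad
```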
So I called `nomad stop`, thinking it would clean everything up, but it only stopped the new containers, leaving the original mysql container. I thought maybe Nomad had 'forgotten' about it, so I killed it with Docker directly, but Nomad put it back. So now I have a mysql container that Nomad is keeping alive but no job to control it with.
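In case it's useful, this is roughly how the leftover allocation still shows up (a sketch; command names may vary between Nomad versions, and <alloc-id> is a placeholder):

```sh
# Stopping the job removed the new containers but left the original mysql
# container running.
nomad stop example

# The job is dead, yet an allocation is still listed for it.
nomad status example

# Inspect the leftover allocation (replace <alloc-id> with the ID from above).
nomad alloc-status <alloc-id>

# The container Nomad keeps restarting is still visible to Docker.
docker ps
```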
So I'm not quite sure what to do next and I'm pretty certain this is not expected behaviour.
Any thoughts, anyone?