Docker stop deadlocks intermittently #1300
Comments
Hi, do you know any way to reproduce this? Or at least see it more often?
Ping. Are you still seeing this @kstaken? Anything else you can give us to reproduce?
I'm getting this on an Ubuntu 13.10 daily image. Seems to happen for every container which is …
To add another bit of information: …
@pwaller @shykes @vieux I was looking at another issue and I feel these two are related. Do have a look at my observations at #1906 (comment). Also, I am seeing this on Ubuntu 13.10; I haven't tried it on older versions yet.
I have the same problem with stopping and attaching to a container running postgresql (sudo docker pull zaiste/postgresql). OSX Mavericks running Vagrant -> Ubuntu 13.10 (saucy) -> running 'base' docker container.
I am experiencing this issue in a Digital Ocean docker droplet. The first time I run …
It looks like I'm running into this one as well; the host is Ubuntu 13.10 (running a custom 13.10 container).
Is this bug still present now that Docker uses straight libcontainer by default?
I'm not sure, maybe someone who can reproduce can help.
I just upgraded to Docker 0.9 to give it a try, but unfortunately the …
@Kermit666 do you have a reliable way to reproduce?
Well, it's happening on my server (a Digital Ocean Docker droplet). It's an … This is the image I am using (though I think the same thing happened with …): https://index.docker.io/u/cloudfleet/simple-ldap/
I start it with: …
And stop it with …
@Kermit666 awesome, i'll take a look
I'm getting this error too. I'm not sure if this is related, but I'm seeing this in my docker logs:
[error] mount.go:11 [warning]: couldn't run auplink before unmount: exec: "auplink": executable file not found in $PATH
... It looks like auplink flush is failing? (I have no idea what that is)
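For context on that log line: the "executable file not found in $PATH" wording is the standard error Go's os/exec package produces when a binary is missing, and the aufs storage driver only logs the failed auplink call as a warning before unmounting. Below is a minimal sketch of how such a warning arises; the mount path and the auplink arguments are illustrative assumptions, not Docker's exact call.

```go
// Sketch: the warning quoted above is what you get when os/exec cannot find
// the auplink binary (shipped by aufs-tools) on $PATH.
package main

import (
	"log"
	"os/exec"
)

func main() {
	mountpoint := "/var/lib/docker/aufs/mnt/example" // hypothetical path

	// Illustrative "auplink <mountpoint> flush" call; if aufs-tools is not
	// installed, exec fails before the command ever runs and only a warning
	// is logged.
	if err := exec.Command("auplink", mountpoint, "flush").Run(); err != nil {
		log.Printf("[warning]: couldn't run auplink before unmount: %s", err)
	}
}
```

On distributions that ship an aufs-tools package, installing it is what puts auplink on $PATH; as noted further down in this thread, it isn't available everywhere.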
@developit Are you sure that …
@unclejack I can't seem to get …
@developit What's your Linux distribution?
@unclejack CentOS 6.
Multiple problems are being discussed on this issue and I'll try to address them all. The original problem around docker stop deadlocking: Docker has improved a lot since this was reported. … @developit's problem: that auplink problem isn't fixable without aufs-tools. Please upgrade to CentOS 7 if you wish to use a recent kernel in a supported setup. I'll close this issue now.
When you say Docker has improved a lot, what version are you talking about? We are still regularly experiencing this issue on Docker 1.0. In combination with #5684, we lose our data every time. I checked the changelogs for later versions but didn't see anything mentioned about stability and reliability.
I'm experiencing the same issue with …
@mrdfuse It's not only one thing; there have been a lot of changes which would fix that. The kernel matters as well. Installing the updates provided by your distribution is important. Would you mind providing the full output of …? @dmitry Could you provide the full output of …?
docker info: …
docker version: …
uname -a: …
cat /etc/redhat-release: …
It scares me a bit that dmitry has the same problem with the latest Docker version, and unfortunately upgrading our kernel is not an option (not in our control).
@mrdfuse There are known issues with older kernels and devicemapper. Not keeping up with the latest kernel updates from your distribution (what you get when you install system updates) is how you can run into these problems. Since @dmitry hasn't provided the information regarding his environment, I'll stop discussing his problems any further. 2.6.32 is a kernel which is supported by Red Hat, but it needs to be kept up to date with the latest system updates. The kernel you are running is one of the newest versions provided by RHEL 6.5. Red Hat has released RHEL 6.6, and the kernel version I'm seeing on a CentOS 6.6 system is 2.6.32-504.1.3.el6; I've also just received updates for devicemapper-related packages. Updates for systems running RHEL and RHEL derivatives aren't optional when it comes to running containers on these systems: they're vital because they bring in important bug fixes from newer kernels and packages. @vbatts Can you also look into this and/or provide some advice, please?
@unclejack Sorry, I haven't had time to provide that information.
Ok, thanks for the explanation. We outsource our infrastructure and that company wants to keep its systems stable, i.e. not update constantly (only security patches). Since I can't guarantee this issue won't occur again even if they upgraded (again, looking at @dmitry's data, he seems up to date), I'm going to stop using Docker for the more critical applications (with regret).
@dmitry The kernel you are running is no longer supported by Canonical on any of their distributions. Bugs have been reported on this issue tracker which affect kernel 3.8:
Many other smaller and less visible bugs have been fixed in newer kernels, and many of them also affect 3.8.
WARNING: Kernel 3.8 has bugs and it has not been supported in any way on any distribution since August 2014. Kernel 3.8 isn't updated to fix severe bugs which can cause data loss, and it's also not updated to fix security vulnerabilities. Canonical stopped supporting this kernel in favor of a newer and better kernel. Updating to that …
You can find out more about this here: https://wiki.ubuntu.com/Kernel/LTSEnablementStack
As you can see from the kernel support schedule linked above, support for kernel 3.8 ended officially in August 2014. Even for a 3.8 kernel, you have an old kernel; I recall seeing 3.8.0-3x on one of my systems before the upgrade to 3.13. Please upgrade to kernel 3.13 (…).
No amount of Docker updates will fix the kernel bugs on kernel 3.8.
@mrdfuse If you don't keep your system up to date with updates from Red Hat, there isn't much to do. Red Hat is updating kernel 2.6.32 to make sure it's stable and that the system they're offering to their customers is stable. Keeping systems fully patched and updated is necessary.
@unclejack Thank you for such a great explanation! I will definitely upgrade the server.
I understand completely, it's just not under my control.
@mrdfuse …
It only happened a few times when running about 6 containers for a few months non-stop. I don't think it will be easy to reproduce.
I have something happening like this just now on Ubuntu. Ubuntu itself is running in a KVM. I've dockered a lot of stuff already over the last month and never had this problem so far:
# docker info
Containers: 10
Images: 99
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Dirs: 119
Execution Driver: native-0.2
Kernel Version: 3.13.0-43-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 2
Total Memory: 1.955 GiB
# docker ps
CONTAINER ID  IMAGE                            COMMAND               CREATED       STATUS       PORTS                   NAMES
f32bedd01e50  mulle/firefox-syncserver:latest  "/usr/bin/make serve  14 hours ago  Up 14 hours  0.0.0.0:5000->5000/tcp  firefox-syncserver
but docker stop firefox-syncserver has just been hanging there for 15 minutes now. There is nothing interesting in /var/log/syslog. ... So now I Ctrl-C'ed it and tried to do it again, and this is what happened:
^Croot@muhbuntu:~/docker# docker stop firefox-syncserver
Error response from daemon: Cannot stop container firefox-syncserver: no such process
FATA[0000] Error: failed to stop one or more containers
root@muhbuntu:~/docker# docker ps
CONTAINER ID  IMAGE                            COMMAND               CREATED       STATUS       PORTS                   NAMES
f32bedd01e50  mulle/firefox-syncserver:latest  "/usr/bin/make serve  14 hours ago  Up 14 hours  0.0.0.0:5000->5000/tcp  firefox-syncserver
Definitely inconsistent. I did service docker restart to fix this.
Hi, I'm not yet sure whether that's an LXC or a Docker problem. Please check the artifacts below: …
Hitting this on …
docker info: …
strace: …
docker.log: …
Alright, that container seems to be gone now for whatever reason. I can stop other containers fine. Will keep an eye out for it again. EDIT: after being able to stop/rm and recreate that container on the same host, I am getting deadlocks again; the logs are the same:
Aaaaaand it's gone again. SIGKILL on the docker process doesn't fix anything. EDIT2: a restart of the host allowed me to stop/rm the container, but yet AGAIN, if I recreate it, I cannot stop it. This is very bad.
So I tried reinstalling the docker package as suggested in #10684 (comment), and even that didn't work! I have now concluded that the only way to fix this is to destroy and recreate my host, which is very disturbing. Can this please be reopened, @jfrazelle @unclejack @crosbymichael? I will keep this host up for debugging.
I have the same issue on: … Server: … Has this been re-opened? Regards,
I'm seeing the same thing for one of our docker hosts. Any guidance? We're using: …
I am experiencing the same problem, with strace output similar to @MrMMorris
The
We can see that the rm's parent bash process is zombified, so the signals sent to it get lost.
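A zombie can't handle signals, so a SIGTERM aimed at a defunct process simply vanishes, which matches what's described above. For reference only (this is not Docker code), the usual mitigation is a tiny PID 1 inside the container that forwards signals to the real workload and reaps exited children, in the spirit of tools like tini or dumb-init. A minimal sketch follows; the workload command is an illustrative assumption.

```go
// minimal-init sketch: forward termination signals to the workload and reap
// zombies, so "docker stop" signals aren't lost inside the container.
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Illustrative stand-in for the container's real main process.
	cmd := exec.Command("/usr/bin/make", "serve")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	sigs := make(chan os.Signal, 32)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGCHLD)

	for sig := range sigs {
		if sig == syscall.SIGCHLD {
			// Reap every exited child so nothing lingers as a zombie.
			for {
				var status syscall.WaitStatus
				pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
				if pid <= 0 || err != nil {
					break
				}
				if pid == cmd.Process.Pid {
					os.Exit(status.ExitStatus())
				}
			}
			continue
		}
		// Forward termination signals to the workload instead of dying
		// ourselves and leaving orphans behind.
		cmd.Process.Signal(sig)
	}
}
```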
Same here, I cannot stop a CentOS container:
docker ps …
$ docker stop 4cb482d2c1f9
strace docker stop centos
The complete file is here: https://drive.google.com/open?id=0BzYXUSsUVFR_RGlWLVZBY2ZDWjQ
Docker host: …
That strace looks to be incorrect; …
@thaJeztah How about this strace for my hung container:
Is that more helpful? |
See also: #18758 |
@sleaze just asked around, but that strace probably doesn't contain useful information, because that's from the client, not the daemon.
@thaJeztah I appreciate you looking into this for me, seriously you're the man!!! I actually found a hack-ish workaround based on @SamSaffron's brilliant observations. The "problem" goes away as long as … I wonder how many other people in this thread are experiencing the same thing as @SamSaffron and I... This certainly makes Docker seem ill-equipped to handle many real-life situations compared to things like LX[CD], which natively support multi-process environments and use cases.
I do. My previous comment is explained by @SamSaffron's observations.
There seems to be an issue in docker stop that causes it to intermittently deadlock and hang the execution context performing the stop. It seems to be a race condition as it's very sporadic.
I've tracked the issue down to a block on container.Wait() here: https://github.com/dotcloud/docker/blob/master/container.go#L886 The prior step is that Docker tries to kill the process directly, which doesn't error but also doesn't actually succeed in killing it.
In this scenario the lxc container is already gone but the process is still running and docker thinks the container still exists. When it's in this state docker stop on the container just hangs because it waits to acquire the lock. Killing the process manually will cause docker to think the container exited and will free the lock. It requires a kill -9 to kill the process but I'm not sure why Process.Kill isn't able to do it directly.
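For illustration, the behaviour docker stop is aiming for boils down to the pattern below: send SIGTERM, wait with a timeout, then fall back to SIGKILL. This is a simplified sketch, not the code at container.go#L886; the deadlock described above corresponds to the wait never returning even though the kill call reports no error.

```go
// Sketch of the signal-then-wait pattern behind stopping a container, with a
// timeout so the caller cannot block forever on Wait.
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// stopProcess asks pid to exit with SIGTERM and escalates to SIGKILL if it is
// still running when the timeout expires.
func stopProcess(pid int, exited <-chan struct{}, timeout time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM to %d: %w", pid, err)
	}
	select {
	case <-exited:
		return nil // the process went away after SIGTERM
	case <-time.After(timeout):
		// SIGTERM was reported as delivered but the process is still there
		// (the situation described in this issue); fall back to SIGKILL,
		// which cannot be caught or ignored.
		return syscall.Kill(pid, syscall.SIGKILL)
	}
}

func main() {
	// Stand-in workload; "sleep" merely plays the role of the container's
	// main process here.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	exited := make(chan struct{})
	go func() {
		cmd.Wait() // the call this issue reports as blocking forever
		close(exited)
	}()

	if err := stopProcess(cmd.Process.Pid, exited, 10*time.Second); err != nil {
		fmt.Println("stop failed:", err)
	}
	<-exited
	fmt.Println("stopped")
}
```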
Here's the log of a hung session (the last line is a debug I added right before L886 of container.go):