Docker does not catch container exit #2306

Raffo · 2017-12-28T11:35:56Z

Issue Report

Bug

Docker does not correctly catch the container exit.
The same issue is described in moby/moby#33820. At this stage it is unclear whether it is related to the Docker build used in Container Linux, which is why I am opening it here.

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1576.4.0
VERSION_ID=1576.4.0
BUILD_ID=2017-12-06-0449
PRETTY_NAME="Container Linux by CoreOS 1576.4.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS EC2 m4.large

Expected Behavior

Docker should correctly catch the container exit.

Actual Behavior

Docker does not catch the container exit: an exited container can no longer be found in the process tree, yet it is still visible via docker ps.
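A minimal sketch of how this mismatch could be checked on an affected host. It relies on `docker ps -q` and `docker inspect -f '{{.State.Pid}}'`; the function names `pid_alive` and `find_stale_containers` are hypothetical helpers, not part of Docker, and the check assumes a Linux host with /proc mounted.

```shell
#!/bin/sh
# Sketch: list containers that `docker ps` reports as running
# but whose main process no longer exists on the host.
# Helper names are illustrative only.

# pid_alive PID: succeeds if PID is non-empty, non-zero and /proc/PID exists.
pid_alive() {
    [ -n "$1" ] && [ "$1" != "0" ] && [ -d "/proc/$1" ]
}

# find_stale_containers: prints "ID PID" for every running container
# whose recorded main PID is missing from /proc.
find_stale_containers() {
    for id in $(docker ps -q); do
        pid=$(docker inspect -f '{{.State.Pid}}' "$id" 2>/dev/null)
        if ! pid_alive "$pid"; then
            echo "$id $pid"
        fi
    done
}
```

Calling `find_stale_containers` prints nothing on a healthy host. If dockerd itself has stopped responding, `docker ps` may block, so the absence of output is only meaningful when the command returns.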

Reproduction Steps

Once the problem happens:

$ docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]

Other Information

Docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64
 Experimental: false

docker info:

Containers: 33
 Running: 26
 Paused: 0
 Stopped: 7
Images: 21
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.13.16-coreos-r2
Operating System: Container Linux by CoreOS 1576.4.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.799GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Live Restore Enabled: false


Deshke · 2018-01-11T09:55:28Z

Could be the same thing here:

DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1576.5.0
DISTRIB_CODENAME="Ladybug"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1576.5.0 (Ladybug)"

PLEG is going crazy because it cannot connect to a container that no longer exists:

Jan 11 09:48:35 ip-172-25-104-55.us-east-2.compute.internal env[1157]: time="2018-01-11T09:48:35.606000889Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 8824044248a7efceb04e1744fbd2b4ea4faa7d4b8394a8b70b29ef93e8f2d59c: rpc error: code = Unknown desc = containerd: container not found"

docker info:

Containers: 106
 Running: 51
 Paused: 0
 Stopped: 55
Images: 27
Server Version: 17.09.0-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.14.11-coreos
Operating System: Container Linux by CoreOS 1576.5.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.43GiB
Name: ip-172-25-104-55.us-east-2.compute.internal
ID: XQ2B:UBZW:PXBF:6ME4:IFA2:H4LQ:H7WN:UUWS:AESL:FVM5:W2V6:VNZ4
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: livevideocloud
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Raffo · 2018-01-11T10:13:17Z

@Deshke You can make sure that it is the same problem by running:

docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit

If your terminal just hangs, you have the same problem.
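For unattended hosts, the same check could be scripted with a timeout. This is only a sketch: it assumes a non-interactive `docker run` hits the same hang as the interactive one above, and the function name `check_exit_hang` and the 30-second default are arbitrary choices. It relies on coreutils `timeout`, which exits with status 124 when the time limit is reached.

```shell
#!/bin/sh
# check_exit_hang [SECONDS]: run a throwaway container and report whether
# `docker run` returns within the time limit (default 30 seconds).
check_exit_hang() {
    limit="${1:-30}"
    status=0
    timeout "$limit" docker run --rm ubuntu /bin/true || status=$?
    if [ "$status" -eq 124 ]; then
        # timeout(1) exits with 124 when it had to stop the command.
        echo "hang: docker run did not return within ${limit}s"
        return 1
    elif [ "$status" -ne 0 ]; then
        echo "error: docker run exited with status $status"
        return 2
    fi
    echo "ok: container exit was detected"
}
```

Note that `timeout` only stops the docker client. If the hang occurs, the test container is likely to remain listed in docker ps.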

lucab · 2018-01-11T10:56:16Z

For reference, I tried to reproduce the ubuntu-bash-exit hang on all current versions of docker across stable (17.09.0-ce), beta (17.09.1-ce) and alpha (17.11.0-ce) without luck so far, so there may be additional environmental factors triggering this (or increasing the chance of a race).
For those who can semi-reliably reproduce this, it may be helpful to check if this also happens on beta and alpha.

However, the original report on Debian suggests that this is a generic upstream docker issue, which is better triaged on the moby tracker. If you have additional details, please follow up at moby/moby#33820. I'm keeping this ticket open to track future resolution status in CL channels.

Raffo · 2018-01-11T11:01:21Z

Thanks for having a look, @lucab. I think the thing we have in common is running lots of docker containers on the same host (and using Kubernetes). Maybe in such an environment the issue becomes more frequent? Anyway, we have currently rolled back to 1.12.06, and the issue disappeared with the same version of Container Linux.

ghost · 2018-01-11T23:44:38Z

@lucab - I couldn't reproduce it manually either.
Our k8s clusters run our CICD builds, so we schedule thousands of pods a day. We start seeing the issue after a few hours on some nodes, and eventually all nodes show the symptoms.

CoreOS Beta is affected too. I'm going to put a few nodes on Alpha and will report back with the results.

Deshke · 2018-01-12T06:49:51Z

@Raffo I can reproduce it on an instance that already has a zombie container running.

So far I can reproduce this with the current stable, beta and alpha images. On the alpha image it currently takes a day until docker is unresponsive.

(instance response time = time of the PLEG health check)

chrisferry · 2018-02-12T15:58:11Z

We are seeing this issue as well, which then causes PLEG issues and finally general k8s cluster instability.
Docker version 17.09.1-ce, build 19e2cf6
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0

Going to revert to 1.12.06, hopefully that will solve the problem for now.

tyranron · 2018-02-13T09:58:49Z

Same problem here. Reverting to 1.12.06 solved the PLEG issue.

lucab · 2018-04-11T10:01:15Z

A runc race was recently fixed via opencontainers/runc#1698, and the fix has been backported to docker 17.12.1, which we are currently shipping in the beta and stable channels.

I suspect that race is related to this bug and that the fix therefore resolves it, but I have no way to verify that. It would be good if anybody previously affected by this could check whether the issue is still present with docker 17.12.1.

euank · 2018-04-16T21:19:16Z

I agree with @lucab's assessment that this may be that runc issue, which should be fixed in all current channels with docker-ce 17.12.1 and newer.

To confirm whether that's the bug you're encountering, once dockerd has hung, send it and containerd a SIGUSR1 to collect stack traces.
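A minimal sketch of sending the signal, to be run as root. The function name `dump_stacks` is hypothetical, and the process names passed to it are assumptions: the containerd bundled with docker of this era may run as `docker-containerd` rather than `containerd`, so check `ps` first.

```shell
#!/bin/sh
# dump_stacks NAME...: send SIGUSR1 to every process with one of the given
# names, so that each daemon dumps its goroutine stack traces.
dump_stacks() {
    for name in "$@"; do
        # pidof prints all matching PIDs and fails if there are none.
        pids=$(pidof "$name") || { echo "no process named $name"; continue; }
        for pid in $pids; do
            kill -USR1 "$pid" && echo "sent SIGUSR1 to $name ($pid)"
        done
    done
}
```

For example, `dump_stacks dockerd containerd`. The traces should then appear in the daemon logs (for instance via `journalctl -u docker.service`), or in a file whose path the daemon logs, depending on the docker version.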

If that's the bug, dockerd's output should include stack traces similar to those here.

@chrisferry since you saw an issue similar to this on 17.12.1, can you open a new issue (here or against the upstream docker project as appropriate) with the Container Linux version information + goroutine stacks?

euank · 2018-05-16T21:21:02Z

Per my previous comment, I'm closing this with the hope it's fixed as of docker-ce 17.12.1, and thus fixed in all channels. If you still see this issue, let us know.

Raffo mentioned this issue Dec 28, 2017

Docker does not catch container exit moby/moby#33820

Closed

bgilbert added area/stability component/docker kind/bug needs/repro team/os labels Dec 28, 2017

euank closed this as completed May 16, 2018
