[bug] upgrade/restart container failed on AliOS4.9 #1470

Letty5411 · 2018-06-05T11:04:46Z

Ⅰ. Issue Description

Due to PR:#1307 upgrade/restart related integration tests fail.

Ⅱ. Describe what happened

The error message is similar, so I'll take restart cmd as example.
Run the following script:

#sh -x 1.sh
+ set -x
+ pouch run -d --name test1 busybox
7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
+ pouch restart -t 1 test1
Error: {"message":"failed to create task for container(7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf): OCI runtime create failed: container_linux.go:265: starting container process caused \"process_linux.go:348: container init caused \\\"process_linux.go:331: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: time=\\\\\\\\\\\\\\\"2018-06-05T19:01:49+08:00\\\\\\\\\\\\\\\" level=fatal msg=\\\\\\\\\\\\\\\"no sandbox present for 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf\\\\\\\\\\\\\\\" \\\\\\\\n\\\\\\\"\\\"\": unknown"}

The error will happen when this two CMDs run quickly one after the other in script, it can't be reproduced manually.

Log of pouchd:

DEBU[2018-06-05T19:01:49.019584085+08:00] Calling GET /v1.24/images/busybox/json, client @
DEBU[2018-06-05T19:01:49.01975094+08:00] Get a grpc client                             elapsed=6.867µs
INFO[2018-06-05T19:01:49.027294262+08:00] Calling POST /v1.24/containers/create?name=test1, client @
INFO[2018-06-05T19:01:49.027527851+08:00] create container with args: {"Cmd":null,"Entrypoint":[],"Env":null,"Image":"busybox","OnBuild":null,"Shell":null,"HostConfig":{"NetworkMode":"bridge","OomScoreAdj":-500,"RestartPolicy":{"Name":"no"},"BlkioDeviceReadBps":null,"BlkioDeviceReadIOps":null,"BlkioDeviceWriteBps":null,"BlkioDeviceWriteIOps":null,"BlkioWeightDevice":null,"DeviceCgroupRules":null,"Devices":[],"MemoryExtra":0,"MemorySwappiness":-1,"MemoryWmarkRatio":0,"OomKillDisable":false,"Ulimits":null},"NetworkingConfig":{}}
DEBU[2018-06-05T19:01:49.027614569+08:00] Get a grpc client                             elapsed=1.172µs
DEBU[0271] prepare snapshot                              key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot" parent="sha256:432b65032b9466b4dadcc5c7b11701e71d21c18400aae946b101ad16be62333a"
DEBU[0271] event published                               module="containerd/snapshot" ns=default topic="/snapshot/prepare" type=containerd.events.SnapshotPrepare
DEBU[2018-06-05T19:01:49.031876446+08:00] Get a grpc client                             elapsed=915ns
DEBU[0271] get snapshot mounts                           key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[2018-06-05T19:01:49.032194936+08:00] Get a grpc client                             elapsed=888ns
DEBU[2018-06-05T19:01:49.035978988+08:00] bind volumes: []
DEBU[2018-06-05T19:01:49.035999159+08:00] Get a grpc client                             elapsed=1.012µs
DEBU[0271] stat snapshot                                 key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[2018-06-05T19:01:49.036257307+08:00] Get a grpc client                             elapsed=897ns
INFO[2018-06-05T19:01:49.038856883+08:00] Calling POST /v1.24/containers/test1/start, client @
DEBU[2018-06-05T19:01:49.038947031+08:00] create endpoint for container [7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf] on network [bridge]
DEBU[2018-06-05T19:01:49.03902409+08:00] Assigning addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.039043578+08:00] RequestAddress(LocalDefault/172.17.0.0/24, <nil>, map[])
DEBU[2018-06-05T19:01:49.041043437+08:00] Assigning addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.042951347+08:00] Programming external connectivity on endpoint 7ba712d6 (6ddb9b11e080d087e5d64a126e6137e436ebd3a128ca4b25cf03b62a03af71e6)
DEBU[2018-06-05T19:01:49.043399984+08:00] Get a grpc client                             elapsed=1.19µs
INFO[2018-06-05T19:01:49.043446565+08:00] start to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.043474315+08:00] start to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.043643304+08:00] success to get image registry.hub.docker.com/library/busybox:latest, container id 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.043666661+08:00] Get a grpc client                             elapsed=1.068µs
DEBU[0271] stat snapshot                                 key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[0271] get snapshot mounts                           key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[0271] event published                               module="containerd/containers" ns=default topic="/containers/create" type=containerd.events.ContainerCreate
INFO[2018-06-05T19:01:49.045780117+08:00] success to new container: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[0271] get snapshot mounts                           key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
INFO[0271] shim containerd-shim started                  address="/containerd-shim/default/7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf/shim.sock" debug=false module="containerd/tasks" pid=11965
INFO[0000] Firewalld running: false
DEBU[2018-06-05T19:01:49.257610762+08:00] sandbox set key processing took 117.550858ms for container 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[0271] event published                               module="containerd/tasks" ns=default topic="/tasks/create" type=containerd.events.TaskCreate
INFO[2018-06-05T19:01:49.325172503+08:00] success to create task(pid=11981) in container(7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf)
DEBU[0271] event published                               module="containerd/tasks" ns=default topic="/tasks/start" type=containerd.events.TaskStart
INFO[2018-06-05T19:01:49.338689619+08:00] success to start task in container(7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf)
INFO[2018-06-05T19:01:49.338713881+08:00] success to add container, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.33954004+08:00] Calling GET /v1.24/containers/test1/json, client @
INFO[2018-06-05T19:01:49.379813872+08:00] the task has quit, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf, err: <nil>, exitcode: 0, time: 2018-06-05 11:01:49.379486687 +0000 UTC
DEBU[0271] event published                               module="containerd/events" ns=default topic="/tasks/exit" type=containerd.events.TaskExit
INFO[2018-06-05T19:01:49.408530839+08:00] Calling POST /v1.24/containers/test1/restart?t=1, client @
DEBU[2018-06-05T19:01:49.408603639+08:00] Get a grpc client                             elapsed=2.314µs
INFO[0271] shim reaped                                   id=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/tasks"
DEBU[0271] event published                               module="containerd/containers" ns=default topic="/containers/delete" type=containerd.events.ContainerDelete
INFO[2018-06-05T19:01:49.421445522+08:00] success to destroy container: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.421950564+08:00] close containerio backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.421956289+08:00] finished to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.421966454+08:00] close containerio backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.421993707+08:00] remove endpoint(6ddb9b11e080d087e5d64a126e6137e436ebd3a128ca4b25cf03b62a03af71e6) on network(bridge)
DEBU[2018-06-05T19:01:49.422058146+08:00] Revoking external connectivity on endpoint 7ba712d6 (6ddb9b11e080d087e5d64a126e6137e436ebd3a128ca4b25cf03b62a03af71e6)
INFO[2018-06-05T19:01:49.421964651+08:00] finished to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[0271] event published                               module="containerd/tasks" ns=default topic="/tasks/delete" type=containerd.events.TaskDelete
ERRO[2018-06-05T19:01:49.431106789+08:00] failed to delete container, container id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf: container "7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf" in namespace "default": not found
DEBU[2018-06-05T19:01:49.484179559+08:00] Releasing addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.484201879+08:00] ReleaseAddress(LocalDefault/172.17.0.0/24, 172.17.0.2)
DEBU[2018-06-05T19:01:49.500422311+08:00] start container 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf when restarting
DEBU[2018-06-05T19:01:49.500504067+08:00] create endpoint for container [7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf] on network [bridge]
DEBU[2018-06-05T19:01:49.500572154+08:00] Assigning addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.500590902+08:00] RequestAddress(LocalDefault/172.17.0.0/24, <nil>, map[])
DEBU[2018-06-05T19:01:49.502208544+08:00] Assigning addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.50392161+08:00] Programming external connectivity on endpoint 7ba712d6 (2fa3c9cfae3b1f592353836d0b231cb02de785394b7099abc25dc55ce650ebe2)
INFO[2018-06-05T19:01:49.504365396+08:00] start to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.504398369+08:00] close containerio backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.504416361+08:00] close containerio backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.504416251+08:00] start to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.504419613+08:00] Get a grpc client                             elapsed=1.864µs
INFO[2018-06-05T19:01:49.504436509+08:00] finished to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
INFO[2018-06-05T19:01:49.50443665+08:00] finished to subscribe io, backend: jsonfile, id: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.504428036+08:00] remove endpoint(2fa3c9cfae3b1f592353836d0b231cb02de785394b7099abc25dc55ce650ebe2) on network(bridge)
DEBU[2018-06-05T19:01:49.504546111+08:00] Revoking external connectivity on endpoint 7ba712d6 (2fa3c9cfae3b1f592353836d0b231cb02de785394b7099abc25dc55ce650ebe2)
INFO[2018-06-05T19:01:49.504768919+08:00] success to get image registry.hub.docker.com/library/busybox:latest, container id 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[2018-06-05T19:01:49.504796217+08:00] Get a grpc client                             elapsed=1.403µs
DEBU[0271] stat snapshot                                 key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[0271] get snapshot mounts                           key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
DEBU[0271] event published                               module="containerd/containers" ns=default topic="/containers/create" type=containerd.events.ContainerCreate
INFO[2018-06-05T19:01:49.507122935+08:00] success to new container: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf
DEBU[0271] get snapshot mounts                           key=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/snapshot"
INFO[0271] shim containerd-shim started                  address="/containerd-shim/default/7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf/shim.sock" debug=false module="containerd/tasks" pid=12128
DEBU[2018-06-05T19:01:49.548333004+08:00] Releasing addresses for endpoint 7ba712d6's interface on network bridge
DEBU[2018-06-05T19:01:49.548372478+08:00] ReleaseAddress(LocalDefault/172.17.0.0/24, 172.17.0.2)
DEBU[2018-06-05T19:01:49.549740869+08:00] receive event: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf exit
INFO[2018-06-05T19:01:49.549763892+08:00] handle event: 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf exit
INFO[0271] shim reaped                                   id=7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf module="containerd/tasks"
DEBU[0271] event published                               module="containerd/containers" ns=default topic="/containers/delete" type=containerd.events.ContainerDelete
ERRO[2018-06-05T19:01:49.666298334+08:00] failed to create new containerd container: failed to create task for container(7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf): OCI runtime create failed: container_linux.go:265: starting container process caused "process_linux.go:348: container init caused \"process_linux.go:331: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: time=\\\\\\\"2018-06-05T19:01:49+08:00\\\\\\\" level=fatal msg=\\\\\\\"no sandbox present for 7ba712d6cde9bd55871ccefa5e8ba9d2cca68052db6b933ef588482d7bb0a1cf\\\\\\\" \\\\n\\\"\"": unknown
DEBU[2018-06-05T19:01:49.666739141+08:00] remove endpoint() on network(bridge)

Ⅲ. Describe what you expected to happen

Ⅳ. How to reproduce it (as minimally and precisely as possible)

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

pouch version (use pouch version):
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

The text was updated successfully, but these errors were encountered:

HusterWan · 2018-06-10T13:09:49Z

@Letty5411 Hi, i have analysed the log you posted, the reason why your test case failed is the container exited immediately after run it.

create a containerd container have two part:

create a containerd container
run task

we have a hook to delete network endpoint and containerd container when the running container exit. so it will fail when use a deleted network sanbox to create a containerd task.

cc @allencloud

allencloud · 2018-06-11T07:12:28Z

Today @HusterWan and I discussed about it offline, and we have drawn a picture of it:

And we found that in the test case of restart, here we have a scenario that when a have ran a container, we have no idea about when the container exits. If the container exits immediately, then container would trigger the hook in pouchd, and pouchd would execute the function markStoppendAndRelease. However, if the execution time is durationA, if in this duration, pouchd accepts another request to stop or start the container, some unexpected situation would happen.

markStoppendAndRelease would excute after exit of container and release the container IO and release the container endpoint or related network items. And we must guarantee the atomic of markStoppendAndRelease.

allencloud · 2018-06-21T07:22:37Z

I think this has been fixed in #1540

pouchrobot added the kind/bug This is bug report for project label Jun 5, 2018

Letty5411 assigned allencloud Jun 5, 2018

allencloud mentioned this issue Jun 6, 2018

bugfix: copy data before put it into ringbuf #1471

Merged

HusterWan added this to the v0.5.1-milestone milestone Jun 7, 2018

allencloud mentioned this issue Jun 11, 2018

fix: lock the whole markStoppendAndRelease to be automic #1504

Closed

allencloud closed this as completed Jun 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] upgrade/restart container failed on AliOS4.9 #1470

[bug] upgrade/restart container failed on AliOS4.9 #1470

Letty5411 commented Jun 5, 2018

HusterWan commented Jun 10, 2018

allencloud commented Jun 11, 2018

allencloud commented Jun 21, 2018

[bug] upgrade/restart container failed on AliOS4.9 #1470

[bug] upgrade/restart container failed on AliOS4.9 #1470

Comments

Letty5411 commented Jun 5, 2018

Ⅰ. Issue Description

Ⅱ. Describe what happened

Ⅲ. Describe what you expected to happen

Ⅳ. How to reproduce it (as minimally and precisely as possible)

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

HusterWan commented Jun 10, 2018

allencloud commented Jun 11, 2018

allencloud commented Jun 21, 2018