kata cleanup is not completed when --rm is used #6222

Closed
snir911 opened this issue May 14, 2020 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@snir911

snir911 commented May 14, 2020

/kind bug

Description

When running kata-containers with --rm, qemu is not terminated after the
container finishes running.

Steps to reproduce the issue:

  1. Install kata

  2. sudo podman --runtime=/usr/bin/kata-runtime run --security-opt label=disable -it --rm fedora:latest sleep 1

  3. ps aux | grep qemu

Describe the results you received:

The qemu process is still running.

Describe the results you expected:

qemu should have been terminated.

Additional information you deem important (e.g. issue happens only occasionally):

This has been happening since the ContainerStateRemoving state was added
(25cc43c).

Output of podman version:

Version:            1.9.1 (+ upstream)
RemoteAPI Version:  1
Go Version:         go1.14.2
OS/Arch:            linux/amd64

kata-runtime: 1.11
qemu: 4.2.0

Output of podman info --debug:

debug:
  compiler: gc
  gitCommit: ""
  goVersion: go1.14.2
  podmanVersion: 1.9.1
host:
  arch: amd64
  buildahVersion: 1.14.8
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.15-1.fc32.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.15, commit: 33da5ef83bf2abc7965fc37980a49d02fdb71826'
  cpus: 2
  distribution:
    distribution: fedora
    version: "32"
  eventLogger: file
  hostname: kata-f32
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.6.8-300.fc32.x86_64
  memFree: 1201999872
  memTotal: 4118786048
  ociRuntime:
    name: runc
    package: runc-1.0.0-144.dev.gite6555cc.fc32.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc10+dev
      commit: fbdbaf85ecbc0e077f336c03062710435607dbf1
      spec: 1.0.1-dev
  os: linux
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.0.0-1.fc32.x86_64
    version: |-
      slirp4netns version 1.0.0
      commit: a3be729152a33e692cd28b52f664defbf2e7810a
      libslirp: 4.2.0
  swapFree: 0
  swapTotal: 0
  uptime: 191h 22m 42.55s (Approximately 7.96 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /home/test/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.0.0-1.fc32.x86_64
      Version: |-
        fusermount3 version: 3.9.1
        fuse-overlayfs: version 1.0.0
        FUSE library version 3.9.1
        using FUSE kernel interface version 7.31
  graphRoot: /home/test/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 13
  runRoot: /run/user/1000/containers
  volumePath: /home/test/.local/share/containers/storage/volumes
@openshift-ci-robot added the kind/bug label May 14, 2020
@snir911
Author

snir911 commented May 14, 2020

I was able to fix it with something like this:
https://github.com/snir911/libpod/tree/kata_cleanup

I'm not sure it's a proper fix, though. I found that the issue comes down to the order of the cleanup: qemu is not terminated whenever the runtime container cleanup (delete(ctx)) runs after the storage has been torn down. (Is that a valid order of operations?)

Any guidance will be appreciated :)

@rhatdan
Member

rhatdan commented May 14, 2020

I believe we are calling kata-runtime stop and kata-runtime kill when stopping and removing the container, and then we remove the storage. kata-runtime should be stopping the qemu process. Podman does not know anything special about kata versus crun or runc.
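
A minimal Go sketch of that uniform-interface idea; the interface and method names here are hypothetical, not libpod's actual API:

```go
package ocisketch

// OCIRuntime is a hypothetical sketch of the uniform abstraction described
// above: Podman drives every configured runtime binary (runc, crun,
// kata-runtime, ...) through the same lifecycle calls and knows nothing
// kata-specific. Names are illustrative, not libpod's real types.
type OCIRuntime interface {
	// KillContainer signals the container, e.g. by exec'ing "kata-runtime kill <id>".
	KillContainer(id string, signal uint) error
	// DeleteContainer removes the container from the runtime, e.g.
	// "kata-runtime delete <id>"; for kata this is where the qemu VM
	// would be expected to shut down.
	DeleteContainer(id string) error
}
```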

@mheon WDYT

@mheon
Member

mheon commented May 14, 2020

The suggested patch is completely unworkable; we cannot allow cleanup on running containers, as it would allow the storage of still-running containers to be unmounted.

It sounds like Kata really disagrees with us unmounting the storage first and only then removing the container from the runtime. Doing it the other way around seems like a reasonable ask from an OCI runtime, so I've added a patch (#6229) to remove the container from the runtime before unmounting its storage. If that's not enough to resolve things, this is definitely a Kata bug.
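
A minimal Go sketch of the reordering being described here, using hypothetical helper names rather than libpod's real functions:

```go
package ocisketch

// Hypothetical stand-ins for the libpod steps discussed in this thread;
// the real function names and signatures differ.
type container struct{ id string }

func (c *container) deleteFromOCIRuntime() error { return nil } // e.g. "kata-runtime delete <id>"
func (c *container) teardownStorage() error      { return nil } // unmount the container's rootfs

// removeSketch shows the order #6229 moves to: ask the OCI runtime to
// delete the container first (for kata, this is what shuts down the qemu
// VM) and only then unmount its storage. The pre-patch order tore down
// the storage first, which is what left qemu running.
func (c *container) removeSketch() error {
	if err := c.deleteFromOCIRuntime(); err != nil {
		return err
	}
	return c.teardownStorage()
}
```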

@snir911
Author

snir911 commented May 14, 2020

Does the Removing state mean the container might still be running? (In this patch I changed it so the state is set to Removing only if the container is not Running.)

#6229 won't help, as cleanupRuntime checks for the Stopped or Created state at the beginning, so it returns immediately because the state is Removing.

The problem seems to be with teardownStorage, not cleanupStorage. What is actually the difference?
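
A hedged Go sketch of the kind of early-return guard being described; the state names follow the discussion, but the surrounding code is a hypothetical approximation, not libpod's actual implementation:

```go
package ocisketch

type containerState int

const (
	stateCreated containerState = iota
	stateRunning
	stateStopped
	stateRemoving
)

// cleanupRuntimeSketch illustrates the short-circuit described above: if
// the guard only lets Stopped or Created containers through, a container
// that has already been flipped to Removing returns early, the OCI
// runtime delete never runs, and (with kata) the qemu VM is left behind.
func cleanupRuntimeSketch(state containerState, deleteFromRuntime func() error) error {
	if state != stateStopped && state != stateCreated {
		return nil // a Removing container bails out here
	}
	return deleteFromRuntime()
}
```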

@mheon
Member

mheon commented May 14, 2020

Removing means the container is in the process of being deleted. I'm really confused as to how we're getting into Removing while cleanup hasn't been completed, though; Removing should guarantee cleanup already ran...

@mheon
Member

mheon commented May 14, 2020

Never mind, I think I know what's going on here. If cleanupRuntime() happens as part of remove(), we are never in a good state to actually remove the container from the runtime if it still exists there (i.e., the container was in ContainerStateStopped).

@mheon
Member

mheon commented May 14, 2020

Pushed one more commit that might resolve this

@snir911
Author

snir911 commented May 14, 2020

Fixed by #6229.

@snir911 closed this as completed May 14, 2020
@github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023