Filling up /run/libpod/socket #3962

Closed
yangm97 opened this issue Sep 6, 2019 · 13 comments · Fixed by #4009
Labels: kind/bug, locked - please file new issue/PR

Comments

yangm97 (Contributor) commented Sep 6, 2019

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Steps to reproduce the issue:

  1. launch a container containing a healthcheck instruction (see the sketch below)
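
(Not from the original report, just a hedged reproduction sketch: the base image, healthcheck command, and sleep duration are assumptions; any container whose healthcheck fires on a short interval should behave the same. Note that healthchecks are scheduled via systemd timers, so this assumes a systemd host running as root, like the reporter's setup.)

# build a throwaway image with a short-interval healthcheck
cat > Dockerfile <<'EOF'
FROM alpine
HEALTHCHECK --interval=3s --timeout=3s CMD ["/bin/true"]
CMD ["sleep", "3600"]
EOF
# docker image format is needed for HEALTHCHECK to be preserved
podman build --format docker -t healthcheck-repro .
podman run -d --name healthcheck-repro healthcheck-repro
# then watch /run/libpod/socket fill up while the container stays running
ls /run/libpod/socket | wc -l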

Describe the results you received:
It appears that socket files from health checks aren't being cleaned up.

Describe the results you expected:
/run should not fill up.

Additional information you deem important (e.g. issue happens only occasionally):
Still reproducible across reboots, podman system prunes, etc.

Output of podman version:

Version:            1.5.1
RemoteAPI Version:  1
Go Version:         go1.10.4
OS/Arch:            linux/arm

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.10.4
  podman version: 1.5.1
host:
  BuildahVersion: 1.10.1
  Conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.0, commit: unknown'
  Distribution:
    distribution: debian
    version: "9"
  MemFree: 119107584
  MemTotal: 516104192
  OCIRuntime:
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 87924736
  SwapTotal: 258048000
  arch: arm
  cpus: 4
  eventlogger: journald
  hostname: hubot
  kernel: 4.19.62-sunxi
  os: linux
  rootless: false
  uptime: 17h 50m 12.09s (Approximately 0.71 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 9
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 12
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

podman/bionic,now 1.5.1-1~ubuntu18.04~ppa1 armhf [installed]

Additional environment details (AWS, VirtualBox, physical, etc.):
Orange Pi Zero running armbian stretch (I know)

lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch
uname -a
Linux hubot 4.19.62-sunxi #5.92 SMP Wed Jul 31 22:07:23 CEST 2019 armv7l GNU/Linux
openshift-ci-robot added the kind/bug label Sep 6, 2019
mheon (Member) commented Sep 6, 2019

This sounds like Conmon control FIFOs and attach sockets - but those should be deleted once the container exits.

Are you running Podman as root, or without? Can you do an ls of the directory, so we can confirm what files are accumulating?

yangm97 (Contributor, Author) commented Sep 6, 2019

> This sounds like Conmon control FIFOs and attach sockets - but those should be deleted once the container exits.

All containers were up; none of them exited during the period when files were piling up. This is a fresh install that has never been upgraded, but stopping all containers and running podman system renumber as per #3436 appears to have calmed it down for now.

> Are you running Podman as root, or without? Can you do an ls of the directory, so we can confirm what files are accumulating?

I'm running as root. I did an ls; there were so many files that it blew up my tmux scrollback, but I guess you don't need the full output anyway, so here's a fragment of it: https://pastebin.com/bKHPVKMe

yangm97 (Contributor, Author) commented Sep 7, 2019

It is doing that again. As far as I remember, all I've done since then is some pull, rm, and run. Also, this happens:

# podman stats --all
Error: unable to obtain cgroup stats: parse 1313181801506 from /sys/fs/cgroup/cpuacct/machine.slice/libpod-2aa8180a744434fc4d8682a1842ebff6095bb2657d0274d0656480a124893a1f.scope/cpuacct.usage: strconv.ParseUint: parsing "1313181801506": value out of range

yangm97 (Contributor, Author) commented Sep 9, 2019

I think I've been able to isolate it a little further. There's one container in my stack which has a healthcheck instruction, and these socket files start piling up exactly after it has started.

mheon (Member) commented Sep 9, 2019

Probably artifacts from exec then - @haircommander those ought to be cleaned up automatically, I believe?

yangm97 (Contributor, Author) commented Sep 9, 2019

The healthcheck instruction looks like this:

HEALTHCHECK --interval=3s --timeout=3s CMD ["/usr/local/bin/healthchecker"]

It's currently failing, but for reasons unrelated to docker/podman.

mheon (Member) commented Sep 9, 2019

It makes sense how easily it's filling the tmpfs - we're getting... I think 3? separate files per exec session, and for some reason they are not being cleaned up. Add to that an exec every 3 seconds and we rapidly begin to accumulate files.
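
(Rough arithmetic on that estimate: 3 files per exec session and one exec every 3 seconds is about one leaked file per second, or on the order of 86,000 files per day, which would explain the tmpfs filling up quickly. A crude way to watch the rate, using the directory from the issue title and an arbitrary interval:)

while true; do
  # print a timestamp and the current number of entries in the leaking directory
  printf '%s %s\n' "$(date +%T)" "$(ls /run/libpod/socket | wc -l)"
  sleep 60
done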

haircommander (Collaborator) commented

Yeah, the exec artifacts should be removed after an exec session. Can you exec that command in the container and let me know if an error along the lines of Error removing exec session... pops up?

mheon (Member) commented Sep 9, 2019

There should also be logs available in podman inspect $CID related to the healthchecks
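
(For example, something along these lines should pull out just the healthcheck state. This is a sketch, not verified against this Podman version: the .State.Healthcheck field name is an assumption and may be .State.Health on other releases, so adjust if the template errors out.)

# print only the healthcheck status, failing streak, and recent log entries
podman inspect --format '{{json .State.Healthcheck}}' erlradio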

yangm97 (Contributor, Author) commented Sep 9, 2019

podman inspect shows healthchecks as failing (as expected)

# podman healthcheck run erlradio
unhealthy
# podman exec erlradio /usr/local/bin/healthchecker
Error: exec failed: container_linux.go:346: starting container process caused "exec format error": OCI runtime error

mheon (Member) commented Sep 9, 2019

Interesting.

My assumption here would be that podman exec probably isn't cleaning up properly after an error, which is the original bug report here; on top of that, we have a bug causing a 100% failure rate for this type of exec, which leads to massive numbers of files being strewn across the system.

yangm97 (Contributor, Author) commented Sep 9, 2019

Looks like fixing my broken healthchecker did not have an effect on this issue.

mheon (Member) commented Sep 12, 2019

Investigating further - going to try to get a fix into 1.6.0

baude added a commit to baude/podman that referenced this issue Sep 12, 2019
when executing a healthcheck, we were not cleaning up after exec's use
of a socket.  we now remove the socket file and ignore the error if, for
some reason, it does not exist.

Fixes: containers#3962

Signed-off-by: baude <[email protected]>