Filling up /run/libpod/socket #3962

Closed
yangm97 opened this issue Sep 6, 2019 · 13 comments · Fixed by #4009
Labels: kind/bug, locked - please file new issue/PR

Comments

yangm97 (Contributor) commented Sep 6, 2019

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Steps to reproduce the issue:

  1. launch a container containing a healthcheck instruction (see the sketch below)
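
(Not from the original report, just a hedged reproduction sketch: the base image, healthcheck command, and sleep duration are assumptions; any container whose healthcheck fires on a short interval should behave the same. Note that healthchecks are scheduled via systemd timers, so this assumes a systemd host running as root, like the reporter's setup.)

# build a throwaway image with a short-interval healthcheck
cat > Dockerfile <<'EOF'
FROM alpine
HEALTHCHECK --interval=3s --timeout=3s CMD ["/bin/true"]
CMD ["sleep", "3600"]
EOF
# docker image format is needed for HEALTHCHECK to be preserved
podman build --format docker -t healthcheck-repro .
podman run -d --name healthcheck-repro healthcheck-repro
# then watch /run/libpod/socket fill up while the container stays running
ls /run/libpod/socket | wc -l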

Describe the results you received:
It appears that socket files from health checks aren't being cleaned up.

Describe the results you expected:
/run should not fill up.

Additional information you deem important (e.g. issue happens only occasionally):
Still reproducible across reboots, podman system prunes, etc.

Output of podman version:

Version:            1.5.1
RemoteAPI Version:  1
Go Version:         go1.10.4
OS/Arch:            linux/arm

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.10.4
  podman version: 1.5.1
host:
  BuildahVersion: 1.10.1
  Conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.0, commit: unknown'
  Distribution:
    distribution: debian
    version: "9"
  MemFree: 119107584
  MemTotal: 516104192
  OCIRuntime:
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 87924736
  SwapTotal: 258048000
  arch: arm
  cpus: 4
  eventlogger: journald
  hostname: hubot
  kernel: 4.19.62-sunxi
  os: linux
  rootless: false
  uptime: 17h 50m 12.09s (Approximately 0.71 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 9
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 12
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

podman/bionic,now 1.5.1-1~ubuntu18.04~ppa1 armhf [installed]

Additional environment details (AWS, VirtualBox, physical, etc.):
Orange Pi Zero running armbian stretch (I know)

lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch
uname -a
Linux hubot 4.19.62-sunxi #5.92 SMP Wed Jul 31 22:07:23 CEST 2019 armv7l GNU/Linux
openshift-ci-robot added the kind/bug label Sep 6, 2019
mheon (Member) commented Sep 6, 2019

This sounds like Conmon control FIFOs and attach sockets - but those should be deleted once the container exits.

Are you running Podman as root, or without? Can you do an ls of the directory, so we can confirm what files are accumulating?

yangm97 (Contributor, Author) commented Sep 6, 2019

> This sounds like Conmon control FIFOs and attach sockets - but those should be deleted once the container exits.

All containers were up; none of them exited during the period when files were piling up. This is a fresh install that has never been upgraded, but stopping all containers and running podman system renumber as per #3436 appears to have calmed it down for now.

> Are you running Podman as root, or without? Can you do an ls of the directory, so we can confirm what files are accumulating?

I'm running as root. I did an ls; there were so many files that it blew up my tmux scrollback, but I guess you don't need the full output anyway, so here's a fragment of it: https://pastebin.com/bKHPVKMe

yangm97 (Contributor, Author) commented Sep 7, 2019

It is doing that again. As far as I remember, all I've done since then is some pull, rm, and run. Also, this happens:

# podman stats --all
Error: unable to obtain cgroup stats: parse 1313181801506 from /sys/fs/cgroup/cpuacct/machine.slice/libpod-2aa8180a744434fc4d8682a1842ebff6095bb2657d0274d0656480a124893a1f.scope/cpuacct.usage: strconv.ParseUint: parsing "1313181801506": value out of range

yangm97 (Contributor, Author) commented Sep 9, 2019

I think I've been able to isolate it a little further. There's one container in my stack which has a healthcheck instruction, and these socket files start piling up exactly after it has started.

mheon (Member) commented Sep 9, 2019

Probably artifacts from exec then - @haircommander those ought to be cleaned up automatically, I believe?

yangm97 (Contributor, Author) commented Sep 9, 2019

The healthcheck instruction looks like this:

HEALTHCHECK --interval=3s --timeout=3s CMD ["/usr/local/bin/healthchecker"]

It's currently failing, but for reasons unrelated to docker/podman.

mheon (Member) commented Sep 9, 2019

It makes sense how easily it's filling the tmpfs - we're getting... I think 3? separate files per exec session, and for some reason they are not being cleaned up. Add to that an exec every 3 seconds and we rapidly begin to accumulate files.
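
(Rough arithmetic on that estimate: 3 files per exec session and one exec every 3 seconds is about one leaked file per second, or on the order of 86,000 files per day, which would explain the tmpfs filling up quickly. A crude way to watch the rate, using the directory from the issue title and an arbitrary interval:)

while true; do
  # print a timestamp and the current number of entries in the leaking directory
  printf '%s %s\n' "$(date +%T)" "$(ls /run/libpod/socket | wc -l)"
  sleep 60
done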

haircommander (Collaborator) commented

Yeah, the exec artifacts should be removed after an exec session. Can you exec that command in the container and let me know if an error along the lines of Error removing exec session... pops up?

mheon (Member) commented Sep 9, 2019

There should also be logs available in podman inspect $CID related to the healthchecks
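
(For example, something along these lines should pull out just the healthcheck state. This is a sketch, not verified against this Podman version: the .State.Healthcheck field name is an assumption and may be .State.Health on other releases, so adjust if the template errors out.)

# print only the healthcheck status, failing streak, and recent log entries
podman inspect --format '{{json .State.Healthcheck}}' erlradio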

yangm97 (Contributor, Author) commented Sep 9, 2019

podman inspect shows healthchecks as failing (as expected)

# podman healthcheck run erlradio
unhealthy
# podman exec erlradio /usr/local/bin/healthchecker
Error: exec failed: container_linux.go:346: starting container process caused "exec format error": OCI runtime error

mheon (Member) commented Sep 9, 2019

Interesting.

My assumption here would be that podman exec probably isn't cleaning up properly after an error, which is the original bug report here; on top of that, we have a bug causing a 100% failure rate for this type of exec, which leads to massive numbers of files being strewn across the system.

yangm97 (Contributor, Author) commented Sep 9, 2019

Looks like fixing my broken healthchecker did not have an effect on this issue.

mheon (Member) commented Sep 12, 2019

Investigating further - going to try to get a fix into 1.6.0

baude added a commit to baude/podman that referenced this issue Sep 12, 2019
when executing a healthcheck, we were not cleaning up after exec's use
of a socket.  we now remove the socket file and ignore the error if, for
some reason, it does not exist.

Fixes: containers#3962

Signed-off-by: baude <[email protected]>