Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman hangs when sending large amounts of data on stdout #13779

Open
allisonkarlitskaya opened this issue Apr 5, 2022 · 36 comments
Open

podman hangs when sending large amounts of data on stdout #13779

allisonkarlitskaya opened this issue Apr 5, 2022 · 36 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@allisonkarlitskaya
Copy link

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When trying to pipe large amounts of data via stdout, podman often hangs. There was a similar issue #9183 about that before and apparently it was fixed in containers/conmon#236 but I still see an extremely similar issue here, a year later, on new versions.

Steps to reproduce the issue:

The is more or less 100% reliable hang for me:

podman run --rm docker.io/library/node:latest tar c / >/dev/null

If you want, you can also run it through pv to see how much data was written. It tends to stop after 3.23MiB for me:

bash-5.1$ podman run --rm docker.io/library/node:latest tar c / | pv >/dev/null
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
3.23MiB 0:00:14 [0.00 B/s] [<=>                                               ]

Describe the results you received:

Both of those hang.

I'm pretty sure this isn't caused by tar getting stuck on reading a file, since, for example podman run --rm docker.io/library/node:latest tar cf /dev/null / >/dev/null works fine and exits quickly.

Note as well that tar isn't running anymore by the time of the hang:

...
 716226 ?        Ssl    0:04  \_ /usr/libexec/gnome-terminal-server
 716276 pts/3    Ss     0:00      \_ bash
 717560 pts/3    Sl+    0:00      |   \_ podman run --rm docker.io/library/node:latest tar c /
 717574 pts/3    Z      0:00      |       \_ [slirp4netns] <defunct>
...

The hang can be interrupted by pressing ^C. When you do that, you immediately see this message:

^C2022-04-05T13:24:41.000713801Z: open pidfd: No such process

followed exactly 5 seconds later by this:

ERRO[0094] No exit file for container 60fd89b9fad660d41b7d2d0f00c27b732aa5c77beb7cfc65db905a25ffaa332b found: timed out waiting for file /run/user/1001/libpod/tmp/exits/60fd89b9fad660d41b7d2d0f00c27b732aa5c77beb7cfc65db905a25ffaa332b: internal libpod error 
ERRO[0094] forwarding signal 2 to container 60fd89b9fad660d41b7d2d0f00c27b732aa5c77beb7cfc65db905a25ffaa332b: error sending signal to container 60fd89b9fad660d41b7d2d0f00c27b732aa5c77beb7cfc65db905a25ffaa332b: `/usr/bin/crun kill 60fd89b9fad660d41b7d2d0f00c27b732aa5c77beb7cfc65db905a25ffaa332b 2` failed: exit status 1 

...followed by an exit with status 130.

Describe the results you expected:

Not hanging.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Client:       Podman Engine
Version:      4.0.0-rc4
API Version:  4.0.0-rc4
Go Version:   go1.18beta2

Built:      Fri Feb 11 15:51:09 2022
OS/Arch:    linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.24.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    variant: silverblue
    version: "36"
  eventLogger: journald
  hostname: x1
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.17.0-0.rc7.116.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 13072580608
  memTotal: 33350156288
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.2-2.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.2
      commit: f6fbc8f840df1a414f31a60953ae514fa497c748
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 8000892928
  swapTotal: 8589930496
  uptime: 165h 18m 1.58s (Approximately 6.88 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /var/home/lis/.config/containers/storage.conf
  containerStore:
    number: 43
    paused: 0
    running: 20
    stopped: 23
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/lis/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 8
  runRoot: /run/user/1001/containers
  volumePath: /var/home/lis/.local/share/containers/storage/volumes
version:
  APIVersion: 4.0.0-rc4
  Built: 1644591069
  BuiltTime: Fri Feb 11 15:51:09 2022
  GitCommit: ""
  GoVersion: go1.18beta2
  OsArch: linux/amd64
  Version: 4.0.0-rc4

Package info (e.g. output of rpm -q podman or apt list podman):

podman-4.0.0-0.6.rc4.fc36.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

Default install of Silverblue 36, standard issue RH ThinkPad X1 Gen 9.

Since this is potentially a bug in conmon again, the relevant versions:

bash-5.1$ conmon --version
conmon version 2.1.0
commit: 
bash-5.1$ rpm -q conmon
conmon-2.1.0-2.fc36.x86_64
@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 5, 2022
@Luap99
Copy link
Member

Luap99 commented Apr 5, 2022

I can reproduce. Conmon crashes with SIGSEGV.
I see the following in strace output:

24733 write(2, "[conmon:w]: Unknown error -14 Bad file descriptor\n", 50) = 50
24733 close(18)                         = 0
24733 write(2, "[conmon:w]: write_journald failed\n", 34) = 34
...
24733 sendmsg(19, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=30, msg_iov=[{iov_base="MESSAGE=\350\253\341\376\377\213\0\0\0\37\204\0\0\0\0\0\221\1\0\0\0\0\0\0I\33\0\0\0\0\0\0G\320*\274\362O\200k`\232\264\1\0\0\0\0`\232\264\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=851}, {iov_base="\n", iov_len=1}, {iov_base="CONTAINER_ID_FULL=6d3a03f10916fa47418398fdc88b1d023c57bb93557027ea1d89063f08671473", iov_len=82}, {iov_base="\n", iov_len=1}, {iov_base="PRIORITY=6", iov_len=10}, {iov_base="\n", iov_len=1}, {iov_base="CONTAINER_ID=6d3a03f10916", iov_len=25}, {iov_base="\n", iov_len=1}, {iov_base="CONTAINER_NAME=strange_curie", iov_len=28}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=strange_curie", iov_len=31}, {iov_base="\n", iov_len=1}, {iov_base="CODE_FILE=src/ctr_logging.c", iov_len=27}, {iov_base="\n", iov_len=1}, {iov_base="CODE_LINE=323", iov_len=13}, {iov_base="\n", iov_len=1}, {iov_base="CODE_FUNC=write_journald", iov_len=24}, {iov_base="\n", iov_len=1}], msg_iovlen=18, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1100
24733 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x1b5f000} ---
...
24733 +++ killed by SIGSEGV (core dumped) +++

It works correctly with --log-driver k8s-file instead of the journald driver.

@mheon @haircommander PTAL

@allisonkarlitskaya
Copy link
Author

Adding --log-driver=none also works, and is enough to unblock what I'm working on. Thanks for the help!

@mheon
Copy link
Member

mheon commented Apr 5, 2022

Did journald terminate our connection because it felt we were spamming the logs? I was suspecting it might be a log pressure thing, denying the write because journald felt it was running out of cache or wasn't able to keep up with writes to disk, but the bad file descriptor errno makes it seem a lot more like the connection was entirely closed?

@mheon
Copy link
Member

mheon commented Apr 5, 2022

Bit of googling leads me to https://docs.openshift.com/container-platform/4.7/logging/config/cluster-logging-systemd.html

Most notable passage: "We recommend setting RateLimitIntervalSec=30s and RateLimitBurst=10000 (or even higher if necessary) to prevent the journal from losing entries."

I think the journal is rate-limiting us. We should probably respond to this by applying backpressure on the logs (refusing to read from the TTY and letting it fill, while we wait for journald to start accepting our logs again. Exact details on the "wait for journald to start accepting logs again" will need to be figured out - do we just keep retrying until it succeeds? How do we tell if journald is rate-limiting us - it could just be broken (user SIGKILL'd the process, for example).

@allisonkarlitskaya
Copy link
Author

Why are we spamming the journal with stdout by default in the first place? I have to admit that this comes as a huge surprise to me. I can kinda understand stderr (at least as a default — systemd itself does the same), but it's quite reasonable (as in this case) to imagine that stdout is bulk binary data.

There are also significant security/privacy implications here if we end up logging unhashed passwords, for example...

@Luap99
Copy link
Member

Luap99 commented Apr 6, 2022

The default log driver is journald, by default all stdout/stderr will be captured. You can change it to file but again eveything will be captured.

This is required to make podman logs work for example. You can use --log-driver=none but this will break podman logs.

There are also significant security/privacy implications here if we end up logging unhashed passwords, for example...

No application should ever log sensitive data!!! Podman just stores the output. It is no different than running your application directly in a systemd unit for example where systemd would store the output in the journal.

@allisonkarlitskaya
Copy link
Author

No application should ever log sensitive data!!! Podman just stores the output. It is no different than running your application directly in a systemd unit for example where systemd would store the output in the journal.

My point is that plenty of applications might reasonably output sensitive data to the tty or stdout, be it either writing a password or some kind of other authentication token to the screen, or private messages from a secure messaging app that the user expected to be ephemeral. This logging (by podman) happens even in --interactive mode. Logging these things by default and requiring an opt-out is a pretty big violation of expectation.

@rhatdan
Copy link
Member

rhatdan commented Apr 6, 2022

Well this is the default for Kubernetes, Podman and Docker. As well as systemd unit files.

lack added a commit to lack/cnf-features-deploy that referenced this issue Apr 8, 2022
As described in containers/podman#13779,
podman by default logs all of a container's stdout in order to retain
info to make 'podman logs' work. But this extra copy of stdout can be
overrun and cause hangs when shipping large amounts of data through
stdout (as in the case when piping out a tar stream).  Changing the
instructions and documentation to add '--log-driver=none' to the podman
commandline disables this extra unneeded logging and prevents hangs.

Signed-off-by: Jim Ramsay <[email protected]>
@lack
Copy link

lack commented Apr 8, 2022

I'd vote for changing the default for --interactive mode to --log-driver=none, and allowing opt-in to interactive-mode logging by setting it to some other value.

lack added a commit to lack/cnf-features-deploy that referenced this issue Apr 8, 2022
As described in containers/podman#13779,
podman by default logs all of a container's stdout in order to retain
info to make 'podman logs' work. But this extra copy of stdout can be
overrun and cause hangs when shipping large amounts of data through
stdout (as in the case when piping out a tar stream).  Changing the
instructions and documentation to add '--log-driver=none' to the podman
commandline disables this extra unneeded logging and prevents hangs.

Signed-off-by: Jim Ramsay <[email protected]>
(cherry picked from commit a2124e0)
@rlanting
Copy link

rlanting commented Apr 9, 2022

Today I ran into the Error: timed out waiting for file error. But not in the same way as the original reporter. Feel free to split this comment into another issue if it has a different root cause. This is on rootless podman 4.0.1 on debian with cni.

Steps to reproduce

  • podman run --rm --env ALLOW_EMPTY_PASSWORD=yes --name=mysql-test quay.io/bitnami/mysql:8.0
  • Wait for [Server] /opt/bitnami/mysql/bin/mysqld: ready for connections. to appear at the end of the log
  • Open another shell because the container will stay in the foreground so you can see when it's ready
  • podman exec -it mysql-test /bin/bash
  • mysql
  • show status;
  • Cleanup: podman kill mysql-test

Result

Error: timed out waiting for file /home/username/.local/share/containers/storage/overlay-containers/<sha256 1>/userdata/<sha256 2/exit/<sha256 1>: internal libpod error

Expected

Full output and not getting kicked out of the container.

Additional information

Comments here seem to indicate that setting --log-driver=none (or k8s-file) for podman-run will workaround the issue. I could not reproduce that with this setup.

For quicker reproduction during testing, you can use podman exec -t mysql-test mysql -e "show status;" as command. It will cut off the output. Last output line should contain Uptime_since_flush_status

@Luap99
Copy link
Member

Luap99 commented Apr 11, 2022

@rlanting Your problem should be fixed in the latest release: #13227

SchSeba pushed a commit to SchSeba/cnf-features-deploy that referenced this issue May 1, 2022
As described in containers/podman#13779,
podman by default logs all of a container's stdout in order to retain
info to make 'podman logs' work. But this extra copy of stdout can be
overrun and cause hangs when shipping large amounts of data through
stdout (as in the case when piping out a tar stream).  Changing the
instructions and documentation to add '--log-driver=none' to the podman
commandline disables this extra unneeded logging and prevents hangs.

Signed-off-by: Jim Ramsay <[email protected]>
nishant-parekh pushed a commit to nishant-parekh/cnf-features-deploy that referenced this issue May 10, 2022
As described in containers/podman#13779,
podman by default logs all of a container's stdout in order to retain
info to make 'podman logs' work. But this extra copy of stdout can be
overrun and cause hangs when shipping large amounts of data through
stdout (as in the case when piping out a tar stream).  Changing the
instructions and documentation to add '--log-driver=none' to the podman
commandline disables this extra unneeded logging and prevents hangs.

Signed-off-by: Jim Ramsay <[email protected]>
Signed-off-by: Nishant Parekh <[email protected]>
@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented May 12, 2022

I believe this is fixed in the podman 4.1

@rhatdan rhatdan closed this as completed May 12, 2022
@Luap99
Copy link
Member

Luap99 commented May 13, 2022

This is not fixed,
a) conmon segfault when send large amount of data to journald
b) podman hangs forever when conmon crashes

@Luap99 Luap99 reopened this May 13, 2022
@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Jul 15, 2022

@Luap99 @haircommander Do we know if this is fixed in conmon-rs?

@haircommander
Copy link
Collaborator

I don't know! though, it would be sweet if it was

@Luap99
Copy link
Member

Luap99 commented Jul 15, 2022

Yeah unless it is integrated into podman we cannot really test it.

Although there is also still the podman issue, itjust hangs when conmon died. I know @vrothberg reworked the wait logic recently so maybe this is fixed now?

@vrothberg
Copy link
Member

Which logic are you referring to @Luap99?

@Luap99
Copy link
Member

Luap99 commented Jul 15, 2022

You exit code work, I know there were changes to how we wait for the container exit as well in there.
Run the reproducer and see if podman run just hangs once conmon crashes.

@Luap99
Copy link
Member

Luap99 commented Jul 15, 2022

I just tried it on main and it no longer hangs:

ERRO[0136] Waiting for container bf2c2b81327a30a4486ca110662bfe3d9f4275292673f5ef6df4d45c90af2a4f: getting exit code of container bf2c2b81327a30a4486ca110662bfe3d9f4275292673f5ef6df4d45c90af2a4f from DB: no such exit code 
ERRO[0136] Removing container bf2c2b81327a30a4486ca110662bfe3d9f4275292673f5ef6df4d45c90af2a4f: cannot remove container bf2c2b81327a30a4486ca110662bfe3d9f4275292673f5ef6df4d45c90af2a4f as it is running - running or paused containers cannot be removed without force: container state improper 

@vrothberg
Copy link
Member

Nice catch connecting the dots, @Luap99 !

@boagg
Copy link

boagg commented Jul 29, 2022

Hello,

It seems that we are impacted by this bug. Can we know when the new release will come out (with the correction of this bug?)

As well as the commit associated with this correction?

Thanks in advance,

@mheon
Copy link
Member

mheon commented Jul 29, 2022

The fix will be in Podman 4.2. It was resolved by #14685

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

ffromani pushed a commit to ffromani/cnf-features-deploy that referenced this issue Sep 22, 2022
As described in containers/podman#13779,
podman by default logs all of a container's stdout in order to retain
info to make 'podman logs' work. But this extra copy of stdout can be
overrun and cause hangs when shipping large amounts of data through
stdout (as in the case when piping out a tar stream).  Changing the
instructions and documentation to add '--log-driver=none' to the podman
commandline disables this extra unneeded logging and prevents hangs.

Signed-off-by: Jim Ramsay <[email protected]>
@yveszoundi
Copy link

yveszoundi commented Jan 7, 2023

I've been experiencing this issue lately. The problem "seems" to vanish after removing any --rm flag.

What I am doing

I'm building a set of Live CD components for both amd64 and aarch64. There's a lot of output and my builds take 3+ hours locally (aarch64 emulation, rust compile times do not help, custom kernel, etc.).

How the issue manifests itself

When the builds take a long time (especially with aarch64 emulation), podman becomes sometimes unresponsive. I typically wait for 10 minutes to confirm, as qemu emulation is really slow for aarch64.

What seems to work around the problem

I "think" that I have successfully mitigated the issue by removing the --rm flag from podman run commands. My "wild guess" is that there's a race condition, failure to process notifications or flush buffers.

What I have tried without success

  • Use the latest 4.3.1 build provided by podman-static and hope for the best
  • Switch between crun and runc
  • Switch between my Ubuntu version of conmon and conmon 2.1.5 from podman-static
  • Switch between my Ubuntu version of fuse-overlayfs and the version from podman-static
  • Set the --log-driver flag to none wherever possible, as suggested in some comments

@MartinX3
Copy link

MartinX3 commented Mar 28, 2023

Still an issue with 4.4.3 with a kube yaml with a bash script as entrypoint to compile aosp.

Compile aosp has much logging output to the terminal.
The container just "freezes".
I can still access it with exec, but the processes just consume 0% CPU.
The logs output hangs of course since there is 0% CPU usage.

@thiago-scherrer
Copy link

thiago-scherrer commented May 22, 2023

Same here :/
I tried both options, but neither worked: --log-driver=none or --log-driver=k8s-file

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 20 | head -n 9999
...
Error: timed out waiting for file /home/<my_user_here>/.local/share/containers/storage/overlay-containers/xxxxx/userdata/xxxxx/exit/xxxxxx: internal libpod error
[~]$ (Here, I am dropped from inside the container to the terminal of my machine.)
$ podman version
Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.3
Built:        Wed Dec 31 21:00:00 1969
OS/Arch:      linux/amd64
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy
$ dpkg -l podman
||/ Nome           Versão             Arquitectura Descrição
+++-==============-==================-============-==========================================
ii  podman         3.4.4+ds1-1ubuntu1 amd64        engine to run OCI-based containers in Pods

@thiago-scherrer
Copy link

thiago-scherrer commented Jun 27, 2023

I've been experiencing this issue lately. The problem "seems" to vanish after removing any --rm flag.

@yveszoundi I am not using the --rm option: podman exec -it xxxx zsh

@thiago-scherrer
Copy link

The fix will be in Podman 4.2. It was resolved by #14685

@mheon I am using podman version 3.4.4.

@MartinX3
Copy link

I fixed it by redirecting the log output into a file in my entrypoint script.
(Using podman 4.5.1)

@rhatdan
Copy link
Member

rhatdan commented Jul 30, 2023

Is this still an issue, lost in the issue flod?

@thiago-scherrer
Copy link

Is this still an issue, lost in the issue flod?

@rhatdan I still have this problem :/

@Luap99
Copy link
Member

Luap99 commented Jul 31, 2023

This is not really a podman problem, the issue is the conmon crashes with segfault so it has to be fixed in conmon.

@alangalvino
Copy link

I still have this problem, any news?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests