
Slow performance depending on the volume used by the podman machine #19467

Closed
slemeur opened this issue Aug 1, 2023 · 14 comments · Fixed by #19709
Labels
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)
  • machine
  • podman-desktop
  • remote (Problem is in podman-remote)

Comments

@slemeur

slemeur commented Aug 1, 2023

Issue Description

When using Podman Desktop, I'm getting a strange behavior where the podman service running in the podman machine becomes unresponsive.

Please see the corresponding Podman Desktop issue:
Podman Desktop Issue

Checking in the podman machine, I can see that the podman service is consuming a lot of CPU. Podman Desktop is probably still sending requests while podman is already overloaded. When I stop Podman Desktop, the podman service returns to normal CPU usage.

Maybe this is due to the time podman takes to calculate something that Podman Desktop constantly requests - since there is no cache, the repeated requests overload podman and it stops responding.

Steps to reproduce the issue

I'm not entirely sure how to reproduce the issue. I think it appears after I have been using podman save and podman load, which create a large tar file.

On the other hand:

  • When I mount my $HOME folder into the machine ($HOME:$HOME), I hit the issue
  • When I create an empty folder and mount it into the podman machine, I don't get the issue
  • When I move the tar file from podman save into an empty folder and mount it into the podman machine, I don't get the issue

Describe the results you received

Slow performance

Describe the results you expected

Podman stays responsive

podman info output

host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.57
    systemPercent: 0.25
    userPercent: 0.18
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: coreos
    version: "38"
  eventLogger: journald
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 501
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.3.12-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 7743348736
  memTotal: 8113209344
  networkBackend: netavark
  networkBackendInfo:
    backend: ""
    dns: {}
  ociRuntime:
    name: crun
    package: crun-1.8.5-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/501/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/user/501/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 1h 50m 3.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 0
    stopped: 7
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 8531062784
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 9
  runRoot: /run/user/501/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.1
  Built: 1685123928
  BuiltTime: Fri May 26 19:58:48 2023
  GitCommit: ""
  GoVersion: go1.20.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.1

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Mac OS Ventura Intel

Additional information

Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting

@slemeur slemeur added the kind/bug Categorizes issue or PR as related to a bug. label Aug 1, 2023
@github-actions github-actions bot added the remote Problem is in podman-remote label Aug 1, 2023
@Luap99
Member

Luap99 commented Aug 1, 2023

@vrothberg @ashley-cui Can one of you take a look and try to reproduce?
If you podman machine ssh, you can observe high CPU usage for the system service. It may only be triggered when running podman-desktop, as this keeps the service process up due to the constant event listening.
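
For context, the kind of long-lived request that keeps the system service alive is an event stream held open over the API socket. Below is a minimal sketch of such a listener (mine, for illustration only, not Podman Desktop's code); the socket path is taken from the podman info output above, and the libpod events endpoint path is an assumption.

```go
// Minimal sketch of a long-lived events listener over the podman API socket.
// An open streaming connection like this keeps the service process running.
package main

import (
	"bufio"
	"context"
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Assumed socket path; taken from the remoteSocket entry in the podman info above.
	const socket = "/run/user/501/podman/podman.sock"

	client := &http.Client{
		Transport: &http.Transport{
			// Dial the unix socket instead of a TCP host.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	// The host in the URL is ignored; the endpoint path is an assumption about the libpod events API.
	resp, err := client.Get("http://d/libpod/events")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Events arrive as a stream of JSON lines; the connection stays open indefinitely.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```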

@vrothberg
Member

Sorry, I am currently flooded.

@rhatdan
Member

rhatdan commented Aug 1, 2023

Perhaps this is podman machine attempting to figure out the disk space used in a volume, which would use lots of CPU and hammer the disk as it walks the directory tree?

If podman desktop is asking for the size of volumes or for container modifications on top of images, this could cause Podman to get very active.
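
To illustrate why this is expensive: any du(1)-style size calculation comes down to stat'ing every file under a directory tree, so the cost scales with the number of files rather than the reported size. A minimal sketch of such a walk (mine, for illustration; not Podman's actual code):

```go
// Sum the apparent size of every regular file under a directory,
// the way a du(1)-like size calculation does.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize walks root and stats every entry; one stat per file is the expensive part.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	root := os.Args[1]
	size, err := dirSize(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes\n", root, size)
}
```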

@baude
Member

baude commented Aug 2, 2023

I looked at this; hold tight. My observation is that, in the case where it is slow, the service is being flooded with requests, including a lot of df (expensive!) requests. The Podman Desktop folks are looking at that and I will keep an eye on it.

@vrothberg
Member

Out of curiosity, is there an update on this issue?

@benoitf
Contributor

benoitf commented Aug 9, 2023

I can reproduce using only podman on macOS.

And there is a huge difference between rootless and rootful.

pre-reqs:

podman machine init
podman machine start
podman machine ssh

then if I run 5 containers like

podman run -d docker.io/redis/redis-stack:latest
podman run -d docker.io/redis/redis-stack:latest
podman run -d docker.io/redis/redis-stack:latest
podman run -d docker.io/redis/redis-stack:latest
podman run -d docker.io/redis/redis-stack:latest

and run the df command

[core@localhost ~]$ time podman system df
TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
Images         1           1           527.2MB     0B (0%)
Containers     5           5           9.578MB     0B (0%)
Local Volumes  0           0           0B          0B (0%)

real	0m0.083s
user	0m0.061s
sys	0m0.000s

so, very fast

now if I switch to rootful

sudo podman run -d docker.io/redis/redis-stack:latest
sudo podman run -d docker.io/redis/redis-stack:latest
sudo podman run -d docker.io/redis/redis-stack:latest
sudo podman run -d docker.io/redis/redis-stack:latest
sudo podman run -d docker.io/redis/redis-stack:latest
[core@localhost ~]$ time sudo podman system df
TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
Images         1           1           527.2MB     0B (0%)
Containers     5           5           9.578MB     0B (0%)
Local Volumes  0           0           0B          0B (0%)

real	0m6.397s
user	0m2.324s
sys	0m3.663s

then it's like 7s

@vrothberg
Member

vrothberg commented Aug 10, 2023

Thanks @benoitf! I can even reproduce on Fedora with Podman v4.6.

Very similar numbers: 0.071s rootless vs 7.562s root

@vrothberg
Member

vrothberg commented Aug 10, 2023

With an instrumented binary:

podman (fix-19467) $ sudo ./bin/podman system df
[sudo] password for vrothberg:                  
ERRO[0000] image 1.666118ms                     
ERRO[0007] containers 7.470431037s              
ERRO[0007] volumes 114.444µs                    

Processing containers seems to be the bottleneck:
https://github.com/containers/podman/blob/main/pkg/domain/infra/abi/system.go#L284-L324
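
The instrumented output above looks like each stage of system df is simply wrapped in a timer and logged. A minimal sketch of that kind of instrumentation (the collect* functions are hypothetical stand-ins for the real steps in pkg/domain/infra/abi/system.go, not the actual code):

```go
// Time each stage of a "system df"-like operation and log the duration,
// mirroring the ERRO[...] lines shown above.
package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

// timed runs fn and logs how long it took, at error level so it stands out.
func timed(name string, fn func()) {
	start := time.Now()
	fn()
	logrus.Errorf("%s %v", name, time.Since(start))
}

func main() {
	timed("image", collectImages)
	timed("containers", collectContainers)
	timed("volumes", collectVolumes)
}

// Hypothetical placeholders for the real per-resource collection steps.
func collectImages()     {}
func collectContainers() {}
func collectVolumes()    {}
```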


@vrothberg
Member

@giuseppe @nalind can you check what's going on?

@veillard

I did an strace of the df run in the podman machine on Stevan's machine. Scanning 2GB under /var/lib/containers/storage/ is where the time is spent: each file gets a newfstatat(), which sounds normal, then 2 lgetxattr security.capability and 2 llistxattr security.selinux calls, which is a bit weird to me. ltrace shows malloc()/free() and opendir()/dirfd()/closedir(), which all looks fine, but with bursts of SIGURG signals across the 3 processes involved, so that could be something weird in the way the processes communicate with each other:
....
6550 --- SIGURG (Urgent I/O condition) ---
6548 --- SIGURG (Urgent I/O condition) ---
6548 --- SIGURG (Urgent I/O condition) ---
6552 --- SIGURG (Urgent I/O condition) ---
6552 --- SIGURG (Urgent I/O condition) ---
6552 --- SIGURG (Urgent I/O condition) ---
6552 --- SIGURG (Urgent I/O condition) ---
6548 --- SIGURG (Urgent I/O condition) ---
6552 --- SIGURG (Urgent I/O condition) ---
....
Also, the df clearly takes a lock, which blocks parallel operations. On one hand that's sensible, because running those du(1)-like operations should return the same result and running several in parallel would lead to thrashing; but at the same time this is read-only, so serializing them is useless, and it would obviously benefit from caching.

Daniel

@veillard

Also, while the df is running in the podman machine, top on the host only shows qemu using more CPU than normal.

@giuseppe
Member

rootless doesn't use naive diff since metacopy up is not supported for overlay in a user namespace.

We could probably drop the naive diff in any case and just look at the directory size:

diff --git a/drivers/overlay/overlay.go b/drivers/overlay/overlay.go
index 6b6f20637..7e91dd161 100644
--- a/drivers/overlay/overlay.go
+++ b/drivers/overlay/overlay.go
@@ -2162,10 +2162,6 @@ func (d *Driver) getLowerDiffPaths(id string) ([]string, error) {
 // and its parent and returns the size in bytes of the changes
 // relative to its base filesystem directory.
 func (d *Driver) DiffSize(id string, idMappings *idtools.IDMappings, parent string, parentMappings *idtools.IDMappings, mountLabel string) (size int64, err error) {
-       if d.options.mountProgram == "" && (d.useNaiveDiff() || !d.isParent(id, parent)) {
-               return d.naiveDiff.DiffSize(id, idMappings, parent, parentMappings, mountLabel)
-       }
-
        p, err := d.getDiffPath(id)
        if err != nil {
                return 0, err

@vrothberg
Member

We could probably drop the naive diff in any case and just look at the directory size:

Yes, that does the trick. At the moment, that drops it from 19.3s to 0.11s on my machine. A mind-blowing difference.

I think it's OK to just look at the directory size; at least for this use case.

vrothberg added a commit to vrothberg/storage that referenced this issue Aug 14, 2023
Computing the diff size for the rootful overlay storage driver used the
naive diff.  The reasoning was that we have made use of rootful copy-up.
The downside is a mind-blowing performance penalty in comparison to the
rootless case.

Hence, drop the naive diff and only compute the size of the directory
which is absolutely sufficient for the motivating use case of
podman-system-df.  This drops the execution of system-df from 19.3s to
0.11s listing 5 containers and 1 image.

Fixes: github.com/containers/podman/issues/19467
Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

I opened containers/storage#1688.

vrothberg added a commit to vrothberg/libpod that referenced this issue Aug 23, 2023
Mainly to merge fixes for containers#19467 into the main branch.

Fixes: containers#19467
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Aug 23, 2023
The performance issue in containers#19467 drove me to add a benchmark for
system-df to avoid regressing on it in the future.

Comparing current HEAD to v4.6.0 yields

```
/home/vrothberg/containers/podman/bin/podman system df ran
201.47 times faster than /usr/bin/podman system df
```

Signed-off-by: Valentin Rothberg <[email protected]>
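
The comparison quoted above resembles hyperfine-style output. A rough, back-of-the-envelope sketch (not the benchmark added in the PR) of measuring the same ratio by hand, assuming a locally built ./bin/podman next to the packaged /usr/bin/podman:

```go
// Time two podman binaries running "system df" and print the speedup ratio.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// timeCommand runs "<bin> system df" once and returns its wall-clock duration.
func timeCommand(bin string) time.Duration {
	start := time.Now()
	cmd := exec.Command(bin, "system", "df")
	if err := cmd.Run(); err != nil {
		panic(err)
	}
	return time.Since(start)
}

func main() {
	patched := timeCommand("./bin/podman")     // assumed: locally built binary with the fix
	baseline := timeCommand("/usr/bin/podman") // assumed: packaged v4.6.0 binary

	fmt.Printf("./bin/podman system df ran\n%.2f times faster than /usr/bin/podman system df\n",
		float64(baseline)/float64(patched))
}
```

A single run per binary is a crude measurement; a real benchmark would repeat the runs and discard warm-up effects, which is presumably what the added benchmark does.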
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2023