Frequent setup/cleanup failures in tests #1591

martinpitt · 2024-02-29T07:54:48Z

We keep getting failures due to our super-complicated setUp/tearDown/restore_dir() actions. A few days ago I tried commits 857ed44 and ab8108e, but they still didn't help enough.

E.g. yesterday/today we got failures in containers/podman#21868 (log, log) due to

cp: cannot stat '/home/admin/.local/share/containers/storage/overlay/compat475205185': No such file or directory
Traceback (most recent call last):
  File "/var/ARTIFACTS/work-podman-userpjbprz5f/plans/cockpit-podman/podman-user/discover/default-0/tests/test/check-application", line 145, in setUp
    self.restore_dir("/home/admin/.local/share/containers")
  File "/var/ARTIFACTS/work-podman-userpjbprz5f/plans/cockpit-podman/podman-user/discover/default-0/tests/test/common/testlib.py", line 2050, in restore_dir
    exe(f"mkdir -p {self.vm_tmpdir}; cp -a {path}/ {backup}/")

Same thing happened in containers/podman#21778 log, or here.

I asked in containers/podman#21592 (comment), but we are doing way too much here. I have an idea for how to simplify/robustify this; let's see if it works out.



Our previous approach of `restore_dir("/var/lib/containers")` and the find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591. Give up on this, and move to a model that centers around `podman system reset`. This works reasonably well, except that it is slow (podman#21874) and leaks conmon (TODO). Keep these hacks. Load our static test images with `podman save/load` instead. Also factorize system and user cleanup, so that we do the same thing on both. Fixes cockpit-project#1591

Our previous approach of `restore_dir("/var/lib/containers")` and the find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591. Give up on this, and move to a model that centers around `podman system reset`. This works reasonably well, except that it is slow (podman#21874) and leaks conmon (see next commit). Load our static test images with `podman save/load` instead. Also factorize system and user cleanup, so that we do the same thing on both. Fixes cockpit-project#1591
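The `podman save/load` idea with one shared code path for system and user podman could look roughly like the sketch below. This is only an illustration, not the code from the commit: the `Executor` callable, the helper names, the archive directory, and the example image name are all hypothetical. The only podman calls used are `podman save -o` and `podman load -i`.

```python
import shlex
from typing import Callable, Iterable, List

# Hypothetical abstraction: something that runs one shell command on the
# test machine, either as root (system podman) or as the admin user.
Executor = Callable[[str], None]


def archive_path(image: str, directory: str = "/var/lib/test-images") -> str:
    """Map an image name to a tar file path (naming scheme is an assumption)."""
    safe = image.replace("/", "_").replace(":", "_")
    return f"{directory}/{safe}.tar"


def save_images(execute: Executor, images: Iterable[str]) -> List[str]:
    """Save each static test image to its own archive, once, before the tests."""
    archives = []
    for image in images:
        path = archive_path(image)
        execute(f"podman save -o {shlex.quote(path)} {shlex.quote(image)}")
        archives.append(path)
    return archives


def load_images(execute: Executor, archives: Iterable[str]) -> None:
    """Restore the saved images; the same function serves system and user."""
    for path in archives:
        execute(f"podman load -i {shlex.quote(path)}")
```

Because both helpers only depend on the executor that is passed in, the system and user sides run exactly the same commands and differ only in who executes them.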

The `restore_dir()` for podman's data directory is highly problematic: This interferes with btrfs subvolumes and overlayfs mounts, and often causes `cp` failures like

```
cp: cannot stat '/home/admin/.local/share/containers/storage/overlay/compat3876082856': No such file or directory
```

So move to `podman system reset`, and restore the test images with `podman load` for each test.

Unfortunately `podman system reset` defaults to the 10 s wait timeout (containers/podman#21874), so we still need the separate `rm --time 0` hack. But conceptually that can go away once that bug is fixed.

This approach would also be nice on the system podman side, but it is super hard to get right there, especially on CoreOS: There we simultaneously want a thorough cleanup, but also rely on the running cockpit/ws container. It also collides with the "force unmount everything below /var/lib/containers" hack that we unfortunately still need for some OSes. But doing it for the user at least solves half of the problem. The observed failures in the field all occurred on the user directory, anyway.

Fixes cockpit-project#1591
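The per-test user cleanup described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual test code: `reset_user_podman` and the `execute` callable (which is assumed to run a shell command as the test user) are hypothetical names, and the archive paths are whatever `podman save` produced earlier.

```python
import shlex
from typing import Callable, Sequence


def reset_user_podman(execute: Callable[[str], None],
                      archives: Sequence[str]) -> None:
    """Bring the user's podman storage back to a known state for one test."""
    # Workaround for containers/podman#21874: remove all containers with a
    # zero stop timeout first, so the reset below does not wait 10 s each.
    execute("podman rm --all --force --time 0")
    # Wipe the user's podman storage instead of copying the directory back.
    execute("podman system reset --force")
    # Restore the static test images from previously saved archives.
    for path in archives:
        execute(f"podman load -i {shlex.quote(path)}")
```

The order matters in this sketch: the `rm --time 0` step only helps if it runs before the reset, and the images have to be loaded last because the reset removes them.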


martinpitt added the flake unstable test label Feb 29, 2024

martinpitt self-assigned this Feb 29, 2024

martinpitt added this to Pilot tasks Feb 29, 2024

martinpitt moved this to new in Pilot tasks Feb 29, 2024

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024

test: Replace setup/teardown

52ce036

Fixes cockpit-project#1591

martinpitt mentioned this issue Feb 29, 2024

test: Replace setup/teardown #1592

Closed

martinpitt changed the title ~~Frequent cleanup failures in tests~~ Frequent setup/cleanup failures in tests Feb 29, 2024

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024

test: Replace setup/teardown

26a9299

Fixes cockpit-project#1591

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024

test: Replace setup/teardown

fb00cb8

Fixes cockpit-project#1591

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024

test: Replace setup/teardown

b5a6468

Fixes cockpit-project#1591

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024

test: Replace setup/teardown

c32b6d1

Fixes cockpit-project#1591

martinpitt mentioned this issue Feb 29, 2024

podman system reset speedup: Imply --time 0 or provide option for it containers/podman#21874

Closed

martinpitt mentioned this issue Mar 1, 2024

test: Move from restore_dir() to podman system reset for user #1598

Merged

martinpitt closed this as completed in #1598 Mar 1, 2024

github-project-automation bot moved this from new to easy in Pilot tasks Mar 1, 2024
