Frequent setup/cleanup failures in tests #1591

Closed
martinpitt opened this issue Feb 29, 2024 · 0 comments · Fixed by #1598
martinpitt commented Feb 29, 2024

We keep getting failures due to our super-complicated setUp/tearDown/restore_dir() actions. A few days ago I tried commits 857ed44 and ab8108e, but they still didn't help enough.

E.g. yesterday/today we got failures in containers/podman#21868 (log, log) due to

```
cp: cannot stat '/home/admin/.local/share/containers/storage/overlay/compat475205185': No such file or directory
Traceback (most recent call last):
  File "/var/ARTIFACTS/work-podman-userpjbprz5f/plans/cockpit-podman/podman-user/discover/default-0/tests/test/check-application", line 145, in setUp
    self.restore_dir("/home/admin/.local/share/containers")
  File "/var/ARTIFACTS/work-podman-userpjbprz5f/plans/cockpit-podman/podman-user/discover/default-0/tests/test/common/testlib.py", line 2050, in restore_dir
    exe(f"mkdir -p {self.vm_tmpdir}; cp -a {path}/ {backup}/")
```

The same thing happened in containers/podman#21778 (log), and here.

I asked in containers/podman#21592 (comment), but we are doing way too much here. I have an idea for how to simplify and robustify this; let's see if it works out.

martinpitt added the flake unstable test label Feb 29, 2024
martinpitt self-assigned this Feb 29, 2024
martinpitt moved this to new in Pilot tasks Feb 29, 2024
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024
martinpitt changed the title from "Frequent cleanup failures in tests" to "Frequent setup/cleanup failures in tests" Feb 29, 2024
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024
Our previous approach of `restore_dir("/var/lib/containers")` and the
find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591.

Give up on this, and move to a model that centers around `podman system reset`.
This works reasonably well, except that it is slow (podman#21874) and
leaks conmon (TODO). Keep these hacks.

Load our static test images with `podman save/load` instead. Also factor out
the system and user cleanup, so that we do the same thing on both.

Fixes cockpit-project#1591
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Feb 29, 2024
Our previous approach of `restore_dir("/var/lib/containers")` and the
find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591.

Give up on this, and move to a model that centers around `podman system reset`.
This works reasonably well, except that it is slow (podman#21874) and
leaks conmon (see next commit).

Load our static test images with `podman save/load` instead. Also factor out
the system and user cleanup, so that we do the same thing on both.

Fixes cockpit-project#1591
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Mar 1, 2024
The `restore_dir()` for podman's data directory is highly problematic:
it interferes with btrfs subvolumes and overlayfs mounts, and often
causes `cp` failures like

```
cp: cannot stat '/home/admin/.local/share/containers/storage/overlay/compat3876082856': No such file or directory
```

So move to `podman system reset`, and restore the test images
with `podman load` for each test.

Unfortunately `podman system reset` defaults to the 10 s wait timeout
(containers/podman#21874), so we still need
the separate `rm --time 0` hack. But conceptually that can go away once
that bug is fixed.

This approach would also be nice on the system podman side, but it is super
hard to get right there, especially on CoreOS: there we simultaneously want a
thorough cleanup, but also rely on the running cockpit/ws container. It also
collides with the "force unmount everything below /var/lib/containers" hack
that we unfortunately still need for some OSes. But doing it for the user at
least solves half of the problem. The observed failures in the field
all occurred on the user directory, anyway.

Fixes cockpit-project#1591
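
For illustration only, the per-test user cleanup described in this commit message could be sketched roughly as follows. This is not the code from the actual commit: the function names, the archive path, and the choice to chain the commands with `&&` are assumptions made for this example. It only builds the command lines (the `rm --time 0` workaround, `podman system reset`, then `podman load`), so it can be inspected without a podman installation.

```python
import shlex


def save_image_command(image: str, archive: str) -> list[str]:
    """One-time preparation: export a static test image to an archive file.

    Both `image` and `archive` are hypothetical names chosen by the caller.
    """
    return ["podman", "save", "--output", archive, image]


def user_cleanup_commands(archive: str) -> list[list[str]]:
    """Commands for one per-test reset of the user's podman storage."""
    return [
        # Workaround for containers/podman#21874: remove all containers with a
        # zero stop timeout first, so the reset does not wait 10 s per container.
        ["podman", "rm", "--force", "--all", "--time", "0"],
        # Wipe the user's container storage instead of copying it back with cp.
        ["podman", "system", "reset", "--force"],
        # Restore the static test images from the previously saved archive.
        ["podman", "load", "--input", archive],
    ]


def as_shell(commands: list[list[str]]) -> str:
    """Join the commands into one shell line that stops at the first failure."""
    return " && ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(as_shell(user_cleanup_commands("/var/tmp/test-images.tar")))
```

The resulting string could then be handed to whatever helper the test suite uses to run shell commands as the test user. Once containers/podman#21874 is fixed, the first command in the list should conceptually become unnecessary.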
martinpitt added a commit that referenced this issue Mar 1, 2024
Fixes #1591
@github-project-automation github-project-automation bot moved this from new to easy in Pilot tasks Mar 1, 2024