Frequent setup/cleanup failures in tests #1591
Labels: flake, unstable test
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024
martinpitt changed the title from "Frequent cleanup failures in tests" to "Frequent setup/cleanup failures in tests" on Feb 29, 2024
martinpitt added four more commits to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024:

Our previous approach of `restore_dir("/var/lib/containers")` and the find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591. Give up on this, and move to a model that centers around `podman system reset`. This works reasonably well, except that it is slow (podman#21874) and leaks conmon processes (TODO); keep these hacks for now. Load our static test images with `podman save/load` instead. Also factorize the system and user cleanup, so that we do the same thing for both. Fixes cockpit-project#1591
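Roughly, such a reset-centric cleanup could look like the sketch below. This is a minimal illustration in plain Python/subprocess, not the actual cockpit-podman test code; the tarball path and the `run()` helper are made up for the example:

```python
import subprocess
from typing import Optional

# hypothetical pre-saved archive of the static test images
TEST_IMAGES_TAR = "/var/lib/test-images.tar"

def reset_podman(user: Optional[str] = None) -> None:
    """Wipe all podman state, then restore the static test images.

    Called identically for system podman (user=None) and for user
    podman, so both cleanup paths do exactly the same thing.
    """
    def run(*argv: str) -> None:
        cmd = list(argv)
        if user:
            # run inside the user's session instead of as root
            cmd = ["sudo", "-i", "-u", user] + cmd
        subprocess.run(cmd, check=True)

    # --time 0 skips the default 10 s SIGTERM grace period that
    # `podman system reset` alone would wait for (see podman#21874)
    run("podman", "rm", "--all", "--force", "--time", "0")
    run("podman", "system", "reset", "--force")
    # bring back the static images that every test expects
    run("podman", "load", "-i", TEST_IMAGES_TAR)
```

The point of funneling both the system and the user case through one helper is exactly the "factorize" step from the commit message: a single code path, so a fix for one side cannot drift away from the other.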
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024:

Our previous approach of `restore_dir("/var/lib/containers")` and the find/unmount/kill hacks around it keep causing trouble, see cockpit-project#1591. Give up on this, and move to a model that centers around `podman system reset`. This works reasonably well, except that it is slow (podman#21874) and leaks conmon (see next commit). Load our static test images with `podman save/load` instead. Also factorize the system and user cleanup, so that we do the same thing for both. Fixes cockpit-project#1591
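The `podman save`/`podman load` part round-trips the images through a tarball instead of snapshotting the storage directory. A sketch of the idea; the image names and path are illustrative, not the real test images:

```python
import subprocess

# One-time preparation, while the images are still present: write all
# static test images into a single multi-image archive.
subprocess.run(["podman", "save", "--multi-image-archive",
                "-o", "/var/lib/test-images.tar",
                "localhost/test-busybox", "localhost/test-registry"],
               check=True)

# Per-test restore, after `podman system reset` has wiped everything:
subprocess.run(["podman", "load", "-i", "/var/lib/test-images.tar"],
               check=True)
```

Unlike copying `/var/lib/containers` around, this goes through podman's own API, so it cannot race against live overlay mounts.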
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Feb 29, 2024
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue on Mar 1, 2024:

The `restore_dir()` for podman's data directory is highly problematic: it interferes with btrfs subvolumes and overlayfs mounts, and often causes `cp` failures like

```
cp: cannot stat '/home/admin/.local/share/containers/storage/overlay/compat3876082856': No such file or directory
```

So move to `podman system reset`, and restore the test images with `podman load` for each test. Unfortunately `podman system reset` defaults to the 10 s wait timeout (containers/podman#21874), so we still need the separate `rm --time 0` hack. But conceptually that can go away once that bug is fixed.

This approach would also be nice on the system podman side, but it is super hard to get right there, especially on CoreOS: there we simultaneously want a thorough cleanup, but also rely on the running cockpit/ws container. It also collides with the "force unmount everything below /var/lib/containers" hack that we unfortunately still need for some OSes. But doing it for the user at least solves half of the problem, and the observed failures in the field all occurred on the user directory anyway.

Fixes cockpit-project#1591
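For contrast, the abandoned directory-snapshot approach worked roughly like the sketch below (an illustration, not the actual `restore_dir()` implementation from cockpit's test library; the paths are made up), which shows where the `cp: cannot stat` race comes from:

```python
import subprocess

STORAGE = "/home/admin/.local/share/containers"  # illustrative path
BACKUP = "/tmp/containers-backup"

def snapshot_storage() -> None:
    # taken once during test setup, while podman is (hopefully) idle
    subprocess.run(["cp", "-a", STORAGE, BACKUP], check=True)

def restore_storage() -> None:
    # Racy: if any podman or conmon process still holds overlay mounts
    # below STORAGE, files appear and disappear while cp walks the
    # tree ("cp: cannot stat ..."), and btrfs subvolumes or overlayfs
    # mounts cannot be copied back faithfully at all.
    subprocess.run(["rm", "-rf", STORAGE], check=True)
    subprocess.run(["cp", "-a", BACKUP, STORAGE], check=True)
```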
martinpitt added a commit that referenced this issue on Mar 1, 2024
Original report:

We keep getting failures due to our super-complicated setUp/tearDown/restore_dir() actions. A few days ago I tried commits 857ed44 and ab8108e, but that still didn't help enough. E.g. yesterday/today we got failures in containers/podman#21868 (log, log); the same thing happened in containers/podman#21778 (log), and here. I asked in containers/podman#21592 (comment), but we are doing way too much here. I have an idea how to simplify/robustify this; let's see if it works out.