
Cirrus: Update to F37 CI VM Images #1821

Merged: 4 commits into containers:main, Jan 18, 2023

Conversation

cevich (Member) commented Dec 6, 2022

No description provided.

cevich (Member Author) commented Dec 7, 2022

@mtrmac PTAL, this seems like a really odd error to get on an F36 -> F37 image update. Any idea what I might have screwed up in the image?

mtrmac (Contributor) commented Dec 7, 2022

Just to link things together, #1739 is failing similarly.

I don’t really have any idea; notably the later registry tests successfully use podman to run a registry container, so the configuration on the system can’t be that broken.

Purely as a lazy guess, something from

```bash
local runtime=
cgroup_type=$(stat -f -c %T /sys/fs/cgroup)
if [[ $cgroup_type == "tmpfs" ]]; then
    runtime="--runtime runc"
fi
# cgroup option necessary under podman-in-podman (CI tests),
# and doesn't seem to do any harm otherwise.
PODMAN="podman $runtime --cgroup-manager=cgroupfs"
```
might make a difference, but I don’t understand at all how that could be related.

cevich (Member Author) commented Dec 7, 2022

> Just to link things together, #1739 is failing similarly.

Good, so at least we know it's not related at all to the package removal changes in @lsm5's PR. The issue must be due to an updated package. That diff will be smaller in @lsm5's PR than in this one, but it could still be large. I'm thinking the best approach is probably to go hands-on with get_ci_vm.sh and reproduce it, then start pressing random buttons until it either explodes or starts working 😁

cevich (Member Author) commented Dec 7, 2022

> Purely as a lazy guess, something from

Hmmm, so I wonder...maybe the difference is in the container image, not the VM image.

cevich (Member Author) commented Dec 9, 2022

Force-push: newer images. Not expecting a different result.

cevich (Member Author) commented Dec 13, 2022

Force-push: newer images. Not expecting a different result.

mtrmac (Contributor) commented Dec 13, 2022

(FWIW, I realize that this is important work that will inevitably need to happen. At this point I’m prioritizing the sigstore feature work, given upcoming deadlines; let me know if that’s not the right judgement call.)

cevich (Member Author) commented Dec 13, 2022

> At this point I’m prioritizing the sigstore feature work

Please do, and ignore this for now. I think I know what the next step[*] is to move this and @lsm5's PR forward. I'll ping you (likely next year) if I need help.

[*] Use get_ci_vm.sh to reproduce the failure and try to expose what changed in the environment to cause it, paying particular attention to runc/crun stuff.

cevich (Member Author) commented Jan 10, 2023

Now this is weird... when I run the system tests manually via `get_ci_vm.sh Skopeo Test`, they all pass. The only significant difference I'm aware of is that I did not first run the integration tests. I'll play with this setup more to see if maybe the integration tests are somehow breaking the system tests...

cevich (Member Author) commented Jan 10, 2023

...I cannot reproduce the problem manually on a get_ci_vm.sh VM. The tests pass no matter how or in what order (int. vs. sys.) I run them. This is bringing back some memories now; IIRC Lokesh mentioned this fact to me last year about his PR. Maybe I'll have some luck using the 're-run w/ terminal' button...

cevich (Member Author) commented Jan 10, 2023

...random idea: updated to images w/ older kernels (minus the multiple-port bindings bug).

cevich (Member Author) commented Jan 13, 2023

Hmmm, so that change simply kicked the can down the road. It def. seems like the setup is wonky, but I cannot explain why running this stuff manually works perfectly fine. That's what has me really puzzled.

cevich force-pushed the F37_update branch 3 times, most recently from 718a3d4 to 0e94474 on January 16, 2023 16:51
cevich (Member Author) commented Jan 16, 2023

@lsm5 @mtrmac uh huh, well, I don't know how or why 0e94474 did it, but the tests are fixed now 😁

P.S. The commit message I wrote came directly out of the hole which I normally sit upon. It's a complete guess - I was never able to reproduce the test failures manually 😕

mtrmac (Contributor) commented Jan 17, 2023

It’s weird that `podman system reset` reports `A storage.conf file exists at /usr/share/containers/storage.conf`, but then `podman run` seems to complain that the config file doesn’t contain the standard options.

It’s also weird that the output of `podman run … /bin/true` does not contain the error messages that were plaguing us before on the first run of a `podman run` command. Does either of `--runtime runc` (if used) or `--cgroup-manager=cgroupfs` make a difference in this behavior?


Either way… if there isn’t a way to investigate further, and this works, let’s merge.

Just, please, explicitly write down (in the runner.sh comments and the commit message) that we don’t know why this works / is necessary, and in one of the places document what failures this is working around. Documenting the hypothesis is useful as well, but to me the important thing is that in 3 years we will know not to treat that hypothesis as a certainty.

cevich (Member Author) commented Jan 17, 2023

> It’s weird that podman system reset reports

To be clear, it's not the `system reset` that's new; that's always been there. The thing that "fixed" this issue is the `podman run ... true`. That suggests something in the database or other runtime state is mucking with the tests, likely by changing the output (which the system tests are particularly sensitive to).

> Either way… if there isn’t a way to investigate further, and this works, let’s merge.

I tried but couldn't reproduce the problem using hack/get_ci_vm.sh; the tests all pass w/o error. I also tried running the tests in a different order, which similarly did not seem to help - so test-order isn't to blame 😕

> Just, please, explicitly write down

Good suggestion, thanks.

mtrmac (Contributor) commented Jan 17, 2023

> it's not the `system reset` that's new, that's always been there.

True; I meant that it seems to be saying something about the configuration, as interpreted at that point.

mtrmac (Contributor) left a comment:

Thanks! LGTM.

cevich (Member Author) commented Jan 17, 2023

I took a quick stroll through the system-test code-path. I believe, due to legacy reasons, the system-tests modify the host's container storage configuration. IIRC, the system-tests in particular are very sensitive to unexpected output from podman. If I had to make a guess, the problem is fixed (by running a container) because that prevents certain warning messages from being printed at test-time. It's a wild guess though.
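Roughly, the warm-up pattern being described looks like the sketch below; the image name is a placeholder and the exact flags used by the CI runner script may differ.

```bash
# Sketch of the warm-up idea: reset podman's state, then run a throwaway
# command in a container once, so that any first-run warning messages
# (storage.conf hints, graph-root defaults, etc.) are emitted here rather
# than inside a test whose stdout/stderr is checked verbatim.
podman system reset --force
podman run --rm "$REGISTRY_IMAGE" /bin/true   # $REGISTRY_IMAGE is a placeholder
```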

cevich marked this pull request as ready for review on January 17, 2023 15:19
These are already present in the VM images.  These instructions only
cause the DNF cache to be refreshed, wasting precious developer time.

Signed-off-by: Chris Evich <[email protected]>
For whatever reasons, the podman configuration in CI results in the
inspect test throwing the following error:

```
not ok 4 inspect: image manifest list w/ diff platform
125
configuration is unset - using hardcoded default graph root
\"/var/lib/containers/storage\""
configuration is unset - using hardcoded default graph root
\"/var/lib/containers/storage\""
StoreOptions
```

Fix this by not using `podman`. It's unnecessary, since all the test
needs is the golang-flavor of the current system's architecture name.
That can easily be obtained by asking the go tool directly.

Signed-off-by: Chris Evich <[email protected]>
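
For illustration, one way to ask the go tool for that value directly (a sketch; the test's actual invocation may differ):

```bash
# GOARCH is Go's name for the current architecture (e.g. amd64, arm64),
# matching the platform names used in image manifest lists.
arch=$(go env GOARCH)
echo "$arch"
```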
This is necessary, since running the skopeo tests modifies the host
environment.  This can result in some warning messages the first time
a container is started.  These messages can interfere with tests which
are sensitive to stdout/stderr.  Since many/most tests require a local
image registry, launch it with `/bin/true` after doing a system reset
to clear away any pesky warning messages.

Signed-off-by: Chris Evich <[email protected]>
cevich (Member Author) commented Jan 18, 2023

Force-push: Rebased.

mtrmac merged commit b51eb21 into containers:main on Jan 18, 2023
github-actions bot locked this conversation as resolved and limited it to collaborators on Sep 17, 2023