
CI: runc is not being tested #14833

Closed
edsantiago opened this issue Jul 5, 2022 · 13 comments · Fixed by #14972

@edsantiago
Member

Root cause, as best I can tell, is containers/automation_images#115. This was supposed to make Ubuntu use runc, but it didn't: Ubuntu is still using crun. Confirmation: #13376 brought in those new VM images, but the Ubuntu sys log clearly shows:

[+0006s] # Arch:amd64 OS:ubuntu21.10 Runtime:crun Rootless:false Events:journald Logdriver:journald Cgroups:v2+systemd Net:cni
                                             ^^^^

The solution is going to require a VM dance, plus some way to actually log in to Ubuntu to confirm, so it's not something I can do. It will have to wait for @cevich's return.

@cevich
Member

cevich commented Jul 11, 2022

(I've not made it through all my GitHub mail yet.) It looks like you successfully pulled off the VM Image Build Dance. You can confirm the fix by opening a podman PR that updates the VM image ID(s) (remember to update the AWS AMI ID as well). Though note, you'll very likely run smack into the issues facing

However, the problems in both of those also desperately need fixing anyway, so this isn't a completely bad thing.
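
For reference, that kind of PR usually boils down to bumping the image-identifier values near the top of .cirrus.yml. A rough sketch of the mechanical part, with purely illustrative IDs (the real values come from the automation_images build output):

    # Hypothetical: point .cirrus.yml at the newly built VM images...
    sed -i 's/c1234567890123456/c6543210987654321/g' .cirrus.yml
    # ...and don't forget the corresponding AWS AMI identifier(s) defined nearby.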

@cevich
Member

cevich commented Jul 11, 2022

@edsantiago to keep things simple-ish, let's do this: Close #14397 (Lokesh's PR) in favor of #14719. Then, update buildah#4074 and #14719 to use the images you built in containers/automation_images#146.

That will focus all the problems in two places instead of having you open up yet another PR for new images only to run into the exact same/similar problems.

@edsantiago
Member Author

@cevich we cannot do that. criu is completely broken. There is an F35 build, but it hasn't reached stable yet. And I have no idea where things stand on Ubuntu.

@cevich
Member

cevich commented Jul 11, 2022

Okay, though we will need to update the CI images at some point, so when CRIU is ready, that PR-wrangling option (above) could make things a bit simpler.

For CRIU in Ubuntu, you've got to talk to Adrian about it and/or open an issue in the upstream repo. I won't lie, it's a major PITA and in the past has taken weeks. Though this can sometimes also be the case for Fedora. I'm sorry, I really don't have any easy solutions here besides disabling testing of one form or another. But that kind of just kicks the can down the road and opens up a good chance that new problems will be introduced in the meantime 😢

@cevich
Member

cevich commented Jul 11, 2022

Thinking: Would it help at all to make a separate issue for CRIU?

In the past, getting a runc update into Ubuntu has been really, really hard. It's one of the reasons we always turned to Lokesh's kubik OBS repo for a custom one, but I believe he very much wants to get away from that.

One idea could be a CI strategy change: Do runc testing in "prior-fedora" (F35), and let F36 and Ubuntu use their native crun+CGroupsV2 setup. Something to discuss with the team.

@edsantiago
Member Author

There is a criu issue: checkpoint-restore/criu#1935 (it is referenced in my auto-images PR. Too many confusing places to track).

@cevich
Member

cevich commented Jul 11, 2022

There is a criu issue

Excellent. Yep, you just gotta be the "squeaky wheel" sometimes to get things done. As you probably realize, this will take a while to move through the "prior-fedora" repository workflows and be released. However, if you want to speed that up, there's no shame in fudging our image-build scripts to temporarily dnf install -y $PRERELEASE_URL for F35.
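
A minimal sketch of what that fudge might look like in the F35 cache-image package-install script ($PRERELEASE_URL here is a stand-in for whatever koji/bodhi build URL applies; it is not an existing variable in the scripts):

    # TEMPORARY: pull the pre-release criu build directly until the fix reaches stable.
    PRERELEASE_URL='https://example.invalid/criu-3.17.1-2.fc35.x86_64.rpm'  # placeholder URL
    dnf install -y "$PRERELEASE_URL"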

Too many confusing places to track

OMG, I've had to start grouping and color-coding my browser tabs ☹️

edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 12, 2022
We're still not testing runc in CI (containers#14833), and it may be weeks
or months before we can, due to criu/glibc nightmare, but one day
we'll be back on track, then later on we'll update VMs again,
and screw it up, and lose runc, and not notice, and RHEL will
break, and oh noes headless chicken again, repeat repeat.

We can do better. Use .cirrus.yml to explicitly define which
VMs should use which runtimes, and enforce it early in the
CI build step. This should never fail (uh huh) in a PR,
only in one of the update-VM PRs.

Signed-off-by: Ed Santiago <[email protected]>
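
A rough sketch of the kind of early enforcement check this commit describes, assuming a CI_DESIRED_RUNTIME variable exported from .cirrus.yml (the real wiring in the setup scripts may differ):

    # Fail fast if the VM's default runtime is not the one .cirrus.yml asked for.
    if [[ -n "$CI_DESIRED_RUNTIME" ]]; then
        actual=$(podman info --format '{{.Host.OCIRuntime.Name}}')
        if [[ "$actual" != "$CI_DESIRED_RUNTIME" ]]; then
            echo "FATAL: expected OCI runtime $CI_DESIRED_RUNTIME, got $actual" >&2
            exit 1
        fi
    fi
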
@edsantiago
Member Author

Follow-up: criu 3.17.1-2.fc35 fixes the F35 problem, but it's stuck in bodhi. Until it gets into stable, there's nothing we can do. And Ubuntu? Who knows.

@cevich
Member

cevich commented Jul 12, 2022

In case I wasn't clear (and if you don't want to wait weeks), I'm totally okay with a temporary dnf install -y $BODHI_URL in the cache-image packaging scripts. At least, if we can get the Fedoras squared away, adding a skip("TODO: issue https://blah.blah") for Ubuntu tests isn't unprecedented.
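
In the BATS system tests, such a skip might look roughly like this; the issue URL is a placeholder, exactly as in the comment above:

    # Hypothetical guard at the top of a checkpoint test:
    if grep -qi ubuntu /etc/os-release; then
        skip "TODO: checkpoint/restore broken on Ubuntu, see issue https://blah.blah"
    fi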

@mihalicyn

@edsantiago Ed, please let me know if you have any problems with CRIU, and I'll help if help is needed. We haven't changed anything in the CRIU code; this was just a build issue, because CRIU was built against an old version of glibc. From my point of view, after Adrian @adrianreber detected the problem and triggered a CRIU rebuild, everything should start working flawlessly.

What's the problem with Ubuntu?

@edsantiago
Member Author

@mihalicyn thank you! As I wrote in bodhi, the criu F35 build works perfectly. Unfortunately, it's stuck in bodhi limbo.

Ubuntu: my recollection is that podman checkpoint tests failed on Ubuntu also. The error messages give no visibility into the reason, so I (perhaps naïvely) assumed that it was the same glibc problem. I have no way to log into an Ubuntu system, so there's no way for me to investigate further. My plan was: as soon as criu f35 gets into stable, re-run my VM-building PR, then run podman tests on those VMs, and see how things go.

If you have a way to un-stuckify bodhi, such as forcing gating tests to run, that would speed things up - but I've been in the bodhi-hell game before and know how difficult it can be to get it unstuck.

@mihalicyn

thank you!

It's my pleasure!

Ubuntu: my recollection is that podman checkpoint tests failed on Ubuntu also.

hm... I can recall some problem with overlayfs on Ubuntu kernels.
https://bugs.launchpad.net/ubuntu/impish/+source/linux/+bug/1967924

If overlayfs was used in the podman containers, then I'm almost sure it's the same problem.

@cevich
Member

cevich commented Jul 13, 2022

If Ed's speaking of our podman CI tests, those all use the vfs driver, not overlay. Though I believe the system tests would use overlay.
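
One quick way to confirm which storage driver a given environment is actually using (the Go-template path below is the standard podman info field, though it's worth double-checking against the podman version in the images):

    podman info --format '{{.Store.GraphDriverName}}'   # prints e.g. "vfs" or "overlay"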

edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 22, 2022
...and enable the at-test-time confirmation, the one that
double-checks that if CI requests runc we actually use runc.
This exposed a nasty surprise in our setup: there are steps to
define $OCI_RUNTIME, but that's actually a total fakeout!
OCI_RUNTIME is used only in e2e tests, it has no effect
whatsoever on actual podman itself as invoked via command
line such as in system tests. Solution: use containers.conf

Given how fragile all this runtime stuff is, I've also added
new tests (e2e and system) that will check $CI_DESIRED_RUNTIME.

Image source: containers/automation_images#146

Since we haven't actually been testing with runc, we need
to fix a few tests:

  - handle an error-message change (make it work in both crun and runc)
  - skip one system test, "survive service stop", that doesn't
    work with runc and I don't think we care.

...and skip a bunch, filing issues for each:

  - containers#15013 pod create --share-parent
  - containers#15014 timeout in dd
  - containers#15015 checkpoint tests time out under $CONTAINER
  - containers#15017 networking timeout with registry
  - containers#15018 restore --pod gripes about missing --pod
  - containers#15025 run --uidmap broken
  - containers#15027 pod inspect cgrouppath broken
  - ...and a bunch more ("podman pause") that probably don't
    even merit filing an issue.

Also, use /dev/urandom in one test (was: /dev/random) because
the test is timing out and /dev/urandom does not block. (But
the test is still timing out anyway, even with this change)

Also, as part of the VM switch we are now using go 1.18 (up
from 1.17) and this broke the gitlab tests. Thanks to @Luap99
for a quick fix.

Also, slight tweak to containers#15021: include the timeout value, and
reword message so command string is at end.

Also, fixed a misspelling in a test name.

Fixes: containers#14833

Signed-off-by: Ed Santiago <[email protected]>
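
For context, the containers.conf approach the commit describes amounts to something along these lines in the CI setup (a sketch only; the [engine] section and runtime key are the standard containers.conf ones, the rest is illustrative):

    # Force the OCI runtime for every podman invocation, not just the e2e tests:
    printf '[engine]\nruntime = "%s"\n' "$CI_DESIRED_RUNTIME" >> /etc/containers/containers.conf
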
mheon pushed a commit to mheon/libpod that referenced this issue Jul 26, 2022