
[DO NOT MERGE] fix aarch64 cgroup flake #15179

Closed
wants to merge 1 commit

Conversation

cdoern
Contributor

@cdoern cdoern commented Aug 3, 2022

test aarch64 pod cgroups with main

Signed-off-by: Charlie Doern [email protected]

Does this PR introduce a user-facing change?

test aarch64 pod cgroups with main

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed (Enforce release-note requirement, even if just None) and release-note labels and removed the do-not-merge/release-note-label-needed label Aug 3, 2022
@cdoern
Contributor Author

cdoern commented Aug 3, 2022

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Aug 3, 2022
@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Not sure if this will fix it, but this is definitely the culprit.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Different error... promising! @cevich, any idea if cpuset is enabled on aarch64?
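(For reference, on a cgroup v2 host a quick way to answer the cpuset question is to look at what the kernel exposes and what gets delegated to child cgroups; these are generic checks, not something specific to this CI setup:)

    cat /sys/fs/cgroup/cgroup.controllers      # controllers the kernel offers at the root
    cat /sys/fs/cgroup/cgroup.subtree_control  # controllers enabled for child cgroups
    grep cgroup2 /proc/filesystems             # confirm the unified (v2) hierarchy is available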

@edsantiago
Member

The fact that this test is a flake, meaning, it passes sometimes, suggests to me that, whatever "cpuset" is, is enabled or at least enabled-enough to work sometimes.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

The fact that this test is a flake, meaning, it passes sometimes, suggests to me that, whatever "cpuset" is, is enabled or at least enabled-enough to work sometimes.

Good point. I think I figured out the issue with Chris; I just need to patch up a bunch of small holes all over the code that break with parallel usage.

@cevich
Member

cevich commented Aug 3, 2022

Oh goodie, so it was parallel podman runs and test toe-stepping causing this?

I too am mostly cgroup-ignorant, but this wouldn't be the first time I've seen a shared resource cause a race. It's good CI caught this before a user found it 😄

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

@cevich I am sure this will keep failing and I'll have to keep squashing bugs of the same sort throughout the code... but getting there!

@edsantiago
Member

I feel really stupid asking, but, where is the parallel coming from? System tests are single-threaded.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

@edsantiago honestly not sure; @cevich just said that there is a scenario in which these tests run in parallel

@cevich
Member

cevich commented Aug 3, 2022

Oh maybe this is my bad. I thought it was the e2e tests where this was flaking. Those run 3-wide, but Ed's right, the system-tests run one by one.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Oh maybe this is my bad. I thought it was the e2e tests where this was flaking. Those run 3-wide, but Ed's right, the system-tests run one by one.

The fact that these patches still have positive impacts makes me think something weird is happening with the test env. It seems like there is some cleanup process deleting everything in sys/fs/cgroup before it can be used.
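A crude way to test that theory would be to poll the cgroup tree while the suite runs and see whether libpod's cgroups vanish between tests; just a polling sketch, nothing used in this PR:

    # log, once per second, how many libpod cgroups exist under the unified hierarchy
    while sleep 1; do
        printf '%s %s\n' "$(date +%T)" "$(find /sys/fs/cgroup -maxdepth 2 -type d -name 'libpod*' | wc -l)"
    done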

@cevich
Member

cevich commented Aug 4, 2022

some cleanup process deleting everything in sys/fs/cgroup

During the build process we shut down and disable a bunch of systemd services; it's possible a new one was added and/or we missed one:

https://github.com/containers/automation_images/blob/main/systemd_banish.sh#L14
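(For context, the pattern there is to stop, disable, and mask units so they can't fire mid-test; a rough sketch of that pattern follows, with illustrative unit names rather than the script's actual list:)

    # illustrative only -- the real unit list lives in systemd_banish.sh
    for unit in fstrim.timer dnf-makecache.timer; do   # hypothetical names for the sketch
        systemctl stop "$unit"    || true
        systemctl disable "$unit" || true
        systemctl mask "$unit"    || true
    done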

@cdoern
Contributor Author

cdoern commented Aug 4, 2022

Hit a wall here. The permission-denied error makes no sense. @giuseppe, any ideas why this would be happening?

 podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
         # Error: write /sys/fs/cgroup/libpod_parent/42a2ab1a0d6b28ed612fd0407f4587d94dbe3f51670745d1b64e4a42afa2781e/cpuset.cpus: open /sys/fs/cgroup/libpod_parent/42a2ab1a0d6b28ed612fd0407f4587d94dbe3f51670745d1b64e4a42afa2781e/cpuset.cpus: permission denied
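Not an answer, but a few generic cgroup v2 checks worth capturing right after a failure, to narrow down whether this is a delegation, ownership, or labeling problem (run as root):

    CG=/sys/fs/cgroup/libpod_parent
    cat "$CG/cgroup.controllers"       # controllers available to libpod_parent
    cat "$CG/cgroup.subtree_control"   # controllers delegated to its children
    ls -lZ "$CG"                       # ownership and SELinux labels on the interface files
    dmesg | tail                       # any kernel/SELinux noise from the failed write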

@cevich
Member

cevich commented Aug 4, 2022

@cdoern FWIW, we do capture the audit.log for most tasks; in the failed one above, I don't see any denials that appear related:
https://api.cirrus-ci.com/v1/task/5125411410542592/logs/audit_log.log

Though it's possible for them to be suppressed by the policy, that's not normally the case. So I have a bit less faith this is caused by SELinux-related issues, but it's still worth trying since we need to wait for Giuseppe anyway. The image build is nearly done...
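(For completeness, the usual way to dig for denials on the VM itself, including ones hidden by dontaudit rules, would be something like:)

    ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent   # recent denials, if any were recorded
    semodule -DB    # temporarily disable dontaudit rules so suppressed denials show up
    # ...re-run the failing test here...
    semodule -B     # rebuild the policy with dontaudit rules back in place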

@cdoern
Contributor Author

cdoern commented Aug 5, 2022

@cevich these system tests have been running for a few hours...

@cevich
Member

cevich commented Aug 5, 2022

I never got a reply on IRC from @giuseppe on this. Summary of where we're at on the aarch64 system test flakes:

  • Charlie already confirmed that code fixes did not correct the failures in CI; his conclusion is that it must be an environmental problem.
  • Running the tests manually (i.e. via hack/get_ci_vm.sh) as root uses the unconfined_t type (the failures did NOT reproduce there).
  • I checked the previous failure's audit.log but could not find any AVC denials obviously associated with these aarch64 (really EC2) failures.

Summary of changes in aaef571:

  • Minor updates to several packages (haven't checked for significance)
  • Changed the SELinux type of the cloud-init service to unconfined_t in the EC2 VM images. Cloud-init is what ends up downloading/running the Cirrus-CI agent process responsible for executing all our scripts.
  • Previously, CI in our EC2 images ran with cloud_init_t type. GCP has been configured for years to use unconfined_t for the cirrus agent.

So in summary, I'm not super-duper confident in either the SELinux or package updates as "Fixes" for this failure. I can't explain the lack of audit.log entries, but also don't have any idea what the cloud_init_t policy is (maybe it's audit-ignore?).
So if the flake really truly is resolved (through multiple re-runs), maybe the exact cause doesn't matter? Though I'm happy to hear any insights @giuseppe or @rhatdan may have 😁
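(If anyone wants to double-check the execution context on a CI VM, it's a one-liner; the grep pattern below is just illustrative:)

    id -Z                                  # SELinux context of the current shell
    ps -eZ | grep -E 'cloud-init|cirrus'   # context the cloud-init / CI agent processes run under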

@cevich
Member

cevich commented Aug 5, 2022

(I just restarted the aarch64 build task, it was holding up the system-test task)

@cevich
Member

cevich commented Aug 5, 2022

Hrmmm, it seems the aarch64 build task is hanging on re-run. https://cirrus-ci.com/task/5667829994225664
I checked in EC2 and I can see the instance is coming up, but if my changes screwed up cloud-init in some way, that could prevent the agent from starting, leading to a hang. Investigating...

@cevich
Member

cevich commented Aug 5, 2022

...ya, damn, my change broke it. Status shows:

cloud-final.service: Failed at step EXEC spawning /usr/bin/cloud-init: Permission denied

Ugh.

@cevich
Member

cevich commented Aug 5, 2022

-rwxr-xr-x. 1 root root system_u:object_r:cloud_init_exec_t:s0 970 Mar 10 21:25 /usr/bin/cloud-init
# chcon -t unconfined_t /usr/bin/cloud-init
chcon: failed to change context of '/usr/bin/cloud-init' to ‘system_u:object_r:unconfined_t:s0’: Permission denied

Oof, I don't know how to fix the failure of the service to start; it's like the policy won't let this service be unconfined. The unit-file modification is correct AFAIK:

### File generated during VM Image build by fedora_base-setup.sh
[Unit]
Description=Execute cloud user/final scripts
After=network-online.target cloud-config.service rc-local.service
Wants=network-online.target cloud-config.service

[Service]
Type=oneshot
SELinuxContext=unconfined_u:unconfined_r:unconfined_t:s0
ExecStart=/usr/bin/cloud-init modules --mode=final
RemainAfterExit=yes
TimeoutSec=0
KillMode=process
TasksMax=infinity

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target

Anyway... maybe this isn't the cause of the test flake. I still can't explain Ed's point about why this isn't causing other SELinux tests to fail.

@cevich
Member

cevich commented Aug 8, 2022

I played around on these VMs some more, and it seems that if I disable SELinux, I can chcon /usr/bin/cloud-init to system_u:object_r:bin_t:s0. That seems to allow the cloud-final.service to start properly. I'll try doing that for a new set of images, but no guarantee it will even impact this issue 😕
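If the bin_t route works out, note that a plain chcon won't survive a relabel; the persistent version of the same change would look roughly like this (assumes the semanage tool from policycoreutils-python-utils is in the image):

    semanage fcontext -a -t bin_t /usr/bin/cloud-init   # record the label in the local policy
    restorecon -v /usr/bin/cloud-init                   # apply it now
    ls -Z /usr/bin/cloud-init                           # verify: should show bin_t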

@cevich
Member

cevich commented Aug 8, 2022

Whew! I see one of the EC2 VMs successfully booted and ran tests...so we're on a better track this time hopefully 😅

@cevich
Member

cevich commented Aug 9, 2022

@cdoern is this the flake you're chasing?:

not ok 268 pod resource limits
...cut...
         # # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
         # Error: error creating cgroup path /libpod_parent/a34e3d3cfae1ba0c586d1648dd66a77fe9404a6c6d178f26b3e781e84e8b08be: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
         # [ rc=125 (** EXPECTED 0 **) ]

?
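That error is the write to /sys/fs/cgroup/libpod_parent/cgroup.subtree_control failing with "no such file or directory", which would fit the earlier theory that the parent cgroup disappears out from under podman. A quick check right after a failure (generic, not something from this PR) would be:

    if [ -d /sys/fs/cgroup/libpod_parent ]; then
        cat /sys/fs/cgroup/libpod_parent/cgroup.subtree_control
    else
        echo "libpod_parent is gone -- something removed it"
    fi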

@edsantiago
Member

Yes, #15074

@cevich
Member

cevich commented Aug 9, 2022

Okay, so then for sure the SELinux fix had no impact, which isn't all that shocking. So is the "no such file or directory" a red herring then (I see "permission denied" mentioned in this PR and the issue)?

Why the heck do we only see this on EC2 and/or aarch64 VMs 😖

We have an x86_64 image in EC2, built substantially the same way. Is it worth rigging up the task to run with that to try and rule out the architecture vs EC2 as the contributing factor?

@cevich
Member

cevich commented Aug 9, 2022

Re: "Problem does not reproduce under hack/get_ci_vm.sh.

This is even more perplexing now with the SELiunux execution context ruled out. About the only difference now is executing within an interactive login session (get_ci_vm.sh) vs no session (Cirrus-CI).

Random idea (long shot): maybe we should run loginctl enable-linger root before starting the tests?
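(If we do try that, it's a one-liner and easy to verify it took effect:)

    loginctl enable-linger root
    loginctl show-user root -p Linger   # should print Linger=yes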

@cevich
Member

cevich commented Aug 10, 2022

Does anybody know if this problem reproduces on sys remote fedora-36-aarch64 root host? (note: that task currently blocks on sys podman fedora-36-aarch64 root host passing).

This would be an interesting data-point, as those tests all run through ssh (so they have a session) just like get_ci_vm.sh (where it does not reproduce).

@edsantiago
Member

Does anybody know if this problem reproduces on sys remote fedora-36-aarch64 root host?

Is the important part of your question the remote part? If so, this test is N/A on remote, so it's skipped. If the important part is something else, can you please highlight it?

@cevich
Member

cevich commented Aug 10, 2022

If so, this test is N/A on remote, so it's skipped

Yeah, the remote part. Damn. Right, I vaguely remember now (in the e2e tests) we do skip these, don't we.

It's really bugging me that this only seems to reproduce under Cirrus-CI. I believe there are very few differences (besides a bunch of $CIRRUS-blah env. vars.) from the hack/get_ci_vm.sh environment. The SELinux execution type was one, and having a login session (in the hack script) is another. Of course the architecture and cloud environment are also different, but my understanding is that (to date) we've never ever seen this flake on x86_64. I should think the execution timings and/or aarch64 golang threading model would need to be pretty drastically different for that to be a factor 😕

Hmmmm. I'm about out of ideas, including the harebrained ones 😩

@cevich
Member

cevich commented Aug 10, 2022

This may be significant: Poking around under test/e2e, there are a few pod_create and at least one play_kube test that should touch these same code-paths. To my understanding, we've never seen these fail on aarch64 (yet), no?

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Aug 11, 2022
@github-actions

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 12, 2022

@cdoern What's up with this one?

@github-actions github-actions bot removed the stale-pr label Sep 13, 2022
@github-actions

A friendly reminder that this PR had no activity for 30 days.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Oct 13, 2022
@edsantiago
Member

Flake is #15367, the everything-hosed one. I'm no longer restarting that one, because when I do, nobody sees it, and nobody sees how often it's triggering, so nobody is incentivized to fix it. We really, really need that one fixed.

@github-actions github-actions bot removed the stale-pr label Oct 14, 2022
@TomSweeneyRedHat
Member

No action since prior to September. Time to close this?

@edsantiago
Member

@cdoern could you rebase and repush so we can see how the latest CI images behave?

@cdoern
Contributor Author

cdoern commented Nov 14, 2022

Sure @edsantiago will do that tonight

Signed-off-by: Charlie Doern <[email protected]>
@openshift-ci
Contributor

openshift-ci bot commented Nov 15, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cdoern
Once this PR has been reviewed and has the lgtm label, please assign vrothberg for approval by writing /assign @vrothberg in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago
Member

Thanks. That answers that:

# podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
Error: creating cgroup path /libpod_parent/7dfe56eb414ba9ee3f53fbfdde8898d2f8243447cdbc9580e172b16b8baf9654: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory

The bug is still present.

@vrothberg
Member

I am going to close the PR to get it off @cdoern's back.

@vrothberg vrothberg closed this Jan 17, 2023
@github-actions github-actions bot added the locked - please file new issue/PR label (Assist humans wanting to comment on an old issue or PR with locked comments.) Sep 14, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023