
[DO NOT MERGE] fix aarch64 cgroup flake #15179

Closed
wants to merge 1 commit

Conversation

cdoern
Contributor

@cdoern cdoern commented Aug 3, 2022

test aarch64 pod cgroups with main

Signed-off-by: Charlie Doern [email protected]

Does this PR introduce a user-facing change?

test aarch64 pod cgroups with main

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed (Enforce release-note requirement, even if just None) and release-note labels and removed the do-not-merge/release-note-label-needed label Aug 3, 2022
@cdoern
Contributor Author

cdoern commented Aug 3, 2022

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Aug 3, 2022
@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Not sure if this will fix it, but this is definitely the culprit.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Different error... promising! @cevich, any idea if cpuset is enabled on aarch64?
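(For reference, on a cgroup v2 host a quick way to answer the cpuset question is to look at what the kernel exposes and what gets delegated to child cgroups; these are generic checks, not something specific to this CI setup:)

    cat /sys/fs/cgroup/cgroup.controllers      # controllers the kernel offers at the root
    cat /sys/fs/cgroup/cgroup.subtree_control  # controllers enabled for child cgroups
    grep cgroup2 /proc/filesystems             # confirm the unified (v2) hierarchy is available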

@edsantiago
Member

The fact that this test is a flake, meaning, it passes sometimes, suggests to me that, whatever "cpuset" is, is enabled or at least enabled-enough to work sometimes.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

The fact that this test is a flake, meaning, it passes sometimes, suggests to me that, whatever "cpuset" is, is enabled or at least enabled-enough to work sometimes.

Good point. I think I figured out the issue with Chris; I just need to patch up a bunch of small holes all over the code that break with parallel usage.

@cevich
Member

cevich commented Aug 3, 2022

Oh goodie, so it was parallel podman runs and test toe-stepping causing this?

I too am mostly cgroup-ignorant, but this wouldn't be the first time I've seen a shared resource cause a race. It's good CI caught this before a user found it 😄

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

@cevich I am sure this will keep failing and I'll have to keep squashing bugs of the same sort throughout the code... but getting there!

@edsantiago
Member

I feel really stupid asking, but, where is the parallel coming from? System tests are single-threaded.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

@edsantiago honestly not sure; @cevich just said that there is a scenario in which these tests run in parallel

@cevich
Member

cevich commented Aug 3, 2022

Oh maybe this is my bad. I thought it was the e2e tests where this was flaking. Those run 3-wide, but Ed's right, the system-tests run one by one.

@cdoern
Contributor Author

cdoern commented Aug 3, 2022

Oh maybe this is my bad. I thought it was the e2e tests where this was flaking. Those run 3-wide, but Ed's right, the system-tests run one by one.

The fact that these patches still have positive impacts makes me think something weird is happening with the test env. It seems like there is some cleanup process deleting everything in sys/fs/cgroup before it can be used.
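A crude way to test that theory would be to poll the cgroup tree while the suite runs and see whether libpod's cgroups vanish between tests; just a polling sketch, nothing used in this PR:

    # log, once per second, how many libpod cgroups exist under the unified hierarchy
    while sleep 1; do
        printf '%s %s\n' "$(date +%T)" "$(find /sys/fs/cgroup -maxdepth 2 -type d -name 'libpod*' | wc -l)"
    done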

@cevich
Member

cevich commented Aug 4, 2022

some cleanup process deleting everything in sys/fs/cgroup

During the build process we shut down and disable a bunch of systemd services; it's possible a new one was added and/or we missed one:

https://github.com/containers/automation_images/blob/main/systemd_banish.sh#L14
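(For context, the pattern there is to stop, disable, and mask units so they can't fire mid-test; a rough sketch of that pattern follows, with illustrative unit names rather than the script's actual list:)

    # illustrative only -- the real unit list lives in systemd_banish.sh
    for unit in fstrim.timer dnf-makecache.timer; do   # hypothetical names for the sketch
        systemctl stop "$unit"    || true
        systemctl disable "$unit" || true
        systemctl mask "$unit"    || true
    done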

@cdoern
Contributor Author

cdoern commented Aug 4, 2022

Hit a wall here. The permission-denied error makes no sense. @giuseppe, any ideas why this would be happening?

 podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
         # Error: write /sys/fs/cgroup/libpod_parent/42a2ab1a0d6b28ed612fd0407f4587d94dbe3f51670745d1b64e4a42afa2781e/cpuset.cpus: open /sys/fs/cgroup/libpod_parent/42a2ab1a0d6b28ed612fd0407f4587d94dbe3f51670745d1b64e4a42afa2781e/cpuset.cpus: permission denied
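Not an answer, but a few generic cgroup v2 checks worth capturing right after a failure, to narrow down whether this is a delegation, ownership, or labeling problem (run as root):

    CG=/sys/fs/cgroup/libpod_parent
    cat "$CG/cgroup.controllers"       # controllers available to libpod_parent
    cat "$CG/cgroup.subtree_control"   # controllers delegated to its children
    ls -lZ "$CG"                       # ownership and SELinux labels on the interface files
    dmesg | tail                       # any kernel/SELinux noise from the failed write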

@cevich
Member

cevich commented Aug 4, 2022

@cdoern FWIW, we do capture the audit.log for most tasks; in the failed one above, I don't see any denials that appear related:
https://api.cirrus-ci.com/v1/task/5125411410542592/logs/audit_log.log

Though it's possible for them to be suppressed by the policy, that's not normally the case. So I have a bit less faith this is caused by SELinux-related issues, but it's still worth trying since we need to wait for Giuseppe anyway. The image build is nearly done...
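(For completeness, the usual way to dig for denials on the VM itself, including ones hidden by dontaudit rules, would be something like:)

    ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent   # recent denials, if any were recorded
    semodule -DB    # temporarily disable dontaudit rules so suppressed denials show up
    # ...re-run the failing test here...
    semodule -B     # rebuild the policy with dontaudit rules back in place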

@cdoern
Contributor Author

cdoern commented Aug 5, 2022

@cevich these system tests have been running for a few hours...

@cevich
Member

cevich commented Aug 5, 2022

I never got a reply on IRC from @giuseppe on this. Summary of where we're at on the aarch64 system test flakes:

  • Charlie already confirmed that code fixes did not correct the failures in CI; his conclusion is that it must be an environmental problem.
  • Running the tests manually (i.e. via hack/get_ci_vm.sh) as root uses the unconfined_t type (the failures did NOT reproduce there).
  • I checked the previous failure's audit.log but could not find any AVC denials obviously associated with these aarch64 (really EC2) failures.

Summary of changes in aaef571:

  • Minor updates to several packages (haven't checked for significance)
  • Changed the SELinux type of the cloud-init service to unconfined_t in the EC2 VM images. Cloud-init is what ends up downloading/running the Cirrus-CI agent process responsible for executing all our scripts.
  • Previously, CI in our EC2 images ran with cloud_init_t type. GCP has been configured for years to use unconfined_t for the cirrus agent.

So in summary, I'm not super-duper confident in either the SELinux or package updates as "Fixes" for this failure. I can't explain the lack of audit.log entries, but also don't have any idea what the cloud_init_t policy is (maybe it's audit-ignore?).
So if the flake really truly is resolved (through multiple re-runs), maybe the exact cause doesn't matter? Though I'm happy to hear any insights @giuseppe or @rhatdan may have 😁
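(If anyone wants to double-check the execution context on a CI VM, it's a one-liner; the grep pattern below is just illustrative:)

    id -Z                                  # SELinux context of the current shell
    ps -eZ | grep -E 'cloud-init|cirrus'   # context the cloud-init / CI agent processes run under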

@cevich
Member

cevich commented Aug 5, 2022

(I just restarted the aarch64 build task, it was holding up the system-test task)

@cevich
Member

cevich commented Aug 5, 2022

Hrmmm, it seems the aarch64 build task is hanging on re-run. https://cirrus-ci.com/task/5667829994225664
I checked in EC2 and I can see the instance is coming up, but if my changes screwed up cloud-init in some way, that could prevent the agent from starting, leading to a hang. Investigating...

@cevich
Member

cevich commented Aug 5, 2022

...ya, damn, my change broke it. Status shows:

cloud-final.service: Failed at step EXEC spawning /usr/bin/cloud-init: Permission denied

Ugh.

@cevich
Member

cevich commented Aug 5, 2022

-rwxr-xr-x. 1 root root system_u:object_r:cloud_init_exec_t:s0 970 Mar 10 21:25 /usr/bin/cloud-init
# chcon -t unconfined_t /usr/bin/cloud-init
chcon: failed to change context of '/usr/bin/cloud-init' to ‘system_u:object_r:unconfined_t:s0’: Permission denied

Oof, I don't know how to fix the failure of the service to start; it's like the policy won't let this service be unconfined. The unit-file modification is correct AFAIK:

### File generated during VM Image build by fedora_base-setup.sh
[Unit]
Description=Execute cloud user/final scripts
After=network-online.target cloud-config.service rc-local.service
Wants=network-online.target cloud-config.service

[Service]
Type=oneshot
SELinuxContext=unconfined_u:unconfined_r:unconfined_t:s0
ExecStart=/usr/bin/cloud-init modules --mode=final
RemainAfterExit=yes
TimeoutSec=0
KillMode=process
TasksMax=infinity

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target

Anyway... maybe this isn't the cause of the test flake. I still can't explain Ed's point about why this isn't causing other SELinux tests to fail.

@cevich
Member

cevich commented Aug 8, 2022

I played around on these VMs some more, and it seems that if I disable SELinux, I can chcon /usr/bin/cloud-init to system_u:object_r:bin_t:s0. That seems to allow the cloud-final.service to start properly. I'll try doing that for a new set of images, but no guarantee it will even impact this issue 😕
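If the bin_t route works out, note that a plain chcon won't survive a relabel; the persistent version of the same change would look roughly like this (assumes the semanage tool from policycoreutils-python-utils is in the image):

    semanage fcontext -a -t bin_t /usr/bin/cloud-init   # record the label in the local policy
    restorecon -v /usr/bin/cloud-init                   # apply it now
    ls -Z /usr/bin/cloud-init                           # verify: should show bin_t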

@cevich
Member

cevich commented Aug 8, 2022

Whew! I see one of the EC2 VMs successfully booted and ran tests...so we're on a better track this time hopefully 😅

@cevich
Member

cevich commented Aug 9, 2022

@cdoern is this the flake you're chasing?:

not ok 268 pod resource limits
...cut...
         # # podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight-device=/dev/loop0:123 --blkio-weight=50
         # Error: error creating cgroup path /libpod_parent/a34e3d3cfae1ba0c586d1648dd66a77fe9404a6c6d178f26b3e781e84e8b08be: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory
         # [ rc=125 (** EXPECTED 0 **) ]

?
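That error is the write to /sys/fs/cgroup/libpod_parent/cgroup.subtree_control failing with "no such file or directory", which would fit the earlier theory that the parent cgroup disappears out from under podman. A quick check right after a failure (generic, not something from this PR) would be:

    if [ -d /sys/fs/cgroup/libpod_parent ]; then
        cat /sys/fs/cgroup/libpod_parent/cgroup.subtree_control
    else
        echo "libpod_parent is gone -- something removed it"
    fi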

@edsantiago
Member

Yes, #15074

@cevich
Member

cevich commented Aug 9, 2022

Okay, so then for sure the SELinux fix had no impact, which isn't all that shocking. So is the "no such file or directory" a red herring then (I see "permission denied" mentioned in this PR and the issue)?

Why the heck do we only see this on EC2 and/or aarch64 VMs 😖

We have an x86_64 image in EC2, built substantially the same way. Is it worth rigging up the task to run with that to try and rule out the architecture vs EC2 as the contributing factor?

@cevich
Member

cevich commented Aug 9, 2022

Re: "Problem does not reproduce under hack/get_ci_vm.sh.

This is even more perplexing now with the SELiunux execution context ruled out. About the only difference now is executing within an interactive login session (get_ci_vm.sh) vs no session (Cirrus-CI).

Random idea (long shot): maybe we should run loginctl enable-linger root before starting the tests?
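(If we do try that, it's a one-liner and easy to verify it took effect:)

    loginctl enable-linger root
    loginctl show-user root -p Linger   # should print Linger=yes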

@cevich
Member

cevich commented Aug 10, 2022

Does anybody know if this problem reproduces on sys remote fedora-36-aarch64 root host? (note: that task currently blocks on sys podman fedora-36-aarch64 root host passing).

This would be an interesting data-point, as those tests all run through ssh (so they have a session) just like get_ci_vm.sh (where it does not reproduce).

@edsantiago
Member

Does anybody know if this problem reproduces on sys remote fedora-36-aarch64 root host?

Is the important part of your question the remote part? If so, this test is N/A on remote, so it's skipped. If the important part is something else, can you please highlight it?

@cevich
Member

cevich commented Aug 10, 2022

If so, this test is N/A on remote, so it's skipped

Yeah, the remote part. Damn. Right, I vaguely remember now (in the e2e tests) we do skip these, don't we.

It's really bugging me that this only seems to reproduce under Cirrus-CI. I believe there are very few differences (besides a bunch of $CIRRUS-blah env. vars.) from the hack/get_ci_vm.sh environment. The SELinux execution type was one, and having a login session (in the hack script) is another. Of course the architecture and cloud environment are also different, but my understanding is that (to date) we've never ever seen this flake on x86_64. I should think the execution timings and/or aarch64 golang threading model would need to be pretty drastically different for that to be a factor 😕

Hmmmm. I'm about out of ideas, including the harebrained ones 😩

@cevich
Member

cevich commented Aug 10, 2022

This may be significant: Poking around under test/e2e, there are a few pod_create and at least one play_kube test that should touch these same code-paths. To my understanding, we've never seen these fail on aarch64 (yet), no?

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Aug 11, 2022
@github-actions

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 12, 2022

@cdoern What's up with this one?

@github-actions github-actions bot removed the stale-pr label Sep 13, 2022
@github-actions

A friendly reminder that this PR had no activity for 30 days.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Oct 13, 2022
@edsantiago
Member

Flake is #15367, the everything-hosed one. I'm no longer restarting that one, because when I do, nobody sees it, and nobody sees how often it's triggering, so nobody is incentivized to fix it. We really, really need that one fixed.

@github-actions github-actions bot removed the stale-pr label Oct 14, 2022
@TomSweeneyRedHat
Member

No action since prior to September. Time to close this?

@edsantiago
Member

@cdoern could you rebase and repush so we can see how the latest CI images behave?

@cdoern
Contributor Author

cdoern commented Nov 14, 2022

Sure @edsantiago will do that tonight

Signed-off-by: Charlie Doern <[email protected]>
@openshift-ci
Contributor

openshift-ci bot commented Nov 15, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cdoern
Once this PR has been reviewed and has the lgtm label, please assign vrothberg for approval by writing /assign @vrothberg in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago
Member

Thanks. That answers that:

# podman --cgroup-manager=cgroupfs pod create --name=resources-cgroupfs --cpus=5 --memory=5m --memory-swap=1g --cpu-shares=1000 --cpuset-cpus=0 --cpuset-mems=0 --device-read-bps=/dev/loop0:1mb --device-write-bps=/dev/loop0:1mb --blkio-weight=50
Error: creating cgroup path /libpod_parent/7dfe56eb414ba9ee3f53fbfdde8898d2f8243447cdbc9580e172b16b8baf9654: write /sys/fs/cgroup/libpod_parent/cgroup.subtree_control: no such file or directory

The bug is still present.

@vrothberg
Member

I am going to close the PR to get it off @cdoern's back.

@vrothberg vrothberg closed this Jan 17, 2023
@github-actions github-actions bot added the locked - please file new issue/PR label (Assist humans wanting to comment on an old issue or PR with locked comments.) Sep 14, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023